371 |
Computational studies on protein similarity, specificity and design
Sisu, Cristina Smaranda Domnica January 2011
No description available.
|
372 |
StochHMM: A Flexible Hidden Markov Model Framework
Lott, Paul Christian 04 January 2014
In the era of genomics, data analysis models and algorithms that reduce large, complex data sets to meaningful information are integral to furthering our understanding of complex biological systems. Hidden Markov models are one such data analysis technique and have become the basis of many bioinformatics tools. Their relative success is primarily due to their conceptual simplicity and robust statistical foundation. Although hidden Markov models are among the most popular modeling techniques for classifying linear sequences of data, researchers have few software options for rapidly implementing the necessary modeling framework and algorithms. Most tools are still hand-coded because current implementation solutions do not provide the ease or flexibility that researchers need to implement models in non-traditional ways. I have developed a free hidden Markov model C++ library and application, called StochHMM, that gives researchers the flexibility to apply hidden Markov models to unique sequence analysis problems. It lets researchers rapidly implement a model using a simple text file while retaining the flexibility to adapt the model in non-traditional ways. In addition, it provides many features that are not available in any current HMM implementation tools, such as stochastic sampling algorithms, the ability to link user-defined functions into the HMM framework, and multiple ways to integrate additional data sources to make better predictions. Using StochHMM, we have been able to rapidly implement models for R-loop prediction and classification of methylation domains. The R-loop predictions uncovered the epigenetic regulatory role of R-loops at CpG promoters and at the 3' transcription termination of protein-coding genes. Classification of methylation domains in multiple pluripotent tissues identified epigenetic gene tracks that will help inform our understanding of epigenetic diseases.
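For readers unfamiliar with how a hidden Markov model classifies a linear sequence, the following is a minimal sketch of Viterbi decoding on a toy two-state model. It is plain Python and does not use or reproduce the StochHMM library or its model file format. The state names, probabilities, and the `viterbi` function are illustrative assumptions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation sequence."""
    # Log probabilities avoid numerical underflow on long sequences.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        backpointers.append({})
        for s in states:
            # Best predecessor of state s at position t.
            prev = max(states,
                       key=lambda p: scores[t - 1][p] + math.log(trans_p[p][s]))
            scores[t][s] = (scores[t - 1][prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][observations[t]]))
            backpointers[t][s] = prev
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = backpointers[t][state]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: GC-rich "island" versus AT-rich "background".
STATES = ("background", "island")
START = {"background": 0.5, "island": 0.5}
TRANS = {"background": {"background": 0.9, "island": 0.1},
         "island": {"background": 0.1, "island": 0.9}}
EMIT = {"background": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

print(viterbi("AAAACGCGAAAA", STATES, START, TRANS, EMIT))
```

Viterbi returns a single best path. The stochastic sampling algorithms mentioned in the abstract instead draw paths in proportion to their probability, which this sketch does not attempt.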
|
373 |
Computational detection of tissue-specific cis-regulatory modules
Chen, Xiaoyu, 1974- January 2006
A cis-regulatory module (CRM) is a DNA region of a few hundred base pairs that consists of a cluster of several transcription factor binding sites and regulates the expression of a nearby gene. This thesis presents a new computational approach to CRM detection. / It is believed that tissue-specific CRMs tend to regulate nearby genes in a certain tissue and that they consist of binding sites for transcription factors (TFs) that are also expressed in that tissue. These facts allow us to make use of tissue-specific gene expression data to detect tissue-specific CRMs and improve the specificity of module prediction. / We build a Bayesian network to integrate the sequence information about TF binding sites and the expression information about TFs and regulated genes. The network is then used to infer whether a given genomic region indeed has regulatory activity in a given tissue. A novel EM algorithm incorporating probability tree learning is proposed to train the Bayesian network in an unsupervised way. A new probability tree learning algorithm is developed to learn the conditional probability distribution for a variable in the network that has a large number of hidden variables as its parents. / Our approach is evaluated using biological data, and the results show that it is able to correctly discriminate among human liver-specific modules, erythroid-specific modules, and negative-control regions, even though no prior knowledge about the TFs and the target genes is employed in our algorithm. On a genome-wide scale, our network is trained to identify tissue-specific CRMs in ten tissues. Some known tissue-specific modules are rediscovered, and a set of novel modules is predicted to be related to tissue-specific expression.
|
374 |
Comparative genomics of the Mycobacterium tuberculosis complex
Mostowy, Serge. January 2005
The study of microbial evolution has been recently accelerated by the advent of comparative genomics, an approach enabling investigation of organisms at the whole-genome level. Tools of comparative genomics, including the DNA microarray, have been applied to bacterial genomes to study heterogeneity in DNA content and to monitor global gene expression. When focused upon the study of microbial pathogens, genome analysis has provided unprecedented insight into their evolution, virulence, and host adaptation. Contributing towards this, I herein explore the evolutionary change affecting genomes of the Mycobacterium tuberculosis complex (MTC), a group of closely related bacterial organisms responsible for causing tuberculosis (TB) across a diverse range of mammals. Despite the introduction nearly a century ago of BCG, a family of live attenuated vaccines intended to prevent human TB, the uncertainty surrounding its usefulness is punctuated by the reality that TB continues to claim over 2 million lives per year. As pursued throughout this thesis, a precise understanding of the differences in genomic content among the MTC, and their impact on gene expression and biological function, promises to expose underlying mechanisms of TB pathogenesis and to suggest rational approaches towards the design of improved diagnostics and vaccines to prevent disease. / With the availability of whole-genome sequence data and tools of comparative genomics, our publications have advanced the recognition that large sequence polymorphisms (LSPs) deleted from Mycobacterium tuberculosis, the causative agent of TB in humans, serve as accurate markers for molecular epidemiologic assessment and phylogenetic analysis. These LSPs have proven informative both for the types of genes that vary between strains, and for the molecular signatures that characterize different MTC members.
Genomic analysis of atypical MTC has revealed their diversity and adaptability, illuminating previously unexpected directions of MTC evolution. As demonstrated from parallel analysis of BCG vaccines, a phylogenetic stratification of genotypes offers a predictive framework upon which to base future genetic and phenotypic studies of the MTC. Overall, the work presented in this thesis has provided unique insights and lessons having direct clinical relevance towards understanding TB pathogenesis and BCG vaccination.
|
375 |
In silico approaches to investigating mechanisms of gene regulation
Ho Sui, Shannan Janelle 05 1900
Identification and characterization of regions influencing the precise spatial and temporal expression of genes is critical to our understanding of gene regulatory networks. Connecting transcription factors to the cis-regulatory elements that they bind and regulate remains a challenging problem in computational biology. The rapid accumulation of whole genome sequences and genome-wide expression data, and advances in alignment algorithms and motif-finding methods, provide opportunities to tackle the important task of dissecting how genes are regulated.
Genes exhibiting similar expression profiles are often regulated by common transcription factors. We developed a method for identifying statistically over-represented regulatory motifs in the promoters of co-expressed genes using weight matrix models representing the specificity of known factors. Application of our methods to yeast fermenting in grape must revealed elements that play important roles in utilizing carbon sources. Extension of the method to metazoan genomes via incorporation of comparative sequence analysis facilitated identification of functionally relevant binding sites for sets of tissue-specific genes, and for genes showing similar expression in large-scale expression profiling studies. Further extensions address alternative promoters for human genes and coordinated binding of multiple transcription factors to cis-regulatory modules.
Sequence conservation reveals segments of genes of potential interest, but the degree of sequence divergence among human genes and their orthologous sequences varies widely. Genes with a small number of well-distinguished, highly conserved non-coding elements proximal to the transcription start site may be well-suited for targeted laboratory promoter characterization studies. We developed a “regulatory resolution” score to prioritize lists of genes for laboratory gene regulation studies based on the conservation profile of their promoters. Additionally, genome-wide comparisons of vertebrate genomes have revealed surprisingly large numbers of highly conserved non-coding elements (HCNEs) that cluster near genes associated with transcription and development. To further our understanding of the genomic organization of regulatory regions, we developed methods to identify HCNEs in insects. We find that HCNEs in insects are similar in function and organization to their vertebrate counterparts. Our data suggest that microsynteny in insects has been retained to keep large arrays of HCNEs intact, forming genomic regulatory blocks that surround the key developmental genes they regulate.
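The weight matrix models mentioned in this abstract can be illustrated with a minimal Python sketch that converts per-position base counts into log-odds scores and scans a sequence for its best-scoring window. This is a generic illustration, not the thesis's method: the function names, the uniform background, the pseudocount, and the toy counts are all assumptions, and the statistical over-representation test across co-expressed promoters is not shown.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def best_hit(sequence, matrix):
    """Return (start, score) of the highest-scoring window, or None if the sequence is too short."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[j][sequence[start + j]] for j in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical counts for a 4-bp motif that strongly prefers T-A-T-A.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_matrix = log_odds_matrix(toy_counts)
print(best_hit("GGCTATAGGC", toy_matrix))
```

In practice a score threshold would be applied to each window, and the number of hits in the co-expressed set would be compared against a background set of promoters.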
|
376 |
Robust genotype classification using dynamic variable selection
Podder, Mohua 11 1900
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C or G) is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay, based on arrayed primer extension (APEX), that consists of multiple probes with signals from multiple channels for a single SNP. The strength of this platform is its redundancy: multiple probes for a single SNP. Using this microarray platform, we have developed fully automated genotype calling algorithms based on linear models for individual probe signals and using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotype.
Our proposed classification model achieved an accuracy level of >99.4% with a 100% call rate for the SNP genotype data, which is comparable with existing genotyping technologies. We discuss the appropriateness of the proposed model relative to other existing high-throughput genotype calling algorithms.
In this thesis we have explored three new ideas for classification with high dimensional data: (1) ensembles of various sets of predictors with built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s).
We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system considering all the underlying probe features of a particular SNP, automatically down-weighting any ‘bad data’ corresponding to image artifacts on the microarray slide or failure of a specific chemistry.
Though motivated by this genotyping application, the proposed methodology would apply to other classification problems where the explanatory variables fall naturally into groups, or where outliers in the explanatory variables require variable selection at the prediction stage for robustness.
|
377 |
Low-level variant detection in human mitochondrial DNA using the Illumina® MiSeq™ next-generation sequencing (NGS) platform
Smith, Brandon Chase 07 June 2013
When challenged by difficult biological samples, the forensic analyst is far more likely to obtain useful data by sequencing the human mitochondrial DNA (mtDNA). Next-generation sequencing (NGS) technologies are currently being evaluated by the Forensic Science Program at Western Carolina University for their ability to reliably detect low-level variants in mixtures of mtDNA. The sequence profiles for twenty individuals were obtained by sequencing amplified DNA derived from the mitochondrial hypervariable (HV) regions using Sanger methods. Two-person mixtures were then constructed by mixing quantified templates, simulating heteroplasmy at discrete sites and in defined ratios. Libraries of unmixed samples, artificial mixtures, and instrument controls were prepared using Illumina® Nextera® XT and deep-sequenced on the Illumina® MiSeq™. Analysis of NGS data using a novel bioinformatics pipeline indicated that minor variants could be detected at the 5%, 2%, 1%, and 0.5% levels. Additional experiments that examined the occurrence of sequence variation in hair tissue demonstrate that a considerable amount of sequence variation can exist between hairs and other tissues derived from a single donor.
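As a rough illustration of what detecting a minor variant at a given level means, the Python sketch below reports non-major bases at a single site whose read fraction meets a threshold. It is not the bioinformatics pipeline described in the abstract. The function name, the 0.5% default threshold, the minimum depth, and the read counts are assumptions chosen for illustration, and sequencing error is ignored.

```python
def minor_variants(base_counts, threshold=0.005, min_depth=1000):
    """List (base, frequency) for non-major bases at or above `threshold` at one site."""
    depth = sum(base_counts.values())
    if depth < min_depth:
        return []  # too few reads to make a call at this site
    major = max(base_counts, key=base_counts.get)
    return [(base, count / depth)
            for base, count in sorted(base_counts.items())
            if base != major and count / depth >= threshold]


# Hypothetical read counts at one site of a two-person mixture (1% minor contributor).
print(minor_variants({"A": 9900, "G": 100}))
```

A real caller would also need an error model, since instrument and amplification errors can produce apparent variants near the lowest detection levels.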
|
378 |
Remote Homology Detection in Proteins Using Graphical Models
Daniels, Noah Manus 24 July 2013
Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.
We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.
Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.
We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.
|
379 |
Distributed and multiphase inference in theory and practice: Principles, modeling, and computation for high-throughput science
Blocker, Alexander W. 10 August 2013
The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges.
In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion. We use this to construct a calibrated Bayesian method to detect local concentrations of nucleosome positions. Inference is carried out via a distributed HMC algorithm that scales linearly in complexity with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire S. cerevisiae genome in less than 1 hour on EC2.
We then present a method for absolute quantitation from LC-MS/MS proteomics experiments in chapter 2. We present a Bayesian model for the non-ignorable missing data mechanism induced by this technology, which includes an unusual combination of censoring and truncation. We provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses using cluster computing environments. A set of simulation studies and actual experiments demonstrate this approach's validity and utility.
We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing.
|
380 |
A Global Mapping of Protein Complexes in S. cerevisiae
Vlasblom, James 13 August 2013
Systematic identification of protein-protein interactions (PPIs) on a genome scale has become an important focus of biology, as the majority of cellular functions are mediated by these interactions. Several high throughput experimental techniques have emerged as effective tools for querying the protein-protein interactome and can be broadly categorized into those that detect direct, physical protein-protein interactions and those that yield information on the composition of protein complexes. Tandem affinity purification followed by mass spectrometry (TAP/MS) is an example of the latter that identifies proteins that co-purify with a given tagged query (bait) protein.
Though TAP/MS enables these co-complexed associations to be identified on a proteome scale, the amount of data generated by the systematic querying of thousands of proteins can be extremely large. Data from multiple purifications are combined to form a very large network of proteins linked by edges whenever the corresponding pairs might form an association. Only a fraction of these pairwise associations correspond to physical interactions, however, and further computational analysis is necessary to filter out non-specific associations.
This thesis examines how differing computational procedures for the analysis of TAP/MS data can affect the final PPI network, and outlines a procedure to accurately identify protein complexes from data consolidated from multiple proteome-scale TAP/MS experiments in the budding yeast Saccharomyces cerevisiae. In collaboration with the Greenblatt and Emili laboratories at the University of Toronto, this methodology was extended to yeast membrane proteins to derive a comprehensive network of 13,343 PPIs and 720 protein complexes spanning both membrane and non-membrane proteins.
|