81. StochHMM: A Flexible Hidden Markov Model Framework. Lott, Paul Christian. 04 January 2014.
In the era of genomics, data analysis models and algorithms that reduce large, complex data sets into meaningful information are integral to furthering our understanding of complex biological systems. Hidden Markov models are one such technique and have become the basis of many bioinformatics tools; their relative success is primarily due to their conceptual simplicity and robust statistical foundation. Although hidden Markov models are among the most popular modeling techniques for classifying linear sequences of data, researchers have few software options for rapidly implementing the necessary modeling framework and algorithms. Most tools are still hand-coded because current implementations do not provide the ease or flexibility researchers need to implement models in non-traditional ways. I have developed a free hidden Markov model C++ library and application, called StochHMM, that gives researchers the flexibility to apply hidden Markov models to unique sequence analysis problems. It allows a model to be implemented rapidly from a simple text file while retaining the flexibility to adapt the model in non-traditional ways. In addition, it provides features not available in other HMM implementation tools, such as stochastic sampling algorithms, the ability to link user-defined functions into the HMM framework, and multiple ways to integrate additional data sources to make better predictions. Using StochHMM, we have rapidly implemented models for R-loop prediction and for classification of methylation domains. The R-loop predictions uncovered the epigenetic regulatory role of R-loops at CpG promoters and at transcription termination regions 3' of protein-coding genes. Classification of methylation domains in multiple pluripotent tissues identified epigenetic gene tracks that will help inform our understanding of epigenetic disease.
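As an illustration of the stochastic sampling this abstract highlights, the sketch below runs forward filtering followed by backward sampling of state paths on a toy two-state HMM. The states, parameters, and sequence are invented for illustration; this is not StochHMM's API or model format.

```python
import numpy as np

# Toy two-state HMM; states, parameters, and sequence are invented for
# illustration and do not correspond to StochHMM's model format or API.
states = ["background", "feature"]
start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])             # transition probabilities
emit = np.array([[0.25, 0.25, 0.25, 0.25],   # background: uniform over A,C,G,T
                 [0.40, 0.10, 0.10, 0.40]])  # feature: AT-rich
alphabet = {"A": 0, "C": 1, "G": 2, "T": 3}

def sample_paths(seq, n_samples=3, seed=0):
    """Forward filtering followed by backward sampling of state paths."""
    rng = np.random.default_rng(seed)
    obs = [alphabet[c] for c in seq]
    n, k = len(obs), len(states)
    fwd = np.zeros((n, k))
    fwd[0] = start * emit[:, obs[0]]
    fwd[0] /= fwd[0].sum()                   # rescale to avoid underflow
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] @ trans) * emit[:, obs[t]]
        fwd[t] /= fwd[t].sum()
    paths = []
    for _ in range(n_samples):
        path = [rng.choice(k, p=fwd[-1])]
        for t in range(n - 2, -1, -1):       # sample each state given its successor
            w = fwd[t] * trans[:, path[-1]]
            path.append(rng.choice(k, p=w / w.sum()))
        paths.append([states[s] for s in reversed(path)])
    return paths

for p in sample_paths("ATATTAGCGCGC"):
    print(p)
```

Sampling many such paths, rather than reporting only the single Viterbi path, is what lets downstream analyses attach uncertainty to predicted features.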
82. Computational detection of tissue-specific cis-regulatory modules. Chen, Xiaoyu (1974-). January 2006.
A cis-regulatory module (CRM) is a DNA region of a few hundred base pairs that contains a cluster of transcription factor binding sites and regulates the expression of a nearby gene. This thesis presents a new computational approach to CRM detection.

It is believed that tissue-specific CRMs tend to regulate nearby genes in a particular tissue and that they consist of binding sites for transcription factors (TFs) that are also expressed in that tissue. These observations allow us to use tissue-specific gene expression data to detect tissue-specific CRMs and improve the specificity of module prediction.

We build a Bayesian network to integrate sequence information about TF binding sites with expression information about TFs and regulated genes. The network is then used to infer whether a given genomic region has regulatory activity in a given tissue. A novel EM algorithm incorporating probability tree learning is proposed to train the Bayesian network in an unsupervised way. A new probability tree learning algorithm is developed to learn the conditional probability distribution of a network variable that has a large number of hidden variables as its parents.

Our approach is evaluated on biological data, and the results show that it correctly discriminates among human liver-specific modules, erythroid-specific modules, and negative-control regions, even though no prior knowledge about the TFs and target genes is used by the algorithm. On a genome-wide scale, the network is trained to identify tissue-specific CRMs in ten tissues. Some known tissue-specific modules are rediscovered, and a set of novel modules is predicted to be associated with tissue-specific expression.
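The unsupervised training idea can be illustrated with a minimal EM loop in which a latent indicator marks whether a region is a CRM, and a two-feature naive Bayes model stands in for the full Bayesian network and probability tree learning described above. The features (a motif-cluster score and a TF/target co-expression score) and all parameters are invented.

```python
import numpy as np

# Unsupervised EM for a latent "is-CRM" indicator.  A two-feature naive Bayes
# model stands in for the full Bayesian network; the features (a motif-cluster
# score and a TF/target co-expression score) and all parameters are invented.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 1.5], 0.5, (40, 2)),    # simulated CRM-like regions
               rng.normal([0.0, 0.0], 0.5, (160, 2))])  # simulated background regions

pi = 0.5                                  # prior P(region is a CRM)
mu = np.array([[1.0, 1.0], [0.0, 0.0]])   # class means: [CRM, background]
sigma = 1.0

def loglik(x, m):
    return -0.5 * np.sum((x - m) ** 2, axis=1) / sigma ** 2

for _ in range(50):
    # E-step: posterior responsibility that each region is regulatory.
    a = np.log(pi) + loglik(X, mu[0])
    b = np.log(1.0 - pi) + loglik(X, mu[1])
    r = 1.0 / (1.0 + np.exp(b - a))
    # M-step: update the mixing weight and the class means.
    pi = r.mean()
    mu[0] = (r[:, None] * X).sum(axis=0) / r.sum()
    mu[1] = ((1.0 - r)[:, None] * X).sum(axis=0) / (1.0 - r).sum()

print(f"estimated CRM fraction: {pi:.2f}")     # true simulated fraction is 0.20
print("estimated class means:", np.round(mu, 2))
```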
83. Comparative genomics of the Mycobacterium tuberculosis complex. Mostowy, Serge. January 2005.
The study of microbial evolution has recently been accelerated by the advent of comparative genomics, an approach that enables investigation of organisms at the whole-genome level. Tools of comparative genomics, including the DNA microarray, have been applied to bacterial genomes to study heterogeneity in DNA content and to monitor global gene expression. Focused on microbial pathogens, genome analysis has provided unprecedented insight into their evolution, virulence, and host adaptation. Contributing to this effort, I explore the evolutionary changes affecting genomes of the Mycobacterium tuberculosis complex (MTC), a group of closely related bacteria responsible for causing tuberculosis (TB) across a diverse range of mammals. Despite the introduction nearly a century ago of BCG, a family of live attenuated vaccines intended to prevent human TB, uncertainty surrounds its usefulness, and TB continues to claim over 2 million lives per year. As pursued throughout this thesis, a precise understanding of the differences in genomic content among the MTC, and of their impact on gene expression and biological function, promises to expose underlying mechanisms of TB pathogenesis and to suggest rational approaches to the design of improved diagnostics and vaccines.

With the availability of whole-genome sequence data and tools of comparative genomics, our publications have advanced the recognition that large sequence polymorphisms (LSPs) deleted from Mycobacterium tuberculosis, the causative agent of TB in humans, serve as accurate markers for molecular epidemiologic assessment and phylogenetic analysis. These LSPs have proven informative both for the types of genes that vary between strains and for the molecular signatures that characterize different MTC members. Genomic analysis of atypical MTC has revealed their diversity and adaptability, illuminating previously unexpected directions of MTC evolution. As demonstrated by parallel analysis of BCG vaccines, a phylogenetic stratification of genotypes offers a predictive framework on which to base future genetic and phenotypic studies of the MTC. Overall, the work presented in this thesis provides insights with direct clinical relevance for understanding TB pathogenesis and BCG vaccination.
84. Low-level variant detection in human mitochondrial DNA using the Illumina® MiSeq™ next-generation sequencing (NGS) platform. Smith, Brandon Chase. 07 June 2013.
When challenged by difficult biological samples, the forensic analyst is far more likely to obtain useful data by sequencing the human mitochondrial DNA (mtDNA). Next-generation sequencing (NGS) technologies are currently being evaluated by the Forensic Science Program at Western Carolina University for their ability to reliably detect low-level variants in mixtures of mtDNA. Sequence profiles for twenty individuals were obtained by Sanger sequencing of amplified DNA derived from the mitochondrial hypervariable (HV) regions. Two-person mixtures were then constructed by combining quantified templates, simulating heteroplasmy at discrete sites and in defined ratios. Libraries of unmixed samples, artificial mixtures, and instrument controls were prepared using Illumina® Nextera® XT and deep-sequenced on the Illumina® MiSeq™. Analysis of the NGS data with a novel bioinformatics pipeline showed that minor variants could be detected at the 5%, 2%, 1%, and 0.5% levels. Additional experiments examining sequence variation in hair demonstrated that considerable variation can exist between hairs and other tissues from a single donor.
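A minimal sketch of the minor-variant detection step is shown below: given per-position base counts, it reports sites whose second most frequent base exceeds a frequency threshold. The pileup counts, positions, and thresholds are invented; the thesis's actual pipeline is not reproduced here.

```python
from collections import Counter

# Minor-variant calling from per-position base counts.  The pileup below is
# invented; a real pipeline would tally bases from aligned MiSeq reads
# (e.g. via pysam) before this step.
pileup = {
    16189: Counter({"T": 4880, "C": 97, "A": 12, "G": 11}),   # ~2% minor C allele
    16223: Counter({"C": 5010, "T": 6, "G": 4}),              # sequencing noise only
}

def call_minor_variants(pileup, threshold=0.005, min_depth=1000):
    """Report sites whose second most frequent base exceeds the threshold."""
    calls = []
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        (major, n1), (minor, n2) = counts.most_common(2)
        freq = n2 / depth
        if freq >= threshold:
            calls.append((pos, major, minor, round(freq, 4)))
    return calls

print(call_minor_variants(pileup))   # [(16189, 'T', 'C', 0.0194)]
```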
85. Remote Homology Detection in Proteins Using Graphical Models. Daniels, Noah Manus. 24 July 2013.
Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence alone poses a challenge.

We first consider all proteins of known three-dimensional structure and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.

Next, we return to homology prediction based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.

We show two different approaches to approximating this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field but approximates the solution through stochastic search. Both methods improve on the state of the art in remote homology detection for beta-structural protein folds.
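The stochastic-search approximation can be illustrated with a toy simulated-annealing routine that searches over placements of a query segment onto a beta-strand template, including a single pairwise term of the kind that makes exact dynamic programming intractable. The sequence, template layout, and scoring below are invented and are not the authors' published model.

```python
import math
import random

# Toy simulated annealing over placements of a query segment onto one template
# strand, with a single pairwise coupling to a partner strand.  The sequence,
# template layout, and scoring are invented and are not the published model.
random.seed(0)
query = "MKVAILTGAGSGIGLEAARQFLA"
strand_len = 6
pair_offset = 10                     # template pairs column i with column i + 10
hydrophobic = set("AVLIMFWYC")

def score(start):
    """Hydrophobicity of the placed strand plus one inter-strand pair term."""
    strand = query[start:start + strand_len]
    s = sum(res in hydrophobic for res in strand)
    partner = query[start + pair_offset:start + pair_offset + strand_len]
    if len(partner) == strand_len:   # pairwise term, as an MRF edge would add
        s += sum(a in hydrophobic and b in hydrophobic
                 for a, b in zip(strand, partner))
    return s

def anneal(steps=500, t0=3.0):
    cur = random.randrange(len(query) - strand_len + 1)
    cur_s = best_s = score(cur)
    best = cur
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-3
        cand = min(len(query) - strand_len,
                   max(0, cur + random.choice([-2, -1, 1, 2])))
        cand_s = score(cand)
        if cand_s >= cur_s or random.random() < math.exp((cand_s - cur_s) / temp):
            cur, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = cur, cur_s
    return best, best_s

print(anneal())   # (best start position found, its score)
```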
86. Distributed and multiphase inference in theory and practice: Principles, modeling, and computation for high-throughput science. Blocker, Alexander W. 10 August 2013.
The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and to new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges.

In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion and use it to construct a calibrated Bayesian method for detecting local concentrations of nucleosome positions. Inference is carried out via a distributed HMC algorithm whose complexity scales linearly with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, stand-alone and on Amazon EC2, which can deliver inferences on an entire S. cerevisiae genome in less than one hour on EC2.

In chapter 2, we present a method for absolute quantitation from LC-MS/MS proteomics experiments. We develop a Bayesian model for the non-ignorable missing-data mechanism induced by this technology, which includes an unusual combination of censoring and truncation, and provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses in cluster computing environments. A set of simulation studies and actual experiments demonstrate the approach's validity and utility.

We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We demonstrate that multiphase inferences can, in some cases, surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing.
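A minimal single-chain HMC update, of the kind the nucleosome-positioning sampler builds on, is sketched below for a standard normal target. This only illustrates the leapfrog-plus-Metropolis step; the dissertation's distributed, genome-scale implementation is far more involved, and the target and tuning values here are invented.

```python
import numpy as np

# Basic HMC with a leapfrog integrator on a 5-dimensional standard normal
# target; a stand-in target, not the dissertation's nucleosome model.
rng = np.random.default_rng(0)

def log_post(theta):               # stand-in log posterior
    return -0.5 * np.sum(theta ** 2)

def grad_log_post(theta):
    return -theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    p = rng.standard_normal(theta.shape)            # sample momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_post(theta_new)   # initial half step
    for _ in range(n_leapfrog):
        theta_new += eps * p_new
        p_new += eps * grad_log_post(theta_new)
    p_new -= 0.5 * eps * grad_log_post(theta_new)   # make the last update a half step
    # Metropolis accept/reject on the joint (position, momentum) energy.
    log_alpha = (log_post(theta_new) - 0.5 * p_new @ p_new) \
              - (log_post(theta) - 0.5 * p @ p)
    return theta_new if np.log(rng.uniform()) < log_alpha else theta

theta = np.zeros(5)
samples = []
for _ in range(2000):
    theta = hmc_step(theta)
    samples.append(theta)
print("posterior sd ~", np.std(samples, axis=0).round(2))   # expect values near 1.0
```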
87. Clustering time-course gene-expression array data. Gershman, Jason Andrew. January 2008.
This thesis examines methods for clustering time-course gene expression array data. In the past decade, various model-based methods have been published and advocated for clustering this type of data in place of classic non-parametric techniques such as K-means and hierarchical clustering. On simulated data, where the variance between clusters is large, I show that the model-based MCLUST outperforms both model-based SSClust and non-model-based K-means clustering. I also show that neither the number of genes nor the number of clusters has a significant effect on the performance of these model-based clustering techniques. On two real data sets, where the variance between clusters is smaller, I show that model-based SSClust outperforms both MCLUST and K-means clustering. Since the "truth" is often unknown for real data sets, I use the clustered data as "truth", perturb it by adding pointwise noise, and then cluster the noisy data. Throughout my analysis of real and simulated expression data, I use the misclassification rate and the overall success rate as measures of a clustering algorithm's success. Overall, the model-based methods cluster the data better than the non-model-based methods.
Later, I examine the role of gene ontology (GO) and the use of gene ontology data in clustering gene expression data. I find that clustering expression data using a synthesis of gene expression and gene ontology not only gives the clusters biological meaning but also clusters the data well. I also introduce an algorithm for clustering expression profiles on both gene expression and gene ontology data when some genes are missing ontology data. Instead of ignoring the missing data or lumping it into a miscellaneous cluster, as some other methods do, I use classification and inferential techniques to cluster with all of the available data, and this method shows promising results. I also examine which ontology, among molecular function, biological process, and cellular component, is best for clustering expression data; this analysis shows that biological process is the preferred ontology.
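A small comparison in the spirit of the simulations described above is sketched below, using scikit-learn's GaussianMixture as a stand-in for model-based clustering (MCLUST itself is an R package) and the adjusted Rand index in place of the thesis's misclassification rate. The simulated time-course profiles are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Simulate 150 genes x 8 time points from three expression-profile shapes.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8)
shapes = [np.sin(2 * np.pi * t), t - 0.5, -t + 0.5]   # three cluster templates
truth = np.repeat([0, 1, 2], 50)
X = np.array([shapes[g] + rng.normal(0, 0.3, t.size) for g in truth])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Adjusted Rand index is invariant to label permutations, so no cluster matching is needed.
print("K-means ARI:         ", round(adjusted_rand_score(truth, km_labels), 2))
print("Gaussian mixture ARI:", round(adjusted_rand_score(truth, gm_labels), 2))
```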
88. A Bayesian hierarchical model for detecting associations between haplotypes and disease using unphased SNPs. Fox, Garrett Reed. January 2008.
This thesis addresses the use of haplotypes to detect disease-predisposing chromosomal regions with a Bayesian hierarchical model for case-control data. By utilizing the Stochastic Search Variable Selection (SSVS) procedure of George and McCulloch (1997), the number of parameters is not constrained by the sample size, as it is in frequentist methods. Haplotype information is used in the form of estimated haplotype frequencies, which are treated in the model as if they were the true population frequencies. A Bayesian hierarchical probit model was developed by estimating the distribution of haplotype pairs for an individual from these estimated population frequencies and using SSVS for model selection. To date, Bayesian models for haplotype-based case-control data assume either that the haplotypes are known or that haplotypes can be clustered such that every haplotype within a cluster has the same effect on disease status. A simulation study analyzed the testing properties of this Bayesian model and compared it to a popular frequentist method (Schaid, 2002). Both real genotype data from the Dallas Heart Study (DHS) and simulated data were used to study the operating characteristics of the new model. The Bayesian method is shown to have higher power than Schaid's frequentist method when there are a limited number of common haplotypes in a region, a situation that appears to be common (Gabriel, 2002). An approach based on the maximum of chi-squared statistics at each marker locus performed surprisingly well against both haplotype methods in various cases. These simulations contribute to the ongoing debate on the efficacy of haplotype methods. The most surprising result was the ability of the genotype methods to outperform the haplotype methods in various instances involving cis-acting interactions; the Bayesian haplotype method compared more favorably when dealing with low penetrance in highly conserved blocks. Additionally, a set of simulations was based on a number of genes from the DHS data set with multiple haplotype block regions, demonstrating the similarities of the haplotype methods and the added flexibility gained by analyzing posterior distributions. We also demonstrate that interactions between loci in separate blocks can be detected without including interaction terms in the regression model. Future work should focus on more efficient methods of detecting these and other complex interactions.
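The SSVS idea can be sketched with a small Gibbs sampler for a linear model with spike-and-slab priors on the coefficients, as a simplified stand-in for the thesis's hierarchical probit haplotype model. The simulated dosage covariates, effect sizes, and hyperparameters below are invented for illustration.

```python
import numpy as np

# Spike-and-slab (SSVS-style) Gibbs sampler for a linear model with unit error
# variance; a simplified stand-in for the hierarchical probit haplotype model.
rng = np.random.default_rng(0)
n, p = 300, 8
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # simulated allele dosages
beta_true = np.array([0.8, 0.0, 0.0, -0.6, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(0.0, 1.0, n)

tau, c = 0.05, 10.0             # spike sd and slab multiplier
gamma = np.ones(p, dtype=int)   # inclusion indicators
incl = np.zeros(p)
XtX, Xty = X.T @ X, X.T @ y
burn, iters = 500, 2000

for it in range(iters):
    # Sample beta | gamma, y from its multivariate normal full conditional.
    prior_prec = np.where(gamma == 1, 1.0 / (c * tau) ** 2, 1.0 / tau ** 2)
    V = np.linalg.inv(XtX + np.diag(prior_prec))
    V = 0.5 * (V + V.T)                        # symmetrize for numerical stability
    beta = rng.multivariate_normal(V @ Xty, V)
    # Sample each gamma_j | beta_j: slab vs spike density, prior P(gamma_j = 1) = 0.5.
    log_slab = -0.5 * (beta / (c * tau)) ** 2 - np.log(c * tau)
    log_spike = -0.5 * (beta / tau) ** 2 - np.log(tau)
    p_incl = 1.0 / (1.0 + np.exp(log_spike - log_slab))
    gamma = (rng.uniform(size=p) < p_incl).astype(int)
    if it >= burn:
        incl += gamma

print("posterior inclusion probabilities:", np.round(incl / (iters - burn), 2))
```

The posterior inclusion probabilities play the role of the model-selection decisions described above: markers (here, the first and fourth simulated covariates) with true effects should show inclusion probabilities near one.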
89. The vervet regulator of G protein signaling 4 (RGS4) gene, a candidate gene for quantifiable behavioral dimensions associated with psychopathology: sequence, bioinformatic analysis, and association study of a novel polymorphism with social isolation. Trakadis, John. January 2004.
Regulators of G-protein signaling (RGS) accelerate GTP hydrolysis and consequently influence signal termination. The RGS4 gene has recently been reported to be implicated in a wide range of neuropsychiatric disorders, including schizophrenia, Alzheimer's disease, and addictions.

In this study, the vervet RGS4 gene was sequenced on a CEQ 8000 genetic analysis system (Beckman Coulter) and characterized using molecular and bioinformatic tools. Overall, the vervet sequence showed 95.3% identity with the human RGS4 gene.

SNPs in the region encompassing the proximal promoter, exon 1, and the first 450 bp of intron 1 were then identified by direct sequencing of 8 unrelated individuals. One of the identified SNPs, +35 [A/G], was genotyped in 155 juvenile vervets previously phenotyped for personality traits, including social isolation. Although the preliminary association analysis did not attain statistical significance (p = 0.074), the sample size is small, and additional genotyping of phenotypically defined individuals needs to be undertaken.
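For illustration only, a basic allelic association test of the kind used in such preliminary analyses might look like the following; the counts are invented and are not the vervet data described above.

```python
from scipy.stats import chi2_contingency

# Simple allelic association test for a SNP such as +35 [A/G] against a
# dichotomized trait; the counts below are invented, not the study data.
#          allele A, allele G
counts = [[70, 40],    # high social isolation
          [95, 105]]   # low social isolation
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```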
90. Algorithms and statistics for the detection of binding sites in coding regions. Chen, Hui (1974-). January 2006.
This thesis addresses the problem of detecting binding sites in coding regions. A new comparative analysis method is developed by improving an existing method called COSMO.

The inter-species sequence conservation observed in coding regions may result from two types of selective pressure: selective pressure on the encoded protein and, sometimes, selective pressure on the binding sites. To predict a region within a coding sequence as a binding site, one needs to verify that the conservation observed there is not due to selective pressure on the encoded protein alone. To achieve this, COSMO built a null model with only the selective pressure on the encoded protein and computed p-values for the observed conservation scores, conditional on the fixed set of amino acids observed at the leaves.

It is believed, however, that the selective pressure on the protein assumed in COSMO is overly strong, so some interesting regions may be left undetected. In this thesis, a new method, COSMO-2, is developed to relax this assumption.

The amino acids are first classified into a fixed number of overlapping functional classes by applying an expectation-maximization algorithm to a protein database. Two probabilities are then calculated for each gene position: (i) the probability of observing a certain degree of conservation in the orthologous sequences generated under each class in the null model (i.e. the p-value of the observed conservation under each class); and (ii) the probability that the codon column associated with that gene position belongs to each class. The p-value of the observed conservation at each gene position is the sum over all classes of the products of these two probabilities. Regions with low p-values are identified as potential binding sites.

Five sets of orthologous genes were analyzed using COSMO-2. The results show that COSMO-2 detects the interesting regions identified by COSMO and, in some cases, detects additional interesting regions.
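The combination step described above reduces to a weighted sum, which the short sketch below makes explicit; the class-conditional p-values and class posteriors are invented numbers for a single hypothetical gene position.

```python
import numpy as np

# COSMO-2-style combination: the per-position p-value is the sum, over amino
# acid functional classes, of (p-value of the observed conservation under that
# class's null model) x (posterior probability that the codon column belongs
# to that class).  All numbers below are invented for illustration.
def combined_pvalue(pval_given_class, class_posterior):
    pval_given_class = np.asarray(pval_given_class)
    class_posterior = np.asarray(class_posterior)
    return float(np.sum(pval_given_class * class_posterior))

# Three hypothetical classes for one gene position:
pval_given_class = [0.002, 0.08, 0.40]   # P(conservation >= observed | class)
class_posterior = [0.70, 0.25, 0.05]     # P(class | codon column)
print(combined_pvalue(pval_given_class, class_posterior))   # 0.0414
```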