Global ETD Search

21	A Bayesian chromosome painting approach to detect signals of incomplete positive selection in sequence data : applications to 1000 genomes Gamble, Christopher Thomas January 2014 Methods to detect patterns of variation associated with ongoing positive selection often focus on identifying regions of the genome with extended haplotype homozygosity, which is indicative of recently shared ancestry. Whilst these methods have been shown to be powerful, they face two major challenges. First, they are constructed to detect variation associated with a classical selective sweep, in which a single haplotype background is swept up to a higher than expected frequency given its age. Recent studies have shown that other forms of positive selection, e.g. selection on standing variation, may be more prevalent than previously thought. Under such evolution, a mutation that is already segregating in the population becomes beneficial, possibly as a result of an environmental change. The second challenge is that these methods base their inference on non-parametric tests of significance, which can result in uncontrolled false positive rates. We tackle these problems using two approaches. First, by exploiting a widely used model in population genomics, we construct a new approach to detect regions where a subset of the chromosomes are much more related than expected genome-wide. We show that this metric is sensitive both to classical selective sweeps and to soft selective sweeps, e.g. selection on standing variation. Second, building on existing methods, we construct a Bayesian test which bi-partitions chromosomes at every position based on their allelic type and tests for association between chromosomes carrying one allele and significantly reduced time to common ancestor. Using simulated data, we show that this results in a powerful, fast, and robust approach to detect signals of positive selection in sequence data. 
Moreover, by comparing our model to existing techniques, we show that we have similar power to detect recent classical selective sweeps, and considerably greater power to detect soft selective sweeps. We apply our method, ABACUS, to three human populations using data from the 1000 Genomes Project. Using existing and novel candidates of positive selection, we show that the results of ABACUS and existing methods are comparable in regions of classical selection, and that those of ABACUS are arguably superior in regions that show evidence for recent selection on standing variation. 572.8
22	Factorial Hidden Markov Models for full and weakly supervised supertagging Ramanujam, Srivatsan 2009 August 1900 For many sequence prediction tasks in Natural Language Processing, modeling dependencies between individual predictions can be used to improve prediction accuracy of the sequence as a whole. Supertagging involves assigning lexical entries to words based on a lexicalized grammatical theory such as Combinatory Categorial Grammar (CCG). Previous work has used Bayesian HMMs to learn taggers for both POS tagging and supertagging separately. Modeling them jointly has the potential to produce more robust and accurate supertaggers trained with less supervision, and thereby potentially help in the creation of useful models for new languages and domains. Factorial Hidden Markov Models (FHMMs) support joint inference for multiple sequence prediction tasks. Here, I use them to jointly predict part-of-speech tag and supertag sequences with varying levels of supervision. First, I show that supervised training of FHMMs improves performance compared to standard HMMs, especially when labeled training material is scarce. Secondly, FHMMs trained from tag dictionaries rather than labeled examples also perform better than a standard HMM. Finally, I show that an FHMM and a maximum entropy Markov model can complement each other in a single-step co-training setup that improves the performance of both models when there is limited labeled training material available. / text Hidden Markov Models Bayesian Models Categorial Grammar Supertagging Joint Inference
23	Family of Hidden Markov Models and its applications to phylogenetics and metagenomics Nguyen, Nam-phuong Duc 24 October 2014 A Profile Hidden Markov Model (HMM) is a statistical model for representing a multiple sequence alignment (MSA). Profile HMMs are important tools for sequence homology detection and have been used in a wide range of bioinformatics applications, including protein structure prediction, remote homology detection, and sequence alignment. Profile HMM methods result in accurate alignments on datasets with evolutionarily similar sequences; however, I will show that on datasets with evolutionarily divergent sequences, the accuracy of HMM-based methods degrades. My dissertation presents a new statistical model for representing an MSA by using a set of HMMs. The family of HMM (fHMM) approach uses multiple HMMs instead of a single HMM to represent an MSA. I present a new algorithm for sequence alignment using the fHMM technique. I show that using the fHMM technique for sequence alignment results in more accurate alignments than the single HMM approach. As sequence alignment is a fundamental step in many bioinformatics pipelines, improvements to sequence alignment result in improvements across many different fields. I show the applicability of fHMM to three specific problems: phylogenetic placement, taxonomic profiling and identification, and MSA estimation. In phylogenetic placement, the problem addressed is how to insert a query sequence into an existing tree. In taxonomic identification and profiling, the problems addressed are how to taxonomically classify a query sequence, and how to estimate a taxonomic profile on a set of sequences. Finally, both profile HMM and fHMM require a backbone MSA as input in order to align the query sequences. In MSA estimation, the problem addressed is how to estimate a "de novo" MSA without the use of an existing backbone alignment. 
For each problem, I present a software pipeline that implements the fHMM specifically for that domain: SEPP for phylogenetic placement, TIPP for taxonomic profiling and identification, and UPP for MSA estimation. I show that SEPP has improved accuracy compared to the single HMM approach. I also show that SEPP results in more accurate phylogenetic placements compared to existing placement methods, and that SEPP is more computationally efficient, both in peak memory usage and running time. I show that TIPP more accurately classifies novel sequences compared to the single HMM approach, and that TIPP estimates more accurate taxonomic profiles than leading methods on simulated metagenomic datasets. I show how UPP can estimate "de novo" alignments using fHMM. I present results that show UPP is more accurate and efficient than existing alignment methods, and estimates accurate alignments and trees on datasets containing both full-length and fragmentary sequences. Finally, I show that UPP can estimate a very accurate alignment on a dataset with 1,000,000 sequences in less than 2 days without the need of a supercomputer. / Computer Sciences / text Metagenomics Hidden Markov models Phylogenetics Multiple sequence alignment
24	Etude probabiliste et statistique des grandes bases de données. / Probabilistic and statistical study of large databases. Low-Kam, Cécile 07 December 2010 This Ph.D. thesis lies at the interface of statistics and data mining. It contains three independent parts. In the first, we aim to estimate the order (the number of hidden states) of a hidden Markov model whose emission distribution belongs to the exponential family. We consider the case where no upper bound on this order is known a priori. We define two penalised estimators for this order, one based on the maximum likelihood and the other on a Bayesian mixture statistic. We prove that both estimators are strongly consistent. In the second part, we extract sequential patterns whose frequency is exceptionally high relative to a Markov model. We first dynamically enumerate all the possible positions of a pattern within a sequence. The observed frequency is then compared to the expected frequency using a binomial test, and a procedure is applied to account for multiple testing. Experiments are conducted on synthetic databases and protein sequences. Finally, in the third part, we address the computation of the kernel density estimator. The observations are grouped in hierarchical binary tree structures. Computations are performed on the nodes of the trees, rather than on the individual points, for greater efficiency. We carry out the computation on a sample of the points in each node, instead of all of them, using non-parametric concentration inequalities to control the error. We also propose a new tree traversal that performs this sampling on a reduced number of nodes. We test our approach on synthetic datasets. Statistics Data mining Markov models
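The second part of the abstract above compares an observed pattern frequency with the frequency expected under a Markov model, using a binomial test with a multiple-testing procedure. A minimal Python sketch of that kind of comparison follows. It is an illustration under stated assumptions, not the thesis's method: the expected probabilities are taken as given rather than derived from a fitted Markov model, the correction used is Bonferroni (the abstract does not name its procedure), and the names `binomial_upper_tail` and `exceptional_patterns` are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceptional_patterns(observed, n_positions, expected_prob, alpha=0.05):
    """Flag patterns whose observed count is unexpectedly high.

    observed:      dict mapping pattern -> observed count
    n_positions:   number of positions at which a pattern could occur
    expected_prob: dict mapping pattern -> per-position probability under
                   the null model (assumed to be supplied by the caller)
    alpha:         family-wise error level, split evenly across patterns
                   (Bonferroni correction, an assumption of this sketch)
    """
    threshold = alpha / len(observed)
    flagged = {}
    for pattern, count in observed.items():
        p_value = binomial_upper_tail(count, n_positions, expected_prob[pattern])
        if p_value <= threshold:
            flagged[pattern] = p_value
    return flagged


# Toy data: both patterns are expected about 10 times in 100 positions.
observed = {"AB": 30, "CD": 12}
expected_prob = {"AB": 0.1, "CD": 0.1}
flagged = exceptional_patterns(observed, n_positions=100, expected_prob=expected_prob)
```

In this toy input only "AB" is flagged: 30 occurrences against an expectation of about 10 is far in the upper tail, while 12 is well within ordinary variation.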
25	Equivalence and Reduction of Hidden Markov Models Balasubramanian, Vijay 01 January 1993 This report studies when and why two Hidden Markov Models (HMMs) may represent the same stochastic process. HMMs are characterized in terms of equivalence classes whose elements represent identical stochastic processes. This characterization yields polynomial time algorithms to detect equivalent HMMs. We also find fast algorithms to reduce HMMs to essentially unique and minimal canonical representations. The reduction to a canonical form leads to the definition of 'Generalized Markov Models', which are essentially HMMs without the positivity constraint on their parameters. We discuss how this generalization can yield more parsimonious representations of stochastic processes at the cost of the probabilistic interpretation of the model parameters. Hidden Markov Models minimization statistical modelling stochastic processes
26	Enhancements to Hidden Markov Models for Gene Finding and Other Biological Applications Vinar, Tomas January 2005 In this thesis, we present enhancements of hidden Markov models for the problem of finding genes in DNA sequences. Genes are the parts of DNA that serve as a template for synthesis of proteins. Thus, gene finding is a crucial step in the analysis of DNA sequencing data. Hidden Markov models are a key tool used in gene finding. This thesis presents three methods for extending the capabilities of hidden Markov models to better capture the statistical properties of DNA sequences. In all three, we encounter limiting factors that lead to trade-offs between the model accuracy and those limiting factors. First, we build better models for recognizing biological signals in DNA sequences. Our new models capture non-adjacent dependencies within these signals. In this case, the main limiting factor is the amount of training data: more training data allows more complex models. Second, we design methods for better representation of length distributions in hidden Markov models, where we balance the accuracy of the representation against the running time needed to find genes in novel sequences. Finally, we show that creating hidden Markov models with complex topologies may be detrimental to the prediction accuracy, unless we use more complex prediction algorithms. However, such algorithms require longer running time, and in many cases the prediction problem is NP-hard. For gene finding this means that incorporating some of the prior biological knowledge into the model would require impractical running times. However, we also demonstrate that our methods can be used for solving other biological problems, where input sequences are short. As a model example to evaluate our methods, we built a gene finder ExonHunter that outperforms programs commonly used in genome projects. 
Computer Science gene finding hidden Markov models probabilistic modeling
27	Design and Evaluation of a Presentation Maestro: Controlling Electronic Presentations Through Gesture Fourney, Adam January 2009 Gesture-based interaction has long been seen as a natural means of input for electronic presentation systems; however, gesture-based presentation systems have not been evaluated in real-world contexts, and the implications of this interaction modality are not known. This thesis describes the design and evaluation of Maestro, a gesture-based presentation system which was developed to explore these issues. This work is presented in two parts. The first part describes Maestro's design, which was informed by a small observational study of people giving talks, and Maestro's evaluation, which involved a two-week field study where Maestro was used for lecturing to a class of approximately 100 students. The observational study revealed that presenters regularly gesture towards the content of their slides. As such, Maestro supports several gestures which operate directly on slide content (e.g., pointing to a bullet causes it to be highlighted). The field study confirmed that audience members value these content-centric gestures. Conversely, the use of gestures for navigating slides is perceived to be less efficient than the use of a remote. Additionally, gestural input was found to result in a number of unexpected side effects which may hamper the presenter's ability to fully engage the audience. The second part of the thesis presents a gesture recognizer based on discrete hidden Markov models (DHMMs). Here, the contributions lie in presenting a feature set and a factorization of the standard DHMM observation distribution, which allows modeling of a wide range of gestures (e.g., both one-handed and bimanual gestures), but which uses few modeling parameters. To establish the overall robustness and accuracy of the recognition system, five new users and one expert were asked to perform ten instances of each gesture. 
The system accurately recognized 85% of gestures for new users, increasing to 96% for the expert user. In both cases, false positives accounted for fewer than 4% of all detections. These error rates compare favourably to those of similar systems. Gesture-based interface electronic presentation hidden Markov models Computer Science
28	Improvements in the Accuracy of Pairwise Genomic Alignment Hudek, Alexander Karl January 2010 Pairwise sequence alignment is a fundamental problem in bioinformatics with wide applicability. This thesis presents three new algorithms for this well-studied problem. First, we present a new algorithm, RDA, which aligns sequences in small segments, rather than by individual bases. Then, we present two algorithms for aligning long genomic sequences: CAPE, a pairwise global aligner, and FEAST, a pairwise local aligner. RDA produces interesting alignments that can be substantially different in structure from traditional alignments. It is also better than traditional alignment at the task of homology detection. However, its main drawback is a very slow run time. Further, although it produces alignments with different structure, it is not clear whether the differences have a practical value in genomic research. Our main success comes from our local aligner, FEAST. We describe two main improvements: a new, more descriptive model of evolution, and a new local extension algorithm that considers all possible evolutionary histories rather than only the most likely. Our new model of evolution provides for improved alignment accuracy and substantially improved parameter training. In particular, we produce a new parameter set for aligning human and mouse sequences that properly describes regions of weak similarity and regions of strong similarity. The second result is our new extension algorithm. Depending on heuristic settings, our new algorithm can provide more sensitivity than existing extension algorithms, more specificity, or a combination of the two. By comparing to CAPE, our global aligner, we find that the sensitivity increase provided by our local extension algorithm is so substantial that it outperforms CAPE on sequence with 0.9 or more expected substitutions per site. 
CAPE itself gives improved sensitivity for sequence with 0.7 or more expected substitutions per site, but at a great run time cost. FEAST and our local extension algorithm improve on this too: the run time is only slightly slower than that of existing local alignment algorithms and is asymptotically the same. bioinformatics pairwise alignment Hidden Markov Models Computer Science
29	Probabilistic Models for Genetic and Genomic Data with Missing Information Hicks, Stephanie 16 September 2013 Genetic and genomic data often contain unobservable or missing information. Probabilistic models such as mixture models and hidden Markov models (HMMs) have been widely used since the 1960s to make inference on unobserved information using some observed information, demonstrating the versatility and importance of these models. Biological applications of mixture models include gene expression data, meta-analysis, disease mapping, epidemiology and pharmacology, and applications of HMMs include gene finding, linkage analysis, phylogenetic analysis and identifying regions of identity-by-descent. An important statistical and informatics challenge posed by modern genetics is to understand the functional consequences of genetic variation and its relation to phenotypic variation. In the analysis of whole-exome sequencing data, predicting the impact of missense mutations on protein function is an important factor in identifying and determining the clinical importance of disease susceptibility mutations in the absence of independent data determining impact on disease. In addition to the interpretation, identifying co-inherited regions of related individuals with Mendelian disorders can further narrow the search for disease susceptibility mutations. In this thesis, we develop two probabilistic models for application to genetic and genomic data with missing information: 1) a mixture model to estimate a posterior probability of functionality of missense mutations and 2) an HMM to identify co-inherited regions in the exomes of related individuals. The first application combines functional predictions from available computational or in silico methods, which often have a high degree of disagreement, leading to conflicting results for the user to assess the pathogenic impact of missense mutations on protein function. 
The second application considers extensions of a first-order HMM to include conditional emission probabilities varying as a function of minor allele frequency and a second-order dependence structure between observed variant calls. We apply these models to whole-exome sequencing data and show how these models can be used to identify disease susceptibility mutations. As disease-gene identification projects increasingly use next-generation sequencing, the probabilistic models developed in this thesis help identify and associate relevant disease-causing mutations with human disorders. The purpose of this thesis is to demonstrate that probabilistic models can contribute to more accurate and dependable inference based on genetic and genomic data with missing information. Statistics Statistical Genomics Bioinformatics Mixture Models Hidden Markov Models
30	Enhancements to Hidden Markov Models for Gene Finding and Other Biological Applications Vinar, Tomas January 2005 In this thesis, we present enhancements of hidden Markov models for the problem of finding genes in DNA sequences. Genes are the parts of DNA that serve as a template for synthesis of proteins. Thus, gene finding is a crucial step in the analysis of DNA sequencing data. Hidden Markov models are a key tool used in gene finding. This thesis presents three methods for extending the capabilities of hidden Markov models to better capture the statistical properties of DNA sequences. In all three, we encounter limiting factors that lead to trade-offs between the model accuracy and those limiting factors. First, we build better models for recognizing biological signals in DNA sequences. Our new models capture non-adjacent dependencies within these signals. In this case, the main limiting factor is the amount of training data: more training data allows more complex models. Second, we design methods for better representation of length distributions in hidden Markov models, where we balance the accuracy of the representation against the running time needed to find genes in novel sequences. Finally, we show that creating hidden Markov models with complex topologies may be detrimental to the prediction accuracy, unless we use more complex prediction algorithms. However, such algorithms require longer running time, and in many cases the prediction problem is NP-hard. For gene finding this means that incorporating some of the prior biological knowledge into the model would require impractical running times. However, we also demonstrate that our methods can be used for solving other biological problems, where input sequences are short. As a model example to evaluate our methods, we built a gene finder ExonHunter that outperforms programs commonly used in genome projects. 
Computer Science gene finding hidden Markov models probabilistic modeling
