Global ETD Search

51	Recherche de domaines protéiques divergents à l'aide de modèles de Markov cachés : application à Plasmodium falciparum / Protein Domain Detection with Hidden Markov Models : application to Plasmodium falciparum Terrapon, Nicolas 03 December 2010 (has links) Les modèles de Markov cachés (MMC) par exemple ceux de la librairie Pfam sont des outils très populaires pour l'annotation des domaines protéiques. Cependant, ils ne sont pas toujours adaptés aux protéines les plus divergentes. C'est notamment le cas avec Plasmodium falciparum (principal agent du paludisme chez l'Homme), où les MMC de Pfam identifient peu de familles distinctes de domaines, et couvrent moins de 50% des protéines de l'organisme. L'objectif de cette thèse est d'apporter des méthodes nouvelles pour affiner la détection de domaines dans les protéines divergentes.Le premier axe développé est une approche d'identification de domaines utilisant leurs propriétés de co-occurrence. Différentes études ont montré que la majorité des domaines apparaissent dans les protéines avec un ensemble très réduits d'autres domaines favoris. Notre méthode exploite cette propriété pour détecter des domaines trop divergents pour être identifiés par l'approche classique. Cette détection s'accompagne d'une estimation du taux d'erreur par une procédure de ré-échantillonnage. Chez P. falciparum, elle permet d'identifier, avec un taux d'erreur estimé inférieur à 20%, 585 nouveaux domaines dont 159 familles étaient inédites dans cet organisme ce qui représente 16% du nombre de domaines connus.Le second axe de mes recherches présente plusieurs méthodes de corrections statistiques et évolutives des MMC pour l'annotation d'organismes divergents. Deux types d'approches ont été proposées. D'un côté, nous intégrons aux alignements d'apprentissage des MMC, les séquences précédemment identifiés dans l'organisme cible ou ses proches relatifs. La limitation de cette solution est que seules des familles de domaines déjà connues dans le taxon peuvent ainsi être identifiées. Le deuxième type d'approche contourne cette limitation en corrigeant tous les modèles par une prise en compte de l'évolution des séquences d'apprentissage. Pour cela, nous faisons appel à des techniques classiques de la bioinformatique et de l'apprentissage statistique. Les résultats obtenus offrent un ensemble de prédictions complémentaires totalisant 663 nouveaux domaines supplémentaires dont 504 familles inédites soit une augmentation de 18% à ajouter aux précédents résultats. / Hidden Markov Models (HMMs) from Pfam database for example are popular tools for protein domain annotation. However, they are not well suited for studying highly divergent proteins. This is notably the case with Plasmodium falciparum (main causal agent of human malaria), where Pfam HMMs identify few distinct domain families and cover less than 50% of its proteins. This thesis aims at providing new methods to enhance domain detection in divergent proteins.The first axis of this work is an approach of domain identification based on domain co-occurrence. Several studies shown that a majority of domains appear in proteins with a small set of other favourite domains. Our method exploits this tendency to detect domains escaping to the classical procedure because of their divergence. Detected domains come along with an false discovery rate (FDR) estimation computed with a shuffling procedure. In P. falciparum proteins, this approach allows us identify, with an FDR below 20%, 585 new domains with 159 families that were previously unseen in this organism which account for 16% of the known domains.The second axis of my researches involves the development of statistical and evolutionary methods of HMM correction to improve the annotation of divergent organisms. Two kind of approaches are proposed. On the one hand, the sequences previously identified in the target organism and its close relatives are integrated in the learning alignments. An obvious limitation of this solution is that only new occurrences of previously known families in the taxon can be discovered. On the other hand, we evade this limitation by adjusting HMM parameters by simulating the evolution of the learning sequences. To this end, classical techniques from bioinformatics and statistical learning were used. Alternative libraries offer a complementary set of predictions summing 663 new domains with 504 previously unseen families corresponding to an improvement of 18% to add to the previous results. Domaines protéiques Modèles de Markov cachés Paludisme Protein Domains Hidden Markov Models Malaria
52	Estimação de modelos de Markov ocultos usando aritmética intervalar / Estimating hidden Markov model parameters using interval arithmetic Montanher, Tiago de Morais 24 April 2015 (has links) Modelos de Markov ocultos (MMOs) são uma ferramenta importante em matemática aplicada e estatística. Eles se baseiam em dois processos estocásticos. O primeiro é uma cadeia de Markov, que não é observada diretamente. O segundo é observável e sua distribuição depende do estado na cadeia de Markov. Supomos que os processos são discretos no tempo e assumem um número finito de estados. Para extrair informações dos MMOs, é necessário estimar seus parâmetros. Diversos algoritmos locais têm sido utilizados nas últimas décadas para essa tarefa. Nosso trabalho estuda a estimação de parâmetros em modelos de Markov ocultos, do ponto de vista da otimização global. Desenvolvemos algoritmos capazes de encontrar, em uma execução bem sucedida, todos os estimadores de máxima verossimilhança globais de um modelo de Markov oculto. Para tanto, usamos aritmética intervalar. Essa aritmética permite explorar sistematicamente o espaço paramétrico, excluindo regiões que não contém soluções. O cálculo da função objetivo é feito através da recursão \\textit, descrita na literatura estatística. Modificamos a extensão intervalar natural dessa recursão usando programação linear. Nossa abordagem é mais eficiente e produz intervalos mais estreitos do que a implementação padrão. Experimentos mostram ganhos de 16 a 250 vezes, de acordo com a complexidade do modelo. Revisamos os algoritmos locais, tendo em vista sua aplicação em métodos globais. Comparamos os algoritmos de Baum-Welch, pontos interiores e gradientes projetados espectrais. Concluímos que o método de Baum-Welch é o mais indicado como auxiliar em otimização global. Modificamos o \\textit{interval branch and bound} para resolver a estimação de modelos com eficiência. Usamos as condições KKT e as simetrias do problema na construção de testes para reduzir ou excluir caixas. Implementamos procedimentos de aceleração da convergência, como o método de Newton intervalar e propagação de restrições e da função objetivo. Nosso algoritmo foi escrito em \\textit{C++}, usando programação genérica. Mostramos que nossa implementação dá resultados tão bons quanto o resolvedor global BARON, porém com mais eficiência. Em média, nosso algoritmo é capaz de resolver $50\\%$ mais problemas no mesmo período de tempo. Concluímos estudando aspectos qualitativos dos MMOs com mistura Bernoulli. Plotamos todos os máximos globais detectados em instâncias com poucas observações e apresentamos novos limitantes superiores da verossimilhança baseados na divisão de uma amostra grande em grupos menores. / Hidden Markov models(HMMs) are an important tool in statistics and applied mathematics. Our work deals with processes formed by two discrete time and finite state space stochastic processes. The first process is a Markov chain and is not directly observed. On the other hand, the second process is observable and its distribution depends on the current state of the hidden component. In order to extract conclusions from a Hidden Markov Model we must estimate the parameters that defines it. Several local algorithms has been used to handle with this task. We present a global optimization approach based on interval arithmetic to maximize the likelihood function. Interval arithmetic allow us to explore parametric space systematically, discarding regions which cannot contain global maxima. We evaluate the objective function and its derivatives by the so called backward recursion and show that is possible to obtain sharper interval extensions for such functions using linear programming. Numerical experiments shows that our approach is $16$ to $250$ times more efficient than standard implementations. We also study local optimization algorithms hidden Markov model estimation. We compare Baum-Welch procedure with interior points and spectral projected gradients. We conclude that Baum-Welch is the best option as a sub-algorithm in a global optimization framework. We improve the well known interval branch and bound algorithm to take advantages on the problem structure. We derive new exclusion tests, based on its KKT conditions and symmetries. We implement our approach in C++, under generic programming paradigm. We show that our implementation is compatible with global optimization solver BARON in terms of precision. We also show that our algorithm is faster than BARON. In average, we can handle with $50\\%$ more problems within the same amount of time. We conclude studying qualitative aspects of Bernoulli hidden Markov models. We plot all global maxima found in small observations instances and show a new upper bound of the likelihood based on splitting observations in small groups. Aritmética intervalar Global optimization Hidden Markov models Interval arithmetic Modelos de Markov ocultos Otimização global
53	Avaliando um rotulador estatístico de categorias morfo-sintáticas para a língua portuguesa / Evaluating a stochastic part-of-speech tagger for the portuguese language Villavicencio, Aline January 1995 (has links) O Processamento de Linguagem Natural (PLN) é uma área da Ciência da Computação, que vem tentando, ao longo dos anos, aperfeiçoar a comunicação entre o homem e o computador. Varias técnicas tem sido utilizadas para aperfeiçoar esta comunicação, entre elas a aplicação de métodos estatísticos. Estes métodos tem sido usados por pesquisadores de PLN, com um crescente sucesso e uma de suas maiores vantagens é a possibilidade do tratamento de textos irrestritos. Em particular, a aplicação dos métodos estatísticos, na marcação automática de "corpus" com categorias morfo-sintáticas, tem se mostrado bastante promissora, obtendo resultados surpreendentes. Assim sendo, este trabalho descreve o processo de marcação automática de categorias morfo-sintáticas. Inicialmente, são apresentados e comparados os principais métodos aplicados a marcação automática: os métodos baseados em regras e os métodos estatísticos. São descritos os principais formalismos e técnicas usadas para esta finalidade pelos métodos estatísticos. E introduzida a marcação automática para a Língua Portuguesa, algo até então inédito. O objetivo deste trabalho é fazer um estudo detalhado e uma avaliação do sistema rotulador de categorias morfo-sintáticas, a fim de que se possa definir um padrão no qual o sistema apresente a mais alta precisão possível. Para efetuar esta avaliação, são especificados alguns critérios: a qualidade do "corpus" de treinamento, o seu tamanho e a influencia das palavras desconhecidas. A partir dos resultados obtidos, espera-se poder aperfeiçoar o sistema rotulador, de forma a aproveitar, da melhor maneira possível, os recursos disponíveis para a Língua Portuguesa. / Natural Language Processing (NLP) is an area of Computer Sciences, that have been trying to improve communication between human beings and computers. A number of different techniques have been used to improve this communication and among them, the use of stochastic methods. These methods have successfully being used by NLP researchers and one of their most remarkable advantages is that they are able to deal with unrestricted texts. Namely, the use of stochastic methods for part-of-speech tagging has achieving some extremely good results. Thus, this work describes the process of part-of-speech tagging. At first, we present and compare the main tagging methods: the rule-based methods and the stochastic ones. We describe the main stochastic tagging formalisms and techniques for part-of-speech tagging. We also introduce part-of-speech tagging for the Portuguese Language. The main purpose of this work is to study and evaluate a part-of-speech tagger system in order to establish a pattern in which it is possible to achieve the greatest accuracy. To perform this evaluation, several parameters were set: the corpus quality, its size and the relation between unknown words and accuracy. The results obtained will be used to improve the tagger, in order to use better the available Portuguese Language resources. Linguística computacional Processamento : Linguagem natural Part-of-speech taggers Corpus Hidden markov models
54	Finite horizon robust state estimation for uncertain finite-alphabet hidden Markov models Xie, Li, Information Technology & Electrical Engineering, Australian Defence Force Academy, UNSW January 2004 (has links) In this thesis, we consider a robust state estimation problem for discrete-time, homogeneous, first-order, finite-state finite-alphabet hidden Markov models (HMMs). Based on Kolmogorov's Theorem on the existence of a process, we first present the Kolmogorov model for the HMMs under consideration. A new change of measure is introduced. The statistical properties of the Kolmogorov representation of an HMM are discussed on the canonical probability space. A special Kolmogorov measure is constructed. Meanwhile, the ergodicity of two expanded Markov chains is investigated. In order to describe the uncertainty of HMMs, we study probability distance problems based on the Kolmogorov model of HMMs. Using a change of measure technique, the relative entropy and the relative entropy rate as probability distances between HMMs, are given in terms of the HMM parameters. Also, we obtain a new expression for a probability distance considered in the existing literature such that we can use an information state method to calculate it. Furthermore, we introduce regular conditional relative entropy as an a posteriori probability distance to measure the discrepancy between HMMs when a realized observation sequence is given. A representation of the regular conditional relative entropy is derived based on the Radon-Nikodym derivative. Then a recursion for the regular conditional relative entropy is obtained using an information state method. Meanwhile, the well-known duality relationship between free energy and relative entropy is extended to the case of regular conditional relative entropy given a sub-[special character]-algebra. Finally, regular conditional relative entropy constraints are defined based on the study of the probability distance problem. Using a Lagrange multiplier technique and the duality relationship for regular conditional relative entropy, a finite horizon robust state estimator for HMMs with regular conditional relative entropy constraints is derived. A complete characterization of the solution to the robust state estimation problem is also presented. Entropy estimation hidden Markov models (HMM) Kolmogorov Markov processes optimization posteriori probability robust state estimator
55	Efficient Methods for Automatic Speech Recognition Seward, Alexander January 2003 (has links) This thesis presents work in the area of automatic speech recognition (ASR). The thesis focuses on methods for increasing the efficiency of speech recognition systems and on techniques for efficient representation of different types of knowledge in the decoding process. In this work, several decoding algorithms and recognition systems have been developed, aimed at various recognition tasks. The thesis presents the KTH large vocabulary speech recognition system. The system was developed for online (live) recognition with large vocabularies and complex language models. The system utilizes weighted transducer theory for efficient representation of different knowledge sources, with the purpose of optimizing the recognition process. A search algorithm for efficient processing of hidden Markov models (HMMs) is presented. The algorithm is an alternative to the classical Viterbi algorithm for fast computation of shortest paths in HMMs. It is part of a larger decoding strategy aimed at reducing the overall computational complexity in ASR. In this approach, all HMM computations are completely decoupled from the rest of the decoding process. This enables the use of larger vocabularies and more complex language models without an increase of HMM-related computations. Ace is another speech recognition system developed within this work. It is a platform aimed at facilitating the development of speech recognizers and new decoding methods. A real-time system for low-latency online speech transcription is also presented. The system was developed within a project with the goal of improving the possibilities for hard-of-hearing people to use conventional telephony by providing speech-synchronized multimodal feedback. This work addresses several additional requirements implied by this special recognition task. / QC 20100811 speech recognition algorithms hidden markov models HMM weigted finite-state transducers
56	PELICAN : a PipELIne, including a novel redundancy-eliminating algorithm, to Create and maintain a topicAl family-specific Non-redundant protein database Andersson, Christoffer January 2005 (has links) The increasing number of biological databases today requires that users are able to search more efficiently among as well as in individual databases. One of the most widespread problems is redundancy, i.e. the problem of duplicated information in sets of data. This thesis aims at implementing an algorithm that distinguishes from other related attempts by using the genomic positions of sequences, instead of similarity based sequence comparisons, when making a sequence data set non-redundant. In an automatic updating procedure the algorithm drastically increases the possibility to update and to maintain the topicality of a non-redundant database. The procedure creates a biologically sound non-redundant data set with accuracy comparable to other algorithms focusing on making data sets non-redundant redundancy BLAT genomic positions profile hidden Markov models G-protein coupled receptors Bioinformatics Bioinformatik
57	Convergence in distribution for filtering processes associated to Hidden Markov Models with densities Kaijser, Thomas January 2013 (has links) A Hidden Markov Model generates two basic stochastic processes, a Markov chain, which is hidden, and an observation sequence. The filtering process of a Hidden Markov Model is, roughly speaking, the sequence of conditional distributions of the hidden Markov chain that is obtained as new observations are received. It is well-known, that the filtering process itself, is also a Markov chain. A classical, theoretical problem is to find conditions which implies that the distributions of the filtering process converge towards a unique limit measure. This problem goes back to a paper of D Blackwell for the case when the Markov chain takes its values in a finite set and it goes back to a paper of H Kunita for the case when the state space of the Markov chain is a compact Hausdor space. Recently, due to work by F Kochmann, J Reeds, P Chigansky and R van Handel, a necessary and sucient condition for the convergence of the distributions of the filtering process has been found for the case when the state space is finite. This condition has since been generalised to the case when the state space is denumerable. In this paper we generalise some of the previous results on convergence in distribution to the case when the Markov chain and the observation sequence of a Hidden Markov Model take their values in complete, separable, metric spaces; it has though been necessary to assume that both the transition probability function of the Markov chain and the transition probability function that generates the observation sequence have densities. Hidden Markov Models filtering processes convergence in distribution barycenter ergodicity Kantorovich metric.
58	Actuarial Inference and Applications of Hidden Markov Models Till, Matthew Charles January 2011 (has links) Hidden Markov models have become a popular tool for modeling long-term investment guarantees. Many different variations of hidden Markov models have been proposed over the past decades for modeling indexes such as the S&P 500, and they capture the tail risk inherent in the market to varying degrees. However, goodness-of-fit testing, such as residual-based testing, for hidden Markov models is a relatively undeveloped area of research. This work focuses on hidden Markov model assessment, and develops a stochastic approach to deriving a residual set that is ideal for standard residual tests. This result allows hidden-state models to be tested for goodness-of-fit with the well developed testing strategies for single-state models. This work also focuses on parameter uncertainty for the popular long-term equity hidden Markov models. There is a special focus on underlying states that represent lower returns and higher volatility in the market, as these states can have the largest impact on investment guarantee valuation. A Bayesian approach for the hidden Markov models is applied to address the issue of parameter uncertainty and the impact it can have on investment guarantee models. Also in this thesis, the areas of portfolio optimization and portfolio replication under a hidden Markov model setting are further developed. Different strategies for optimization and portfolio hedging under hidden Markov models are presented and compared using real world data. The impact of parameter uncertainty, particularly with model parameters that are connected with higher market volatility, is once again a focus, and the effects of not taking parameter uncertainty into account when optimizing or hedging in a hidden Markov are demonstrated. Hidden Markov Models Regime-Switching Models Residual Analysis MCMC Portfolio Optimization Portfolio Replication Actuarial Science
59	The Continuous Speech Recognition System Base on Hidden Markov Models with One-Stage Dynamic Programming Algorithm. Hsieh, Fang-Yi 03 July 2003 (has links) Based on Hidden Markov Models (HMM) with One-Stage Dynamic Programming Algorithm, a continuous-speech and speaker-independent Mandarin digit speech recognition system was designed in this work. In order to implement this architecture to fit the performance of hardware, various parameters of speech characteristics were defined to optimize the process. Finally, the ¡§State Duration¡¨ and the ¡§Tone Transition Property Parameter¡¨ were extracted from speech temporal information to improve the recognition rate. Via using the test database, experimental results show that this new ideal of one-stage dynamic programming algorithm , with ¡§state duration¡¨ and ¡§ tone transition property parameter¡¨ , will have 18% recognition rate increase when compare to the conventional one. For speaker-independent and connect-word recognition, this system will achieve recognition rate to 74%. For speaker-independent but isolate-word recognition, it will have recognition rate higher than 96%. Recognition rate of 92% is obtained as this system is applied to the connect-word speaker-dependent recognition. Hidden Markov Models Continuous Speech Recognition One-Stage Dynamic Programming Algorithm
60	Improving the efficacy of automated sign language practice tools Brashear, Helene Margaret 07 July 2010 (has links) The CopyCat project is an interdisciplinary effort to create a set of computer-aided language learning tools for deaf children. The CopyCat games allow children to interact with characters using American Sign Language (ASL). Through Wizard of Oz pilot studies we have developed a set of games, shown their efficacy in improving young deaf children's language and memory skills, and collected a large corpus of signing examples. Our previous implementation of the automatic CopyCat games uses automatic sign language recognition and verification in the infrastructure of a memory repetition and phrase verification task. The goal of my research is to expand the automatic sign language system to transition the CopyCat games to include the flexibility of a dialogue system. I have created a labeling ontology from analysis of the CopyCat signing corpus, and I have used the ontology to describe the contents of the CopyCat data set. This ontology was used to change and improve the automatic sign language recognition system and to add flexibility to language use in the automatic game. Automatic sign language recognition ASL Assistive technology Sign language Hidden Markov models Automatic speech recognition

Search results