151 |
In silico identification of PPR proteins
Le Sieur, Félix-Antoine 08 1900
Les protéines PentatricoPeptide-Repeats (PPR) représentent la plus grande famille de protéines de liaison à l’ARN connue. Elles sont caractérisées par la présence de motifs répétés en tandem d’environ 35 résidus ayant une structure hélice-tour-hélice. Depuis les premières études sur l’organisme modèle Arabidopsis thaliana, les protéines PPR ont aussi été découvertes chez d’autres espèces non-plantes, incluant les levures et l’humain. Cependant, la détection des protéines PPR en dehors des plantes est compliquée par le fait que les outils de recherche sont tous conçus pour les protéines de plantes. Récemment, une étude réalisée chez les levures a rapporté une méthode itérative semi-automatisée d’identification de PPR utilisant des profils Hidden Markov Models (HMM). Inspirés par cette approche, nous visons ici à développer une méthode complètement automatisée plus généralisable et sensible qui ne dépend pas du protéome de départ. Comme preuve de concept, nous avons choisi une espèce non reliée aux plantes possédant le plus grand nombre de protéines PPR en-dehors des plantes – le protiste marin unicellulaire Diplonema papillatum. Il s’agit d’un modèle émergent ayant reçu beaucoup d’intérêt pour l’excentricité de l’expression de son génome mitochondrial, pour lequel il a été suggéré que les protéines PPR jouent un rôle clé. Nous avons ici développé une approche itérative pour identifier et cataloguer les protéines PPR chez D. papillatum. Les fonctionnalités particulières de notre algorithme incluent l’inspection des intervalles de 30 à 40 résidus entre les motifs classiques déjà identifiés et l’utilisation des structures secondaires caractéristiques des motifs PPR pour valider les motifs candidats nouvellement identifiés. Au final, nous avons identifié près de 800 motifs PPR chez D.papillatum, dont plusieurs motifs « déviants » identifiés dans les espaces entre les motifs. La validation expérimentale des motifs candidats les plus prometteurs est en attente. 
/ PentatricoPeptide-Repeat (PPR) proteins represent the largest known family of RNA-binding proteins. They are defined by tandemly arranged motifs of about 35 residues that adopt a helix-turn-helix structure, referred to as PPR motifs. Since the seminal studies undertaken in the model organism Arabidopsis, a few PPR proteins have also been discovered outside plants, including in yeast and human. However, the detection of PPR proteins in non-plant eukaryotes is complicated by the fact that current search tools are tailored toward plants. Recently, a semi-automated method has been reported for identifying PPR motifs in yeast using iterative searches with profile Hidden Markov Models (HMMs). Inspired by this work, we aimed to develop a fully automated, sensitive approach that can detect PPR proteins in any species, using the corresponding proteome as input. As a proof of concept, we used the species that contains the largest number of PPR genes outside the plant kingdom: the unicellular protist Diplonema papillatum. This emerging model system has garnered much interest for the eccentricities of its mitochondrial gene expression, in which PPR proteins are posited to play a key role. Here, we have developed an iterative HMM-search method that comprehensively catalogues and classifies PPR motifs in D. papillatum. Particular features of our algorithm are that it closely inspects intervals of 30 to 40 residues between readily identified (classical) motifs, and that it uses the characteristic secondary structure of PPR motifs to validate newly detected candidate motifs. In total, we have identified around 800 PPR motifs in D. papillatum, including several deviant candidates detected in "gaps". High-ranking representatives of both classical and deviant motifs await experimental validation.
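The gap-inspection step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `candidate_gaps`, the `(start, end)` coordinate convention, and the example coordinates are all assumptions made for illustration. It only selects intervals of 30 to 40 residues between motifs already found in one protein; the subsequent HMM search and secondary-structure validation are not shown.

```python
def candidate_gaps(motifs, min_len=30, max_len=40):
    """Return (start, end) intervals between consecutive motifs whose
    length lies in [min_len, max_len] residues.

    motifs: list of (start, end) pairs for motifs already identified in
    one protein, 0-based with end exclusive (hypothetical representation).
    """
    ordered = sorted(motifs)
    gaps = []
    # Walk over each pair of neighbouring motifs along the sequence.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap_len = next_start - prev_end
        if min_len <= gap_len <= max_len:
            gaps.append((prev_end, next_start))
    return gaps


# Made-up motif coordinates: only the 35-residue gap qualifies.
hits = [(10, 45), (80, 115), (115, 150), (210, 245)]
print(candidate_gaps(hits))  # [(45, 80)]
```

Each returned interval would then be a region to re-examine for a "deviant" motif that the classical profile missed.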
|
152 |
VEHICLE RESPONSE PREDICTION USING PHYSICAL AND MACHINE LEARNING MODELS
Lanka, Venkata Raghava Ravi Teja, Lanka January 2017
No description available.
|
153 |
Cellular diagnostic systems using hidden Markov models
Mohammad, Maruf H. 29 November 2006
Radio frequency system optimization and troubleshooting remains one of the most challenging aspects of working in a cellular network. To stay competitive, cellular providers continually monitor the performance of their networks and use this information to determine where to improve or expand services. As a result, operators are saddled with the task of wading through overwhelmingly large amounts of data in order to trouble-shoot system problems. Part of the difficulty of this task is that for many complicated problems such as hand-off failure, clues about the cause of the failure are hidden deep within the statistics of underlying dynamic physical phenomena like fading, shadowing, and interference. In this research we propose that Hidden Markov Models (HMMs) be used as a method to infer signature statistics about the nature and sources of faults in a cellular system by fitting models to various time-series data measured throughout the network. By including HMMs in the network management tool, a provider can explore the statistical relationships between channel dynamics endemic to a cell and its resulting performance.
This research effort also includes a new distance measure between a pair of HMMs that approximates the Kullback-Leibler divergence (KLD). Since there is no closed-form solution for the KLD between HMMs, the proposed analytical expression is very useful in classification and identification problems. A novel HMM-based position location technique is also introduced that may be very useful for applications involving cognitive radios. / Ph. D.
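To show why an approximation is needed, here is a minimal sketch of the usual simulation-based alternative: a Monte Carlo estimate of the KLD rate between two discrete HMMs. This is not the analytical measure proposed in the thesis. The function names, the `(start, trans, emit)` tuple layout, and the two toy models `p` and `q` are assumptions for illustration only.

```python
import math
import random


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    total = sum(alpha)
    loglik = math.log(total)
    alpha = [a / total for a in alpha]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
        total = sum(alpha)
        loglik += math.log(total)          # accumulate the scaling factors
        alpha = [a / total for a in alpha]
    return loglik


def sample(length, start, trans, emit, rng):
    """Draw one observation sequence from the HMM."""
    states = range(len(start))
    symbols = range(len(emit[0]))
    state = rng.choices(states, weights=start)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.choices(symbols, weights=emit[state])[0])
        state = rng.choices(states, weights=trans[state])[0]
    return obs


def kld_rate_estimate(p, q, length=2000, seed=0):
    """Estimate (1/T) * [log P_p(x) - log P_q(x)] with x sampled from p."""
    rng = random.Random(seed)
    obs = sample(length, *p, rng)
    return (log_likelihood(obs, *p) - log_likelihood(obs, *q)) / length


# Toy models (made up): p is "sticky" and informative, q is uniform.
p = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]])
q = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
print(kld_rate_estimate(p, p), kld_rate_estimate(p, q))
```

The estimate depends on the sampled sequence and its length, which is the practical drawback that motivates a closed-form approximation.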
|
154 |
An integrated approach to feature compensation combining particle filters and Hidden Markov Models for robust speech recognition
Mushtaq, Aleem 19 September 2013
The performance of automatic speech recognition systems often degrades in adverse conditions where there is a mismatch between training and testing conditions. This is true for most modern systems, which employ Hidden Markov Models (HMMs) to decode speech utterances. One strategy is to map the distorted features back to clean speech features that correspond well to the features used for training the HMMs. This can be achieved by treating the noisy speech as a distorted version of the clean speech of interest. Under this framework, we can track and consequently extract the underlying clean speech from the noisy signal and use this derived signal to perform utterance recognition. The particle filter is a versatile tracking technique that can be used where conventional techniques such as the Kalman filter often fall short. We propose a particle-filter-based algorithm to compensate the corrupted features according to an additive noise model, incorporating both the statistics from clean speech HMMs and the observed background noise, to map noisy features back to clean speech features. Instead of using specific knowledge at the model and state levels from HMMs, which is hard to estimate, we pool model states into clusters as side information. Since each cluster encompasses more statistics than the original HMM states, there is a higher possibility that the newly formed probability density function at the cluster level can cover the underlying speech variation to generate appropriate particle filter samples for feature compensation. Additionally, a dynamic joint tracking framework that monitors the clean speech signal and the noise simultaneously is introduced to obtain good noise statistics. In this approach, the information available from clean speech tracking can be effectively used for noise estimation. The availability of dynamic noise information can enhance the robustness of the algorithm in case of large fluctuations in noise parameters within an utterance.
Testing the proposed PF-based compensation scheme on the Aurora 2 connected digit recognition task, we achieve an error reduction of 12.15% from the best multi-condition trained models using this integrated PF-HMM framework to estimate the cluster-based HMM state sequence information. Finally, we extended the PFC framework and evaluated it on a large-vocabulary recognition task, and showed that PFC works well for large-vocabulary systems also.
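For readers unfamiliar with particle filtering, the tracking idea can be sketched on a scalar toy problem. This is a generic bootstrap particle filter under an additive noise model, not the thesis's PF-HMM scheme: there are no speech features, no HMM cluster statistics, and no joint noise tracking. The random-walk signal model and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=0.1, noise_std=1.0, seed=0):
    """Track a scalar clean signal x_t from y_t = x_t + noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the (assumed) random-walk model.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # 2. Weight particles by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - x) / noise_std) ** 2)
                   for x in particles]
        total = sum(weights)
        if total == 0.0:  # numerical underflow: fall back to uniform weights
            weights = [1.0] * n_particles
            total = float(n_particles)
        # 3. Posterior-mean estimate of the clean value.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 4. Resample to concentrate particles in likely regions.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Simulated data: a slowly drifting clean signal observed in heavy noise.
sim = random.Random(1)
clean, x = [], 0.0
for _ in range(200):
    x += sim.gauss(0.0, 0.1)
    clean.append(x)
noisy = [c + sim.gauss(0.0, 1.0) for c in clean]
filtered = bootstrap_particle_filter(noisy)
print(rmse(noisy, clean), rmse(filtered, clean))
```

In the setting of the abstract, the proposal distribution would instead be informed by clean-speech HMM state clusters, which is the part this sketch leaves out.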
|
155 |
Improving algorithms of gene prediction in prokaryotic genomes, metagenomes, and eukaryotic transcriptomes
Tang, Shiyuyun 27 May 2016
Next-generation sequencing has generated an enormous amount of DNA and RNA sequence data that potentially carries volumes of genetic information, e.g. protein-coding genes. The thesis is divided into three main parts describing i) GeneMarkS-2, ii) GeneMarkS-T, and iii) MetaGeneTack.
In prokaryotic genomes, ab initio gene finders can predict genes with high accuracy. However, the error rate is not negligible and largely species-specific. Most errors in gene prediction are made in genes located in genomic regions with atypical GC composition, e.g. genes in pathogenicity islands. We describe a new algorithm GeneMarkS-2 that uses local GC-specific heuristic models for scoring individual ORFs in the first step of analysis. Predicted atypical genes are retained and serve as ‘external’ evidence in subsequent runs of self-training. GeneMarkS-2 also controls the quality of training process by effectively selecting optimal orders of the Markov chain models as well as duration parameters in the hidden semi-Markov model. GeneMarkS-2 has shown significantly improved accuracy compared with other state-of-the-art gene prediction tools.
Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) provides large numbers of RNA reads that can be assembled into a full transcriptome. We have developed a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. Unsupervised estimation of the algorithm's parameters makes several steps of conventional gene prediction protocols unnecessary, most importantly the manual curation of training sets. We have demonstrated that GeneMarkS-T self-training is robust to errors in assembled transcripts, and that the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting gene starts compares favorably to other existing methods.
Frameshift (FS) prediction is important for the analysis and biological interpretation of metagenomic sequences. Reads in metagenomic samples are prone to sequencing errors. Insertion and deletion errors that change the coding frame impair the accurate identification of protein-coding genes. Accurate frameshift prediction requires a sufficient amount of data to estimate the parameters of species-specific statistical models of protein-coding and non-coding regions. However, such data are not available; all we have is metagenomic sequences of unknown origin. The challenge of ab initio FS detection is, therefore, twofold: (i) to find a way to infer the necessary model parameters and (ii) to identify the positions of frameshifts (if any). We describe a new tool, MetaGeneTack, which uses a heuristic method to estimate the parameters of the sequence models used in the FS detection algorithm. It was shown on several test sets that the performance of MetaGeneTack FS detection is comparable to or better than that of the earlier program FragGeneScan.
|
156 |
Arabic text recognition of printed manuscripts : efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing
Al-Muhtaseb, Husni Abdulghani January 2010
Arabic text recognition has not been researched as thoroughly as text recognition for other languages. The need for automatic Arabic text recognition is clear. In addition to traditional applications like postal address reading, check verification in banks, and office automation, there is a large interest in searching scanned documents that are available on the internet and in searching handwritten manuscripts. Other possible applications are building digital libraries, recognizing text on digitized maps, recognizing vehicle license plates, serving as the first phase in text readers for visually impaired people, and understanding filled forms. This research work aims to contribute to current research in the field of optical character recognition (OCR) of printed Arabic text by developing novel techniques and schemes to advance the performance of state-of-the-art Arabic OCR systems. A statistical analysis of Arabic text was carried out to estimate the probabilities of occurrence of Arabic characters for use with Hidden Markov Models (HMMs) and other techniques. Since there is no publicly available dataset of printed Arabic text for recognition purposes, it was decided to create one. In addition, a minimal Arabic script is proposed. The proposed script contains all basic shapes of Arabic letters and provides an efficient representation of Arabic text in terms of effort and time. Based on the success of HMMs in speech and text recognition, the use of HMMs for the automatic recognition of Arabic text was investigated. The HMM technique adapts to noise and font variations and does not require word or character segmentation of Arabic line images. In the feature extraction phase, experiments were conducted with a number of different features to investigate their suitability for HMMs. Finally, a novel set of features, which resulted in high recognition rates for different fonts, was selected.
The developed techniques do not need word or character segmentation before the classification phase, as segmentation is a byproduct of recognition. This seems to be the most advantageous feature of using HMMs for Arabic text, as segmentation tends to produce errors which are usually propagated to the classification phase. Eight different Arabic fonts were used in the classification phase. The recognition rates ranged from 98% to 99.9% depending on the font. As far as we know, these are new results in their context. Moreover, the proposed technique could be used for other languages. A proof-of-concept experiment was conducted on English characters with a recognition rate of 98.9% using the same HMM setup. The same techniques were applied to Bangla characters with a recognition rate above 95%. Moreover, the recognition of printed Arabic text with multiple fonts was also conducted using the same technique. Fonts were categorized into different groups, and new high recognition results were achieved. To enhance the recognition rate further, a post-processing module was developed to correct the OCR output through character-level and word-level post-processing. The use of this module increased the recognition rate by more than 1%.
|
157 |
Construção e aplicação de HMMs de perfil para a detecção e classificação de vírus / Construction and application of profile HMMs for the specific detection and classification of viruses
Guimarães, Miriã Nunes 22 February 2019
Os vírus são as entidades biológicas mais abundantes encontradas na natureza. O método clássico de estudo dos vírus requerem seu isolamento e propagação in vitro. Contudo, necessita-se ter um conhecimento prévio sobre as condições necessárias para seu cultivo em células, sendo assim a maior parte dos vírus existentes não é conhecida. Análises metagenômicas são uma alternativa para a detecção e caracterização de novos vírus, uma vez que não requerem um cultivo prévio e as amostras podem conter material genético de múltiplos organismos. Uma vez obtidas as sequências montadas a partir das leituras metagenômicas, o método mais utilizado para a identificação e classificação dos organismos é a busca de similaridade com o programa BLAST contra bancos de sequências conhecidas. Contudo, métodos de alinhamento pareado são capazes de identificar apenas sequências com identidade superior a 20-30%. Uma alternativa a essa limitação é o uso de métodos baseados no uso de perfis, que podem aumentar a sensibilidade de detecção de homólogos filogeneticamente distantes. HMMs de perfil são modelos probabilísticos capazes de representar a diversidade de caracteres em posições-específicas de um alinhamento de múltiplas sequências. Nosso grupo desenvolveu a ferramenta TABAJARA, utilizada neste projeto, para a identificação de blocos que podem ser conservados em todas as sequências do alinhamento ou discriminativos entre grupos de sequências. Esses blocos são utilizados para a geração de HMMs de perfil, os quais podem ser usados, no contexto da virologia, para a identificação de grupos taxonômicos amplos como famílias virais ou, ainda, taxa mais restritos como gêneros ou mesmo espécies de vírus. O presente projeto teve como objetivos aplicar e otimizar o programa TABAJARA em diferentes grupos taxonômicos de vírus, construir modelos específicos para cada um desses grupos e validar esses modelos em dados metagenômicos. 
O primeiro modelo de estudo escolhido foi a ordem Bunyavirales, composta de vírus de ssRNA (-) majoritariamente envelopados e esféricos, com genoma segmentado e pertencentes ao grupo 5 da classificação de Baltimore. Este grupo inclui vírus causadores de várias doenças em humanos, animais e plantas. O segundo modelo de estudo escolhido foi a família Togaviridae, composta de vírus de ssRNA (+) envelopados e esféricos, cujo genoma expressa uma poliproteína e pertencem ao grupo 4 da classificação de Baltimore. Este grupo inclui o vírus Chikungunya e outras espécies que causam diversas patologias ao homem. O terceiro modelo de estudo escolhido foi a subfamília Spounavirinae, compreendendo bacteriófagos que infectam vários hospedeiros bacterianos e em alguns casos possuem potencial terapêutico comprovado contra infecções bacterianas que afetam o homem. Estes fagos apresentam partículas virais com estrutura cabeça-cauda, não são envelopados, apresentam genoma de dsDNA e pertencem ao grupo 1 da classificação de Baltimore. Todos os modelos construídos foram validados quanto à sensibilidade e especificidade de detecção e, ao final, foram utilizados em análises de prospecção de vírus em dados metagenômicos obtidos na base SRA do NCBI. Os HMMs de perfil apresentaram excelente desempenho, comprovando a viabilidade da metodologia proposta neste projeto. Os resultados apresentados neste trabalho abrem a perspectiva da ampla utilização de HMMs de perfil como ferramentas universais para a detecção e classificação de vírus em dados metagenômicos. / Viruses are the most abundant biological entities found in nature. Most of the information that can be obtained about these organisms requires in vitro viral isolation and cultivation. However, most existing viruses are still unknown because the biological requirements for their successful propagation have not yet been identified.
Metagenomic analyses offer an interesting alternative for the detection and characterization of novel viruses, since previous cultivation is not required, and the samples may contain genetic material of multiple organisms. Once assembled sequences are obtained from individual reads, the most widely used method for viral identification and classification is the use of BLAST similarity searches against databases of known sequences. However, pairwise alignment methods are only able to identify sequences that present identity greater than 20-30%. Profile-based methods may increase the sensitivity of detection of remote homologues. Profile HMMs are probabilistic models capable of representing the diversity of amino acid residues at specific positions of a multiple sequence alignment. Our group is developing TABAJARA, a tool for the identification of alignment blocks that are conserved across all sequences of the alignment or discriminative between groups of sequences. These blocks are used to generate profile HMMs, which can in turn be used, in the context of virology, to identify broad taxonomic groups, such as viral families, or narrower taxa as genera or viral species. The present project aimed to apply and standardize the use of TABAJARA in different taxonomic groups of viruses, to build specific models for each of these groups and to validate these models in metagenomic data. We used three viral models for this study. The first chosen model was the Bunyavirales order, composed of mostly enveloped and spherical ssRNA(-) viruses with a segmented genome belonging to group 5 of the Baltimore classification. This group includes viruses that cause several important diseases in humans, animals and plants. The second chosen model was the Togaviridae family, composed of enveloped and spherical ssRNA(+) viruses, with a genome coding for a polyprotein, and belonging to group 4 of the Baltimore classification. 
This group includes the Chikungunya virus and some other viral species that cause relevant pathologies to humans and animals. Finally, we used the Spounavirinae subfamily, comprising viruses that infect a variety of bacterial hosts and that can potentially be used for phage therapy of some human bacterial diseases. These phages present non-enveloped virions with a head-to-tail structure, a dsDNA genome, and belong to group 1 of the Baltimore classification. All constructed profile HMMs were evaluated in regard to their sensitivity and specificity of detection, as well as tested in viral surveys using metagenomic data from the SRA database. The profile HMMs presented excellent performance, proving the viability of the methodology proposed in this project. The results presented in this work open the perspective of the wide use of profile HMMs as universal tools for the detection and classification of viruses in metagenomic data.
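The idea of a profile that captures position-specific residue diversity in an alignment block can be illustrated with a deliberately simplified sketch. It models match positions only, with no insert or delete states, so it is closer to a position-specific scoring matrix than to a full profile HMM, and it is neither TABAJARA nor HMMER. The function names, the pseudocount, the uniform background, and the toy alignment block are all assumptions for illustration; sequences are assumed to contain only the 20 standard amino acids.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def build_profile(block, pseudocount=1.0):
    """Per-column residue probabilities from an ungapped alignment block."""
    profile = []
    for pos in range(len(block[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in block:
            counts[seq[pos]] += 1.0
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile


def log_odds(profile, window):
    """Log-odds of a window against a uniform background distribution."""
    background = 1.0 / len(ALPHABET)
    return sum(math.log(col[aa] / background)
               for col, aa in zip(profile, window))


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    return max((log_odds(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Toy conserved block (made up) and a query containing one matching region.
block = ["GDDLK", "GDDIK", "GDELK"]
profile = build_profile(block)
score, start = best_hit(profile, "MSTAGDDLKWQ")
print(start, round(score, 2))
```

Because the profile scores each column separately, a divergent sequence that keeps only the conserved positions can still score well, which is the sensitivity gain over pairwise alignment that the abstract refers to.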
|
158 |
Contributions à la localisation intra-muros. De la modélisation à la calibration théorique et pratique d'estimateurs / Contributions to indoor localisation. From modelling to the theoretical and practical calibration of estimators
Dumont, Thierry 13 December 2012
Préfigurant la prochaine grande étape dans le domaine de la navigation, la géolocalisation intra-muros est un domaine de recherche très actif depuis quelques années. Alors que la géolocalisation est entrée dans le quotidien de nombreux professionnels et particuliers avec, notamment, le guidage routier assisté, les besoins d'étendre les applications à l'intérieur se font de plus en plus pressants. Cependant, les systèmes existants se heurtent à des contraintes techniques bien supérieures à celles rencontrées à l'extérieur, la faute, notamment, à la propagation chaotique des ondes électromagnétiques dans les environnements confinés et inhomogènes. Nous proposons dans ce manuscrit une approche statistique du problème de géolocalisation d'un mobile à l'intérieur d'un bâtiment utilisant les ondes WiFi environnantes. Ce manuscrit s'articule autour de deux questions centrales : celle de la détermination des cartes de propagation des ondes WiFi dans un bâtiment donné et celle de la construction d'estimateurs des positions du mobile à l'aide de ces cartes de propagation. Le cadre statistique utilisé dans cette thèse afin de répondre à ces questions est celui des modèles de Markov cachés. Nous proposons notamment, dans un cadre paramétrique, une méthode d'inférence permettant l'estimation en ligne des cartes de propagation, sur la base des informations relevées par le mobile. Dans un cadre non-paramétrique, nous avons étudié la possibilité d'estimer les cartes de propagation considérées comme simple fonction régulière sur l'environnement à géolocaliser. Nos résultats sur l'estimation non paramétrique dans les modèles de Markov cachés permettent d'exhiber un estimateur des fonctions de propagation dont la consistance est établie dans un cadre général. La dernière partie du manuscrit porte sur l'estimation de l'arbre de contextes dans les modèles de Markov cachés à longueur variable. 
/ Foreshadowing the next big step in the field of navigation, indoor geolocation has been a very active field of research in the last few years. While geolocation has entered the daily life of many individuals and professionals, particularly through assisted road navigation systems, the need to extend its applications to the inside of buildings is becoming more and more pressing. However, existing systems face far greater technical constraints than those encountered outdoors, owing in particular to the chaotic propagation of electromagnetic waves in confined and inhomogeneous environments. In this manuscript, we propose a statistical approach to the problem of geolocating a mobile device inside a building using the surrounding WiFi waves. This manuscript focuses on two central issues: the determination of WiFi wave propagation maps inside a given building and the construction of estimators of the mobile's positions using these propagation maps. The statistical framework used in this thesis to answer these questions is that of hidden Markov models. We propose, in a parametric framework, an inference method for the online estimation of the propagation maps on the basis of the information reported by the mobile. In a nonparametric framework, we investigated the possibility of estimating the propagation maps considered simply as regular functions on the environment to be geolocated. Our results on nonparametric estimation in hidden Markov models make it possible to produce an estimator of the propagation functions whose consistency is established in a general framework. The last part of the manuscript deals with the estimation of the context tree in variable length hidden Markov models.
|
159 |
Contribution à la reconnaissance non-intrusive d'activités humaines / Contribution to the non-intrusive recognition of human activities
Trabelsi, Dorra 25 June 2013
La reconnaissance d’activités humaines est un sujet de recherche d’actualité comme en témoignent les nombreux travaux de recherche sur le sujet. Dans ce cadre, la reconnaissance des activités physiques humaines est un domaine émergent avec de nombreuses retombées attendues dans la gestion de l’état de santé des personnes et de certaines maladies, les systèmes de rééducation, etc. Cette thèse vise la proposition d’une approche pour la reconnaissance automatique et non-intrusive d’activités physiques quotidiennes, à travers des capteurs inertiels de type accéléromètres, placés au niveau de certains points clés du corps humain. Les approches de reconnaissance d’activités physiques étudiées dans cette thèse, sont catégorisées en deux parties : la première traite des approches supervisées et la seconde étudie les approches non-supervisées. L’accent est mis plus particulièrement sur les approches non-supervisées ne nécessitant aucune labellisation des données. Ainsi, nous proposons une approche probabiliste pour la modélisation des séries temporelles associées aux données accélérométriques, basée sur un modèle de régression dynamique régi par une chaine de Markov cachée. En considérant les séquences d’accélérations issues de plusieurs capteurs comme des séries temporelles multidimensionnelles, la reconnaissance d’activités humaines se ramène à un problème de segmentation jointe de séries temporelles multidimensionnelles où chaque segment est associé à une activité. L’approche proposée prend en compte l’aspect séquentiel et l’évolution temporelle des données. Les résultats obtenus montrent clairement la supériorité de l’approche proposée par rapport aux autres approches en termes de précision de classification aussi bien des activités statiques et dynamiques, que des transitions entre activités. / Human activity recognition is currently a challenging research topic, as witnessed by the extensive research that has recently been conducted on this subject.
In this context, recognition of physical human activities is an emerging domain with expected impacts on the monitoring of certain pathologies and of people's health status, rehabilitation procedures, etc. In this thesis, we propose a new approach for the automatic recognition of human activity from raw acceleration data measured using wearable inertial sensors placed at key points of the human body. The approaches studied in this thesis fall into two categories: the first covers supervised approaches, while the second covers unsupervised ones. The proposed unsupervised approach is based upon joint segmentation of multidimensional time series using a Hidden Markov Model (HMM) in a multiple regression context, where each segment is associated with an activity. The model is learned in an unsupervised framework where no activity labels are needed. The proposed approach takes into account the sequential appearance and temporal evolution of the data. The results clearly show the superiority of the proposed approach over other approaches in terms of classification accuracy for static, dynamic and transitional human activities.
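Segmenting a time series so that each segment maps to one activity can be sketched with Viterbi decoding. This is a simplified stand-in, not the thesis's model: it uses a one-dimensional series, fixed Gaussian emission means instead of learned regression parameters, and a single "stay" probability for all states. The function name, parameter values, and example data are assumptions for illustration.

```python
import math


def viterbi_segment(series, means, std=1.0, stay_prob=0.95):
    """Most likely state sequence for a 1-D series under a Gaussian HMM.

    means: one emission mean per state (needs at least two states).
    """
    n = len(means)
    stay = math.log(stay_prob)
    switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_emit(x, k):
        # Gaussian log-density up to a constant shared by all states.
        return -0.5 * ((x - means[k]) / std) ** 2

    score = [math.log(1.0 / n) + log_emit(series[0], k) for k in range(n)]
    back = []
    for x in series[1:]:
        pointers, new_score = [], []
        for k in range(n):
            # Best predecessor state for landing in state k at this step.
            prev = max(range(n),
                       key=lambda j: score[j] + (stay if j == k else switch))
            pointers.append(prev)
            new_score.append(score[prev]
                             + (stay if prev == k else switch)
                             + log_emit(x, k))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up acceleration magnitudes: "rest" near 0, "movement" near 5.
series = [0.1, -0.2, 0.0, 0.3, 5.1, 4.8, 5.2, 4.9, 0.2, -0.1]
print(viterbi_segment(series, means=[0.0, 5.0]))
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

The high stay probability penalizes rapid switching, so isolated noisy samples do not split a segment; in an unsupervised setting the means and transition probabilities would be estimated (for example by EM) rather than fixed.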
|
160 |
EM algorithm for Markov chains observed via Gaussian noise and point process information: Theory and case studies
Damian, Camilla, Eksi-Altay, Zehra, Frey, Rüdiger January 2018
In this paper we study parameter estimation via the Expectation Maximization (EM) algorithm for a continuous-time hidden Markov model with diffusion and point process observation. Inference problems of this type arise for instance in credit risk modelling. A key step in the application of the EM algorithm is the derivation of finite-dimensional filters for the quantities that are needed in the E-Step of the algorithm. In this context we obtain exact, unnormalized and robust filters, and we discuss their numerical implementation. Moreover, we propose several goodness-of-fit tests for hidden Markov models with Gaussian noise and point process observation. We run an extensive simulation study to test speed and accuracy of our methodology. The paper closes with an application to credit risk: we estimate the parameters of a hidden Markov model for credit quality where the observations consist of rating transitions and credit spreads for US corporations.
|