21 |
Improving the performance of Hierarchical Hidden Markov Models on Information Extraction tasks
Chou, Lin-Yi January 2006
This thesis presents novel methods for creating and improving hierarchical hidden Markov models. The work centers on transforming a traditional tree-structured hierarchical hidden Markov model (HHMM) into an equivalent model that reuses repeated sub-trees. This process temporarily breaks the tree structure constraint in order to leverage the benefits of combining repeated sub-trees. These benefits include a lower cost of testing and increased accuracy of the final model, thus giving the model greater performance. The result is called a merged and simplified hierarchical hidden Markov model (MSHHMM). The thesis goes on to detail four techniques for improving the performance of MSHHMMs when applied to information extraction tasks, in terms of accuracy and computational cost. Briefly, these techniques are: a new formula for calculating the approximate probability of previously unseen events; pattern generalisation to transform observations, thus increasing testing speed and prediction accuracy; restructuring states to focus on state transitions; and an automated flattening technique for reducing the complexity of HHMMs. The basic model and four improvements are evaluated by applying them to the well-known information extraction tasks of Reference Tagging and Text Chunking. In both tasks, MSHHMMs show consistently good performance across varying sizes of training data. In the case of Reference Tagging, the accuracy of the MSHHMM is comparable to other methods. However, when the volume of training data is limited, MSHHMMs maintain high accuracy whereas other methods show a significant decrease. These accuracy gains were achieved without any significant increase in processing time. For the Text Chunking task the accuracy of the MSHHMM was again comparable to other methods. However, the other methods incurred much higher processing delays compared to the MSHHMM. The results of these practical experiments demonstrate the benefits of the new method: increased accuracy, lower computational cost, and better performance.
|
22 |
Evaluation of evidence for autocorrelated data, with an example relating to traces of cocaine on banknotes
Wilson, Amy Louise January 2014
Much research in recent years for evidence evaluation in forensic science has focussed on methods for determining the likelihood ratio in various scenarios. One proposition concerning the evidence is put forward by the prosecution and another is put forward by the defence. The likelihood of each of these two propositions is calculated, given the evidence. The likelihood ratio, or value of the evidence, is then given by the ratio of the likelihoods associated with these two propositions. The aim of this research is twofold. Firstly, it is intended to provide methodology for the evaluation of the likelihood ratio for continuous autocorrelated data. The likelihood ratio is evaluated for two such scenarios. The first is when the evidence consists of data which are autocorrelated at lag one. The second, an extension to this, is when the observed evidential data are also believed to be driven by an underlying latent Markov chain. Two models have been developed to take these attributes into account, an autoregressive model of order one and a hidden Markov model, which does not assume independence of adjacent data points conditional on the hidden states. A nonparametric model which does not make a parametric assumption about the data and which accounts for lag one autocorrelation is also developed. The performance of these three models is compared to the performance of a model which assumes independence of the data. The second aim of the research is to develop models to evaluate evidence relating to traces of cocaine on banknotes, as measured by the log peak area of the ion count for cocaine product ion m/z 105, obtained using tandem mass spectrometry. Here, the prosecution proposition is that the banknotes are associated with a person who is involved with criminal activity relating to cocaine and the defence proposition is the converse, which is that the banknotes are associated with a person who is not involved with criminal activity relating to cocaine. 
Two data sets are available, one of banknotes seized in criminal investigations and associated with crime involving cocaine, and one of banknotes from general circulation. Previous methods for the evaluation of this evidence were concerned with the percentage of banknotes contaminated or assumed independence of measurements of quantities of cocaine on adjacent banknotes. It is known that nearly all banknotes have traces of cocaine on them and it was found that there was autocorrelation within samples of banknotes, so these methods are not appropriate. The models developed for autocorrelated data are applied to evidence relating to traces of cocaine on banknotes; the results obtained for each of the models are compared using rates of misleading evidence, Tippett plots and scatter plots. It is found that the hidden Markov model is the best choice for the modelling of cocaine traces on banknotes because it has the lowest rate of misleading evidence and it also results in likelihood ratios which are large enough to give support to the prosecution proposition for some samples of banknotes seized from crime scenes. Comparison of the results obtained for models which take autocorrelation into account with the results obtained from the model which assumes independence indicates that not accounting for autocorrelation can result in the overstating of the likelihood ratio.
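To make the lag-one idea concrete, the following is a minimal sketch of a log likelihood ratio for a sequence of measurements under two Gaussian autoregressive models of order one, one per proposition. It is an illustration only: the function names, the Gaussian assumption and all parameter values are hypothetical and are not taken from the thesis. Setting `phi` to 0 recovers a model that treats the measurements as independent.

```python
import math


def normal_logpdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loglik(xs, mu, phi, sigma2):
    """Log-likelihood of xs under a stationary Gaussian AR(1) model.

    Model: x_t - mu = phi * (x_{t-1} - mu) + e_t, with e_t ~ N(0, sigma2)
    and |phi| < 1. xs must contain at least one value.
    """
    # The first point is scored with the stationary distribution.
    ll = normal_logpdf(xs[0], mu, sigma2 / (1 - phi ** 2))
    # Every later point is scored conditionally on its predecessor.
    for prev, cur in zip(xs, xs[1:]):
        ll += normal_logpdf(cur, mu + phi * (prev - mu), sigma2)
    return ll


def log_lr(xs, hp, hd):
    """Log likelihood ratio: prosecution model hp against defence model hd."""
    return ar1_loglik(xs, **hp) - ar1_loglik(xs, **hd)


# Hypothetical parameters, for illustration only.
hp = {"mu": 6.0, "phi": 0.5, "sigma2": 0.3}
hd = {"mu": 5.0, "phi": 0.5, "sigma2": 0.3}
print(log_lr([5.9, 6.2, 6.1, 5.8], hp, hd))
```

A positive value supports the prosecution proposition and a negative value supports the defence proposition; repeating the calculation with `phi` set to 0 in both models shows how much the value changes when autocorrelation is ignored.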
|
23 |
Automated protein-family classification based on hidden Markov models
Frisk, Christoffer January 2015
The aim of the project presented in this paper was to investigate the possibility of automatically sub-classifying the superfamily of Short-chain Dehydrogenase/Reductases (SDR). This was done based on an algorithm previously designed to sub-classify the superfamily of Medium-chain Dehydrogenase/Reductases (MDR). While the SDR family is interesting and important to sub-classify, there was also a focus on making the process as automatic as possible so that future families can also be classified using the same methods. To validate the generated results, they were compared to previous sub-classifications of the SDR family. The results proved promising, and the work conducted here can be seen as a good initial part of a more comprehensive investigation.
|
24 |
Secure Telemetry: Attacks and Counter Measures on iNET
Odesanmi, Abiola, Moten, Daryl 10 1900
ITC/USA 2011 Conference Proceedings / The Forty-Seventh Annual International Telemetering Conference and Technical Exhibition / October 24-27, 2011 / Bally's Las Vegas, Las Vegas, Nevada / iNet is a project aimed at improving and modernizing telemetry systems by moving from a link to a networking solution. These changes introduce new risks and vulnerabilities. The nature of the security of the telemetry system changes when the elements are in an Ethernet and TCP/IP network configuration. The network will require protection from intrusion and malware that can be initiated from inside or outside the network boundary. In this paper we discuss how to detect and counter FTP password attacks using the Hidden Markov Model for intrusion detection. We intend to discover and expose the more subtle iNet network vulnerabilities and make recommendations for a more secure telemetry environment.
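One common way to use a hidden Markov model for intrusion detection is to score an observed event sequence under a model of normal behaviour and flag sequences with unusually low likelihood. The sketch below shows this with the scaled forward algorithm. It is not the authors' implementation: the two hidden states, the two event symbols (0 for a successful login, 1 for a failed login) and every probability are assumptions made up for illustration.

```python
import math


def forward_loglik(obs, start, trans, emit):
    """Log-probability of a non-empty symbol sequence under a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    Uses the forward algorithm with per-step rescaling to avoid underflow.
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


# Hypothetical model of normal FTP login behaviour (all values assumed).
start = [0.9, 0.1]
trans = [[0.9, 0.1], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]  # columns: 0 = login ok, 1 = login failed

normal = [0, 0, 1, 0, 0]
suspect = [1, 1, 1, 1, 1]
print(forward_loglik(normal, start, trans, emit))
print(forward_loglik(suspect, start, trans, emit))
```

In such a scheme the alert threshold on the log-likelihood would have to be chosen from traffic known to be benign; no threshold is suggested here.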
|
25 |
Evidence Combination in Hidden Markov Models for Gene Prediction
Brejova, Bronislava January 2005
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species. We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to use partial statements allows us to handle missing information transparently and to cope with the heterogeneous character of individual sources of evidence. On the other hand, this feature makes the combination step more difficult. We present a new method for combining partial probabilistic statements and prove that it is an extension of existing methods for combining complete probability statements. We evaluate the performance of our system and its individual components on data from the human and fruit fly genomes. The use of sequence evolutionary conservation as a source of evidence in gene finding requires efficient and sensitive tools for finding similar regions in very long sequences. We present a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties. In particular, we build a hidden Markov model representing a typical homology between two protein coding regions and then use this model to optimize a component of a heuristic algorithm called a spaced seed. 
The seeds that we discover significantly improve the accuracy and running time of similarity search in protein coding regions, and are directly applicable to our gene finder.
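For readers unfamiliar with spaced seeds, the sketch below shows the basic mechanism: a seed is a pattern of "care" and "don't care" positions, and two sequences produce a hit wherever they agree at every "care" position. The seed `110110` and the sequences are hypothetical examples chosen for illustration; they are not seeds found in the thesis, and the brute-force scan is not how a production similarity search would index hits.

```python
def seed_hits(a, b, seed):
    """Return all (i, j) where a[i:] and b[j:] agree at every '1' of seed.

    seed is a string of '1' (position must match) and '0' (ignored).
    This is a quadratic scan intended only to show the matching rule.
    """
    care = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = []
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            if all(a[i + k] == b[j + k] for k in care):
                hits.append((i, j))
    return hits


# Two hypothetical sequences that differ only at every third position.
a = "ACGACGTTT"
b = "ACTACATTT"
print(seed_hits(a, b, "110110"))  # spaced seed ignoring every third base
print(seed_hits(a, b, "111111"))  # contiguous seed of the same span
```

Here the spaced seed reports hits at (0, 0) and (3, 3) while the contiguous seed reports none, which illustrates why a seed that skips positions where mismatches are expected can be more sensitive.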
|
26 |
A Bayesian hierarchical nonhomogeneous hidden Markov model for multisite streamflow reconstructions
Bracken, C., Rajagopalan, B., Woodhouse, C. 10 1900
In many complex water supply systems, the next generation of water resources planning models will require simultaneous probabilistic streamflow inputs at multiple locations on an interconnected network. To make use of the valuable multicentury records provided by tree-ring data, reconstruction models must be able to produce appropriate multisite inputs. Existing streamflow reconstruction models typically focus on one site at a time, not addressing intersite dependencies and potentially misrepresenting uncertainty. To this end, we develop a model for multisite streamflow reconstruction with the ability to capture intersite correlations. The proposed model is a hierarchical Bayesian nonhomogeneous hidden Markov model (NHMM). A NHMM is fit to contemporary streamflow at each location using lognormal component distributions. Leading principal components of tree rings are used as covariates to model nonstationary transition probabilities and the parameters of the lognormal component distributions. Spatial dependence between sites is captured with a Gaussian elliptical copula. Parameters of the model are estimated in a fully Bayesian framework, in that marginal posterior distributions of all the parameters are obtained. The model is applied to reconstruct flows at 20 sites in the Upper Colorado River Basin (UCRB) from 1473 to 1906. Many previous reconstructions are available for this basin, making it ideal for testing this new method. The results show some improvements over regression-based methods in terms of validation statistics. Key advantages of the Bayesian NHMM over traditional approaches are a dynamic representation of uncertainty and the ability to make long multisite simulations that capture at-site statistics and spatial correlations between sites.
|
27 |
Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification
Baribault, Carl 20 December 2009
Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with individual base labeling (as exon, intron, or junk) to substrings of primitive hidden states or footprint states. Moreover, the allowed footprint transitions are restricted to those that include either one c/nc transition or none at all. (This effectively imposes a minimum length on exons and the other regions.) These footprint states allow the c/nc transitions to be seen sooner and have their contributions to the gene-structure identification weighted more heavily – yet contributing as such with a natural weighting determined by the HMM model itself according to the training data – rather than via introducing an artificial gain-parameter tuning on major transitions. The selection of the generalized HMM model is interpolated to highest Markov order on emission probabilities, and to highest Markov order (subsequence length) on the footprint states. The former is accomplished via simple count cutoff rules, the latter via an identification of anomalous base statistics near the major transitions using Shannon entropy. Preliminary indications, from applications to the C. elegans genome, are that the sensitivity/specificity (SN/SP) result for both the individual state and full exon predictions are greatly enhanced using the generalized-clique HMM when compared to the standard HMM. Here the standard HMM is represented by the choice of the smallest size of footprint state in the generalized-clique HMM. Even with these improvements, we observe that both extremely long and short exon and intron segments would go undetected without an explicit model of the duration of state. 
The key contributions of this effort are the full derivation and experimental confirmation of a rudimentary, yet powerful and competitive gene finding method based on a higher order hidden Markov model. With suitable extensions, this method is expected to provide superior gene finding capability – not only in the context of pre-conditioned data sets as in the evaluations cited but also in the wider context of less preconditioned and/or raw genomic data.
|
28 |
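Comparison of statistical models for estimating the time interval between exceedances of a temperature threshold in the city of P. Prudente-SP (original title: Comparação de modelos estatísticos para estimação do intervalo de tempos entre ultrapasses de um limiar de temperatura na cidade de P. Prudente-SP)
Alvaro, Maria Magdalena Kcala January 2019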
Advisor: Mário Hissamitsu Tarumoto / Abstract: The observation of natural phenomena such as temperature changes is common today, and several studies have sought to predict their occurrence from what has happened in the past. In studies of this kind, where data are collected continuously, whether hourly or daily, the observations are not independent. Possible forms of analysis include time-series techniques and extreme value theory. However, one of the objectives of this study is to construct a transition matrix from which we can determine, for example, the probability of a high temperature tomorrow given that a high temperature was observed today. To obtain this result, one possibility is to construct a model based on dependent data that follow a Markov process, under the assumption that dependence extends only to the previous day. In this work, we intend to build this model and apply it to temperature data from the city of Presidente Prudente-SP for the period from January 2011 to December 2016. We then compare the Markovian model defined from the bivariate Weibull distribution of Marshall and Olkin with other Markovian models defined from copula functions. / Degree: Master's
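The transition matrix described in the abstract can be illustrated with a simple empirical estimate: label each day as above or below a threshold and count how often each state follows each other state. This counting estimator is a stand-in for illustration and is not the bivariate Weibull or copula-based model of the dissertation; the threshold and the temperatures below are invented.

```python
def transition_matrix(temps, threshold):
    """Estimate a two-state first-order Markov transition matrix.

    State 1 means the temperature exceeds the threshold, state 0 means it
    does not. matrix[i][j] is the estimated probability of state j tomorrow
    given state i today. Rows with no observations are left as zeros.
    """
    states = [1 if t > threshold else 0 for t in temps]
    counts = [[0, 0], [0, 0]]
    for today, tomorrow in zip(states, states[1:]):
        counts[today][tomorrow] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical daily maximum temperatures in degrees Celsius.
temps = [30, 33, 34, 29, 35, 36, 31]
m = transition_matrix(temps, threshold=32)
print(m[1][1])  # estimated P(high tomorrow | high today)
```

With these invented values the estimate of a high temperature tomorrow given a high temperature today is 0.5, because two of the four high days are followed by another high day.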
|
29 |
Formalizing life: Towards an improved understanding of the sequence-structure relationship in alpha-helical transmembrane proteins
Viklund, Håkan January 2007
Genes coding for alpha-helical transmembrane proteins constitute roughly 25% of the total number of genes in a typical organism. As these proteins are vital parts of many biological processes, an improved understanding of them is important for achieving a better understanding of the mechanisms that constitute life. All proteins consist of an amino acid sequence that folds into a three-dimensional structure in order to perform its biological function. The work presented in this thesis is directed towards improving the understanding of the relationship between sequence and structure for alpha-helical transmembrane proteins. Specifically, five original methods for predicting the topology of alpha-helical transmembrane proteins have been developed: PRO-TMHMM, PRODIV-TMHMM, OCTOPUS, Toppred III and SCAMPI. A general conclusion from these studies is that approaches that use multiple sequence information achieve the best prediction accuracy. Further, the properties of reentrant regions have been studied, both with respect to sequence and structure. One result of this study is an improved definition of the topological grammar of transmembrane proteins, which is used in OCTOPUS and shown to further improve topology prediction. Finally, Z-coordinates, an alternative system for representing topological information for transmembrane proteins based on distance to the membrane center, has been introduced, and a method for predicting Z-coordinates from amino acid sequence, Z-PRED, has been developed.
|