  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification

Baribault, Carl 20 December 2009
Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with individual base labeling (as exon, intron, or junk) to substrings of primitive hidden states, or footprint states. Moreover, the allowed footprint transitions are restricted to those that include either one c/nc transition or none at all. (This effectively imposes a minimum length on exons and the other regions.) These footprint states allow the c/nc transitions to be seen sooner and have their contributions to the gene-structure identification weighted more heavily. The weighting is a natural one, determined by the HMM itself according to the training data, rather than by introducing an artificial gain parameter tuned on major transitions. The selection of the generalized HMM is interpolated to highest Markov order on emission probabilities, and to highest Markov order (subsequence length) on the footprint states. The former is accomplished via simple count cutoff rules, the latter via an identification of anomalous base statistics near the major transitions using Shannon entropy. Preliminary indications, from applications to the C. elegans genome, are that the sensitivity/specificity (SN/SP) results for both the individual state and full exon predictions are greatly enhanced using the generalized-clique HMM when compared to the standard HMM. Here the standard HMM is represented by the choice of the smallest size of footprint state in the generalized-clique HMM. Even with these improvements, we observe that both extremely long and extremely short exon and intron segments would go undetected without an explicit model of state duration.
The key contributions of this effort are the full derivation and experimental confirmation of a rudimentary, yet powerful and competitive gene finding method based on a higher order hidden Markov model. With suitable extensions, this method is expected to provide superior gene finding capability – not only in the context of pre-conditioned data sets as in the evaluations cited but also in the wider context of less preconditioned and/or raw genomic data.
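The footprint-state restriction described in the abstract can be illustrated with a small enumeration. This is only a sketch under stated assumptions, not the thesis's implementation: the single-letter labels `e`, `i`, `j` (exon, intron, junk), the function names, and the rule that the only permitted label change inside a footprint is one between exon and a non-exon label are all illustrative choices.

```python
from itertools import product

# Hypothetical single-letter labels for the primitive states:
# e = exon (coding), i = intron, j = junk (both non-coding).
LABELS = "eij"


def is_cnc_change(a, b):
    """True if moving from label a to label b crosses the coding/non-coding boundary."""
    return (a == "e") != (b == "e")


def footprints(k):
    """Enumerate length-k label strings containing at most one c/nc transition.

    Assumption: a footprint may contain no label change at all, or exactly
    one change, and that change must be between exon and a non-exon label.
    """
    allowed = []
    for labels in product(LABELS, repeat=k):
        changes = [(a, b) for a, b in zip(labels, labels[1:]) if a != b]
        if not changes or (len(changes) == 1 and is_cnc_change(*changes[0])):
            allowed.append("".join(labels))
    return allowed


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, len(footprints(k)))
```

Under these assumptions the number of footprint states grows linearly with the footprint length (3 constant strings plus 4 per possible change position), which is what keeps a larger footprint tractable compared with the full set of 3^k label strings.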
2

Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Alexsson, Andrei January 2011
This thesis provides an in-depth analysis of expressed sequence tags (ESTs), which represent pieces of eukaryotic mRNA, using an unsupervised hidden Markov model (HMM). ESTs are short nucleotide sequences that are used primarily for rapid identification of new genes with potential coding regions (CDS). ESTs are made by sequencing double-stranded cDNA, and the synthesized ESTs are stored in digital form, usually in FASTA format. Since sequencing is often randomized and parts of mRNA contain non-coding regions, some ESTs will not represent CDS. It is desirable to remove these unwanted ESTs if the purpose is to identify genes associated with CDS. Application of a stochastic HMM allows identification of the region contents of an EST. Software such as ESTScan uses an HMM that is trained by supervised learning with annotated data. However, because annotated data are not always at hand, this thesis focuses on the ability to train an HMM with unsupervised learning on data containing ESTs, both with and without CDS. The data used for training are not annotated, i.e. the regions that an EST consists of are unknown. In this thesis a new HMM is introduced whose parameters are chosen to be reasonably consistent with biologically important regions of an mRNA, such as the Kozak sequence, poly(A)-signals and poly(A)-tails, in order to guide training and decoding of ESTs to the proper states in the HMM. Transition probabilities in the HMM have been adapted so that they represent the mean length and distribution of the different regions in mRNA. The HMM's specificity and sensitivity have been tested via BLAST, by running BLAST on each EST and comparing the BLAST results with the HMM prediction results. A regression analysis shows that the length of the ESTs used when training the HMM is significant: the longer, the better.
The final results show that it is possible to train an HMM with unsupervised machine learning, but to be comparable to supervised machine learning such as ESTScan, further expansion of the HMM is necessary, such as frame-shift correction of ESTs by improving the HMM's ability to choose correctly positioned start codons or nucleotides. False positive results are usually caused by incorrectly positioned start codons leading to CDS lengths that are too short. Since no frame-shift correction is implemented, short predicted CDS lengths are not acceptable and are hence not counted as coding regions during prediction. However, when supervised models are lacking, an unsupervised HMM is a potential replacement with stable performance that can be adapted to any eukaryotic organism.
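Two ideas from this abstract, tying transition probabilities to mean region length and decoding an EST into regions, can be sketched with a toy model. Everything concrete here is an assumption made for illustration: the two states `utr` and `cds`, the mean lengths, the emission probabilities, and the example sequence are hypothetical and are not taken from the thesis. The sketch relies on the standard fact that a state with self-transition probability p has a geometric duration with mean 1/(1 - p), and it uses Viterbi decoding in log space.

```python
import math


def self_transition(mean_length):
    """Self-transition probability whose geometric duration has this mean."""
    return 1.0 - 1.0 / mean_length


def viterbi(seq, states, start, trans, emit):
    """Return the most likely state path for a non-empty seq (log-space Viterbi)."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # one back-pointer table per position after the first
    for ch in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][ch])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical toy model: an AT-rich untranslated state and a GC-rich coding state.
states = ("utr", "cds")
p_utr = self_transition(50)    # assumed mean UTR length of 50 bases
p_cds = self_transition(200)   # assumed mean CDS length of 200 bases
start = {"utr": 0.5, "cds": 0.5}
trans = {
    "utr": {"utr": p_utr, "cds": 1.0 - p_utr},
    "cds": {"cds": p_cds, "utr": 1.0 - p_cds},
}
emit = {
    "utr": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "cds": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}

seq = "ATTATAATTATTAAT" + "GCGCCGGCGCCGGCG" * 3  # made-up sequence
path = viterbi(seq, states, start, trans, emit)

if __name__ == "__main__":
    print("".join("U" if s == "utr" else "C" for s in path))
```

An unsupervised setup of the kind the abstract describes would additionally re-estimate `trans` and `emit` from unlabelled ESTs (for example with Baum-Welch); that step is omitted here.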
