1 |
Towards a Genome Reverse Compiler
Warren, Andrew S. 29 November 2007
The Genome Reverse Compiler (GRC) is an annotation tool for prokaryotic genomes. Its name and philosophy are based on analogy with a high-level programming language compiler. In this analogy, the genome is a program in a certain low-level language that humans cannot understand. Given the sequence of any prokaryotic genome, GRC produces its corresponding "high-level program"--its annotation. GRC works in a completely automatic manner, using standard input and output formats. The goal is to provide an open-source, easy-to-run, very efficient annotation program. / Master of Science
|
2 |
Evidence Combination in Hidden Markov Models for Gene Prediction
Brejova, Bronislava January 2005
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species.
We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to use partial statements allows us to handle missing information transparently and to cope with the heterogeneous character of individual sources of evidence. On the other hand, this feature makes the combination step more difficult. We present a new method for combining partial probabilistic statements and prove that it is an extension of existing methods for combining complete probability statements. We evaluate the performance of our system and its individual components on data from the human and fruit fly genomes.
The use of sequence evolutionary conservation as a source of evidence in gene finding requires efficient and sensitive tools for finding similar regions in very long sequences. We present a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties. In particular, we build a hidden Markov model representing a typical homology between two protein coding regions and then use this model to optimize a component of a heuristic algorithm called a spaced seed.
The seeds that we discover significantly improve the accuracy and running time of similarity search in protein coding regions, and are directly applicable to our gene finder.
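To make the spaced-seed idea in this abstract concrete, here is a minimal sketch of how a seed hit can be detected. A spaced seed is written as a string of `1` (positions that must match) and `0` (positions that are ignored). The function name `seed_hits`, the toy sequences, and the seed patterns are assumptions chosen for illustration; they are not the seeds optimized in the thesis.

```python
def seed_hits(a, b, seed):
    """Return the offsets i at which sequences a and b, compared
    position by position without gaps, agree at every '1' of the seed."""
    n = min(len(a), len(b))
    # Indices inside the seed window that are required to match.
    care = [k for k, c in enumerate(seed) if c == "1"]
    hits = []
    for i in range(n - len(seed) + 1):
        if all(a[i + k] == b[i + k] for k in care):
            hits.append(i)
    return hits


# Two toy coding-like sequences that differ at every third position.
a = "ATGGCCAAATTT"
b = "ATAGCTAAGTTT"

spaced = seed_hits(a, b, "110110")   # weight 4, ignores third positions
contiguous = seed_hits(a, b, "1111") # weight 4, contiguous
```

In this toy case the spaced pattern finds hits where a contiguous seed of the same weight finds none, which is the kind of sensitivity gain the abstract refers to for protein coding regions, where mismatches tend to concentrate at third codon positions.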
|
3 |
Meta State Generalized Hidden Markov Model for Eukaryotic Gene Structure Identification
Baribault, Carl 20 December 2009
Using a generalized-clique hidden Markov model (HMM) as the starting point for a eukaryotic gene finder, the objective here is to strengthen the signal information at the transitions between coding and non-coding (c/nc) regions. This is done by enlarging the primitive hidden states associated with individual base labeling (as exon, intron, or junk) to substrings of primitive hidden states, or footprint states. Moreover, the allowed footprint transitions are restricted to those that include either one c/nc transition or none at all. (This effectively imposes a minimum length on exons and the other regions.) These footprint states allow the c/nc transitions to be seen sooner and their contributions to the gene-structure identification to be weighted more heavily, with a natural weighting determined by the HMM itself according to the training data rather than by tuning an artificial gain parameter on major transitions. The selection of the generalized HMM is interpolated to highest Markov order on emission probabilities, and to highest Markov order (subsequence length) on the footprint states. The former is accomplished via simple count cutoff rules, the latter via an identification of anomalous base statistics near the major transitions using Shannon entropy. Preliminary indications, from applications to the C. elegans genome, are that the sensitivity/specificity (SN/SP) results for both the individual state and full exon predictions are greatly enhanced using the generalized-clique HMM when compared to the standard HMM. Here the standard HMM is represented by the choice of the smallest size of footprint state in the generalized-clique HMM. Even with these improvements, we observe that both extremely long and short exon and intron segments would go undetected without an explicit model of the duration of state.
The key contributions of this effort are the full derivation and experimental confirmation of a rudimentary, yet powerful and competitive gene finding method based on a higher order hidden Markov model. With suitable extensions, this method is expected to provide superior gene finding capability – not only in the context of pre-conditioned data sets as in the evaluations cited but also in the wider context of less preconditioned and/or raw genomic data.
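The abstract mentions using Shannon entropy to identify anomalous base statistics near the major transitions. The following sketch shows one plausible form of that computation: per-column entropy over equal-length windows centred on a transition. The function name `column_entropies` and the toy windows are assumptions; the abstract does not give the thesis's exact procedure.

```python
import math
from collections import Counter


def column_entropies(windows):
    """Shannon entropy (in bits) of the base distribution at each
    column of a list of equal-length sequence windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    entropies = []
    for col in range(length):
        counts = Counter(w[col] for w in windows)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical windows around a transition; columns 1 and 2 are fully conserved.
toy_windows = ["AGGT", "CGGT", "TGGA", "GGGC"]
toy_entropies = column_entropies(toy_windows)
```

Columns whose entropy falls well below that of the background composition are the ones carrying signal, so the span of such columns suggests how long a footprint needs to be.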
|
4 |
Enhancements to Hidden Markov Models for Gene Finding and Other Biological Applications
Vinar, Tomas January 2005
In this thesis, we present enhancements of hidden Markov models for the problem of finding genes in DNA sequences. Genes are the parts of DNA that serve as a template for synthesis of proteins. Thus, gene finding is a crucial step in the analysis of DNA sequencing data.
Hidden Markov models are a key tool used in gene finding. This thesis presents three methods for extending the capabilities of hidden Markov models to better capture the statistical properties of DNA sequences. In all three, we encounter limiting factors that lead to trade-offs between the model accuracy and those limiting factors.
First, we build better models for recognizing biological signals in DNA sequences. Our new models capture non-adjacent dependencies within these signals. In this case, the main limiting factor is the amount of training data: more training data allows more complex models. Second, we design methods for better representation of length distributions in hidden Markov models, where we balance the accuracy of the representation against the running time needed to find genes in novel sequences. Finally, we show that creating hidden Markov models with complex topologies may be detrimental to the prediction accuracy, unless we use more complex prediction algorithms. However, such algorithms require longer running time, and in many cases the prediction problem is NP-hard. For gene finding this means that incorporating some of the prior biological knowledge into the model would require impractical running times. However, we also demonstrate that our methods can be used for solving other biological problems, where input sequences are short.
As a model example to evaluate our methods, we built a gene finder ExonHunter that outperforms programs commonly used in genome projects.
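For readers unfamiliar with how a hidden Markov model is used to find genes, the sketch below decodes the most probable state path with the standard Viterbi algorithm on a toy two-state model (non-coding `N`, coding `C`). Every probability here is made up for illustration, and the model is far simpler than ExonHunter or any real gene finder; it only shows the basic prediction step that the enhancements described above build on.

```python
import math

# Toy two-state model: N = non-coding, C = coding.
# All probabilities are hypothetical values chosen for illustration.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for seq as a string of labels."""
    if not seq:
        return ""
    # best[s] = log-probability of the best path that ends in state s.
    best = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # one back-pointer table per position after the first
    for sym in seq[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            best[s] = prev[p] + math.log(trans[p][s]) + math.log(emit[s][sym])
            ptr[s] = p
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


labels = viterbi("ATATATGCGCGCGC")
```

The running time is linear in the sequence length and quadratic in the number of states, which is why richer length distributions and more complex topologies trade accuracy against running time, as the abstract notes.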
|
7 |
Combination of results from gene-finding programs
Hammar, Cecilia January 1999
Gene-finding programs available over the Internet today are shown to be nothing more than guides to possible coding regions in the DNA. The programs often make incorrect predictions. The idea of combining a number of different gene-finding programs arose a couple of years ago. Murakami and Takagi (1998) published one of the first attempts to combine results from gene-finding programs built on different techniques (e.g. artificial neural networks and hidden Markov models). The simple combination methods used by Murakami and Takagi (1998) indicated that the prediction accuracy could be improved by a combination of programs.
In this project artificial neural networks are used to combine the results of the three well-known gene-finding programs GRAILII, FEXH, and GENSCAN. The results show a considerable increase in prediction accuracy compared to the best-performing single program, GENSCAN.
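As a point of reference for what combining program outputs can mean, the sketch below applies a per-base majority vote to three label strings. This is an assumed baseline for illustration only: it is neither the neural-network combiner of this project nor necessarily one of the methods of Murakami and Takagi (1998), and the input strings are invented rather than real GRAILII, FEXH, or GENSCAN output.

```python
def majority_vote(predictions):
    """Combine equal-length label strings ('C' = coding, 'N' = non-coding)
    by taking, at each base, the label chosen by more than half the programs."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same bases")
    combined = []
    for i in range(length):
        coding_votes = sum(p[i] == "C" for p in predictions)
        combined.append("C" if 2 * coding_votes > len(predictions) else "N")
    return "".join(combined)


# Hypothetical per-base outputs of three gene finders on the same region.
consensus = majority_vote(["CCNN", "CNNC", "CCCN"])
```

A trained combiner such as a neural network can go further than this fixed rule by learning, from annotated examples, how much weight each program deserves.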
|
9 |
New ab initio methods of small genome sequence interpretation
Mills, Ryan Edward 07 April 2006
This thesis presents novel methods for analyzing short viral sequences and identifying biologically significant regions based on their statistical properties. The first section of this thesis describes an ab initio method for identifying genes in viral genomes of varying type, shape and size. This method uses statistical models of the viral protein-coding and non-coding regions. We have created an interactive database summarizing the results of applying this method to the viral genomes currently available in GenBank. This database, called VIOLIN, provides access to the genes identified for each viral genome, allows for further analysis of these gene sequences and the translated proteins, and graphically displays the distribution of protein-coding potential in a viral genome.
The next two sections of this thesis describe individual projects for two specific viral genomes analyzed with the new method. The first project was devoted to the recently sequenced Herpes B virus from Rhesus macaque. This genome was initially thought to lack an ortholog of the gamma-34.5 gene, which encodes a neurovirulence factor necessary for viability of the two close relatives, human herpes simplex viruses 1 and 2. The genome of Rhesus macaque Herpes B virus was annotated using the new gene finding procedure, and an in-depth analysis was conducted to find a gamma-34.5 ortholog using a variety of similarity search tools. A profound similarity in codon usage between B virus and its host was also identified, despite the large difference in their GC contents (74% and 51%, respectively).
The last thesis section describes the analysis of the Mouse Cytomegalovirus (MCMV) genome by the combination of methods such as sequence segmentation, gene finding and protein identification by mass spectrometry. The MCMV genome is a challenging subject for statistical sequence analysis due to the heterogeneity of its protein coding regions. Therefore the MCMV genome was segmented based on its nucleotide composition and then each segment was considered independently. A thorough analysis was conducted to identify previously unnoticed genes, incorrectly annotated genes and potential sequence errors causing frameshifts. All the findings were then corroborated by the mass spectrometry analysis.
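The two statistics compared in the Herpes B virus project, GC content and codon usage, are straightforward to compute. The sketch below shows one way to do so; the function names and the toy sequence are assumptions for illustration and are not taken from the thesis or from VIOLIN.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence.
    Trailing bases that do not fill a complete codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


# Hypothetical coding sequence, not real viral data.
toy_cds = "ATGGCCGCCTAA"
toy_gc = gc_content(toy_cds)
toy_usage = codon_usage(toy_cds)
```

Comparing two such codon-usage tables, for example between a virus and its host, then reduces to comparing two frequency dictionaries over the same set of codons.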
|
10 |
Improvement of ab initio methods of gene prediction in genomic and metagenomic sequences
Zhu, Wenhan 06 April 2010
A metagenome originating from shotgun sequencing of a microbial community is a heterogeneous mixture of rather short sequences. A vast majority of microbial species in a given community (99%) are likely to be non-cultivable. Many protein-coding regions in a new metagenome are likely to code for barely detectable homologs of already known proteins. Therefore, an ab initio method that accurately identifies the new genes is a vitally important tool of metagenomic sequence analysis. A heuristic method for finding genes in short prokaryotic sequences of anonymous origin was proposed in 1999, before the advent of metagenomics. With hundreds of new prokaryotic genomes available, it is now possible to enhance the original approach and to utilize direct polynomial and logistic approximations of oligonucleotide frequencies. The idea was to bypass traditional ways of parameter estimation, such as supervised training on a set of validated genes or unsupervised training on an anonymous sequence supposed to contain a large enough number of genes. The codon frequencies, critical for the model parameterization, could be derived from the frequencies of nucleotides observed in the short sequence. This method could be further applied to initialize the algorithms for iterative parameter estimation in prokaryotic as well as eukaryotic gene finders.
|