11 |
Gene finding in eukaryotic genomes using external information and machine learning techniques
Burns, Paul D. 20 September 2013
Gene finding in eukaryotic genomes is an essential part of a comprehensive approach to modern systems biology. Most methods developed in the past rely on a combination of computational prediction and external information about gene structures from transcript sequences and comparative genomics. Previously, external sequence information consisted of a combination of full-length cDNA and expressed sequence tag (EST) sequences. The availability of RNA-seq data promises much improvement in the prediction of genes and gene isoforms. However, productive use of RNA-seq for gene prediction has been difficult because of the challenges of mapping RNA-seq reads that span splice junctions and because of prevalent splicing noise in the cell. This work addresses this difficulty with the development of methods and the implementation of two new pipelines: (1) a novel pipeline for accurate mapping of RNA-seq reads to compact genomes and (2) a pipeline for prediction of genes in eukaryotic genomes using the RNA-seq spliced alignments. Machine learning methods are employed to overcome errors associated with mapping short RNA-seq reads across introns and with using them to determine sequence model parameters for gene prediction. In addition to the development of these new methods, genome annotation work was performed on several plant genome projects.
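As a rough illustration of what a "spliced alignment" contributes to gene prediction, the sketch below extracts candidate intron coordinates from a SAM-style CIGAR string, where the N operation marks a region skipped on the reference. This is not the pipeline described in the abstract; the function name introns_from_cigar and the example coordinates are hypothetical, and the filtering of noisy junctions that the thesis addresses with machine learning is not shown.

```python
import re

# One CIGAR element: a length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive (start, end) reference intervals for each
    N (skipped region) operation of an alignment starting at `pos`."""
    introns = []
    ref = pos  # next reference position to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        # Only these operations advance along the reference sequence.
        if op in "MDN=X":
            ref += length
    return introns

# A read aligned at position 100: 50 matched bases, a 200-base gap, 50 more.
candidate_introns = introns_from_cigar(100, "50M200N50M")
```

Each interval is only a candidate: a downstream step would still have to decide which junctions are real and which are mapping or splicing noise.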
|
12 |
Efficient algorithms for the identification of miRNA motifs in DNA sequences
Mendes, Nuno D 06 June 2011
Unravelling biological processes depends on the adequate modelling of the regulatory mechanisms that determine the timing and spatial patterns of gene expression. In the last decade, a novel regulatory mechanism has been discovered and its biological importance has been increasingly recognised. This mechanism is mediated by RNA molecules named miRNAs, which are the product of the maturation of non-coding gene transcripts and act post-transcriptionally, usually to dampen or abolish the expression of protein-coding genes. Although miRNAs eluded detection for a long time, it is now clear that the expression pattern of many genes cannot be elucidated without incorporating the effects of miRNA-mediated regulation. The technical difficulties of detecting these regulators experimentally prompted the development of increasingly sophisticated computational approaches. Gene finding strategies originally developed for coding genes cannot be applied, since these non-coding molecules are subject to very different sequence constraints and are too short to exhibit statistical properties that can be easily distinguished from the background. As a result, computational tools came to rely heavily on the identification of conserved sequences, distant homologs and machine learning techniques. Recent developments in sequencing technology have overcome some of the limitations of earlier experimental approaches, but pose new computational challenges. At present, the identification of new miRNA genes is therefore the result of several approaches, both computational and experimental. Despite the advances this research field has seen in recent years, we are still not able to formally and rigorously characterise miRNA genes in order to identify whichever sequence, structure or contextual requirements are needed to turn a DNA sequence into a functional miRNA.
Efforts to enumerate the full set of miRNAs of an organism with computational algorithms have been limited by a strong reliance on arguments of precursor conservation and feature similarity. However, miRNA precursors may arise anew or be lost across the evolutionary history of a species, and a newly-sequenced genome may be evolutionarily too distant from other genomes for an adequate comparative analysis. In addition, the learning of intricate classification rules based purely on features shared by currently known miRNA precursors may reflect a perpetuating identification bias rather than a sound means to tell true miRNAs from other genomic stem-loops. In this thesis, we present a strategy to sieve through the vast number of stem-loops found in metazoan genomes in search of pre-miRNAs, significantly reducing the set of candidates while retaining most known miRNA precursors. Our approach relies on precursor properties derived from the current knowledge of miRNA biogenesis, analysis of the precursor structure and incorporation of information about the transcription potential of each candidate. Our approach has been applied to the genomes of Drosophila melanogaster and Anopheles gambiae, which has allowed us to show that there is a strong bias amongst annotated pre-miRNAs towards robust stem-loops in these genomes and to propose a scoring scheme for precursor candidates which combines four robustness measures. Additionally, we have identified several known pre-miRNA homologs in the newly-sequenced Anopheles darlingi and shown that most are found amongst the top-scoring precursor candidates for that organism, with respect to the combined score.
The structural analysis of our candidates and the identification of the region of the structural space where known precursors are usually found allowed us to eliminate several candidates, but also showed that there is a staggering number of genomic stem-loops which seem to fulfil the stability, robustness and structural requirements, indicating that additional evidence is needed to identify functional precursors. To this end, we have introduced different strategies to evaluate the transcription potential of the remaining candidates, which vary according to the information available for the dataset under study.
|
13 |
Unsupervised and semi-supervised training methods for eukaryotic gene prediction
Ter-Hovhannisyan, Vardges 17 November 2008
This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Building these training sets often requires time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence with. Given the ever-growing rate of eukaryotic genome sequencing, the practical applicability of unsupervised training is critically important.
Three distinct training procedures are developed for a diverse group of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns.
The results indicate that the developed unsupervised training methods perform well, both in comparison with other training methods and as estimated from the set of genes supported by EST-to-genome alignments.
Analysis of novel genomes reveals interesting biological findings and shows that several candidate under-annotated and over-annotated fungal species are present in the current set of annotated fungal genomes.
|
14 |
Detekce genů v DNA sekvencích / Gene Detection in DNA Sequences
Roubalík, Zbyněk January 2011
Gene detection in DNA sequences is one of the most difficult problems currently being addressed in bioinformatics. This thesis deals with gene detection in DNA sequences using methods based on Hidden Markov Models. It contains a brief description of the fundamental principles of molecular biology, explains how genetic information is stored in DNA sequences, and presents the theoretical basis of Hidden Markov Models. It then describes the approach taken in designing specific Hidden Markov Models for the problem of gene detection in DNA sequences. An application that uses the designed Hidden Markov Model for gene detection is designed and implemented. This application is tested on real data; the results of these tests are discussed at the end of the thesis, along with possible extensions and continuation of the project.
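To make the Hidden Markov Model idea concrete, here is a minimal sketch of Viterbi decoding with a two-state model that labels each base as "coding" or "noncoding". It is not the model designed in the thesis: the two-state topology and every probability below are hypothetical values chosen only so that GC-rich stretches favour the coding state.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, for illustration only.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for entering state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

labels = viterbi("ATATATATATGCGCGCGCGC")
```

A practical gene finder would use many more states (exons by codon phase, introns, splice sites) and parameters estimated from training data, but the decoding step has the same shape.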
|
15 |
Comparative analysis of eukaryotic gene sequence features
Abril Ferrando, Josep Francesc 17 May 2005
The constantly increasing number of available genome sequences, along with an increasing number of experimental techniques, will help to produce the complete catalog of cellular functions for different organisms, including humans. Such a catalog will define the base from which we will better understand how organisms work at the molecular level. At the same time it will shed light on which changes are associated with disease. Therefore, the raw sequence from genome sequencing projects is worthless without the complete analysis and further annotation of the genomic features that define those functions. This dissertation presents our contribution to three related aspects of gene annotation on eukaryotic genomes.
First, a comparison at sequence level of the human and mouse genomes was performed by developing a semi-automatic analysis pipeline. The SGP2 gene-finding tool was developed from procedures used in this pipeline. The concept behind SGP2 is that similarity regions obtained by TBLASTX are used to increase the score of exons predicted by geneid, in order to produce a more accurate set of gene structures. SGP2 provides a specificity that is high enough for its predictions to be experimentally verified by RT-PCR. The RT-PCR validation of predicted splice junctions also serves as an example of how combined computational and experimental approaches yield better results than either alone. Then, we performed a descriptive analysis at sequence level of the splice site signals from a reliable set of orthologous genes for human, mouse, rat and chicken. We have explored the differences at nucleotide sequence level between U2 and U12 splice sites for the set of orthologous introns derived from those genes. We found that orthologous splice signals between human and rodents and within rodents are more conserved than unrelated splice sites. However, this additional conservation can be explained mostly by background intron conservation. Additional conservation over background is detectable in orthologous mammalian and chicken splice sites. Our results also indicate that the U2 and U12 intron classes have evolved independently since the split of mammals and birds. We found neither a convincing case of interconversion between these two classes in our sets of orthologous introns, nor any single case of switching between AT-AC and GT-AG subtypes within U12 introns. In contrast, switching between GT-AG and GC-AG U2 subtypes does not appear to be unusual. Finally, we implemented visualization tools to integrate annotation features for gene finding and comparative analyses. One of those tools, gff2ps, was used to draw the whole genome maps for human, fruitfly and mosquito.
gff2aplot and the accompanying parsers facilitate the task of integrating sequence annotations with the output of homology-based tools, like BLAST. We have also adapted the concept of pictograms to the comparative analysis of orthologous splice sites, by developing compi.
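The idea attributed to SGP2 in this abstract, raising the score of predicted exons where they coincide with similarity regions, can be sketched as follows. This is an illustration under stated assumptions, not SGP2's actual scoring function: the additive bonus, the weight parameter, the tuple layouts and the function names are all hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

def boost_exon_scores(exons, similarity_regions, weight=0.5):
    """Return exons with scores raised by overlapping similarity regions.

    exons: list of (start, end, score) tuples from an ab initio predictor.
    similarity_regions: list of (start, end, similarity_score) tuples.
    The bonus formula is an assumption made for this sketch.
    """
    boosted = []
    for start, end, score in exons:
        bonus = 0.0
        for r_start, r_end, r_score in similarity_regions:
            shared = overlap(start, end, r_start, r_end)
            if shared:
                # Scale the region's score by the fraction of it that the exon covers.
                bonus += weight * r_score * shared / (r_end - r_start + 1)
        boosted.append((start, end, score + bonus))
    return boosted

exons = [(100, 199, 2.0), (300, 399, 2.0)]
regions = [(150, 249, 4.0)]
rescored = boost_exon_scores(exons, regions)
```

In this toy input only the first exon overlaps the similarity region, so only its score rises; an exon chaining step would then prefer gene structures built from the supported exons.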
|