381 |
Évolution de l’architecture des génomes : modélisation et reconstruction phylogénétique / Evolution of the architecture of genomes: modelling and phylogenetic reconstruction
Semeria, Magali 09 December 2015
Genomes evolve through processes that modify their content and organization at different scales, ranging from the substitution, insertion or deletion of a single nucleotide, to the duplication, loss or transfer of a gene, to large-scale chromosomal rearrangements. Extant genomes are the result of a combination of many such processes, which makes it difficult to reconstruct the overall picture of genome evolution. As a result, most models and methods focus on one scale and use only one kind of data, such as gene orders or sequence alignments. Most phylogenetic reconstruction methods focus on the evolution of sequences, and some have recently been extended to integrate gene family evolution. Chromosomal rearrangements have also been studied extensively, leading to many models for the evolution of genome architecture. Until recently these two approaches to modelling genome evolution have rarely been combined, mainly because of computational issues. In this thesis, I present a new model of the evolution of genome architecture that accounts for the evolution of genes and sequences. With this model, one can produce a probabilistic reconstruction of the evolutionary history of gene adjacencies and of ancestral gene orders, accounting both for events that modify the gene content of genomes (gene duplications and losses) and for events that modify genome architecture (chromosomal rearrangements). Integrating phylogenetic information into gene order reconstruction yields more complete evolutionary histories. Conversely, reconstructing ancestral gene orders can provide feedback on the quality of gene trees, paving the way for an integrative model and reconstruction method.
|
382 |
Assembly, annotation and polymorphism analysis of a draft transcriptome sequence for a fast-growing Eucalyptus plantation tree
Hefer, Charles Amadeus 18 October 2011
Ultra-high-throughput DNA sequencing technologies have rapidly changed the face of genomic research projects. Technologies such as mRNA-Seq have the potential to rapidly profile the expressed gene catalog of non-model organisms, albeit with significant bioinformatics-related costs and support requirements. This study developed automated data analysis workflows for quality evaluation of mRNA-Seq reads, de novo transcriptome assembly, transcriptome annotation and digital gene expression profiling, using publicly available data analysis tools together with novel tools developed for this purpose. The workflows were made available in a private instance of the Galaxy workflow management system and were used to perform the de novo assembly of a gene catalog of a Eucalyptus plantation tree. The fast growth and good wood properties of Eucalyptus tree species and their hybrids make them excellent renewable resources of fiber for pulp and paper, and of woody biomass for bioenergy production. We produced an expressed gene catalog of 18 894 de novo assembled contigs from Illumina deep mRNA-Seq of six sampled plant tissues. Using a novel coverage-assisted re-assembly approach, we were able to assemble near full-length, biologically relevant transcripts. The assembly was evaluated in terms of contig quality and contiguity, and functional annotations were assigned. Digital expression profiles (FPKM values) were calculated for each contig across the tissues and used to identify tissue-specific sets of expressed genes. Polymorphism analysis of 13 806 high-confidence contigs revealed a combined exon and untranslated region SNP density of 0.534 SNPs/100 bp, which provides a good opportunity for designing high-density SNP assays in the expressed regions of the Eucalyptus genome.
The assembled and annotated gene catalog was made available for public use through a user-friendly, web-based interface as the Eucspresso database (http://eucspresso.bi.up.ac.za). The database acts as a prelude to a more comprehensive mRNA-Seq whole-transcriptome repository, the Eucalyptus Genome Integrative Explorer (EucGenIE), a resource that will focus on identifying transcriptional networks active during woody biomass development. The results of the study show that current bioinformatics software tools and approaches can be used to successfully assemble and characterize a large proportion of the transcriptome of a complex eukaryotic organism. This approach can be used to characterize the gene catalog of a wide range of non-model organisms using only data derived from ultra-high-throughput sequencing (uHTS) experiments. / Thesis (PhD)--University of Pretoria, 2011. / Biochemistry / unrestricted
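The expression profiling and SNP density figures reported in this abstract rest on two simple normalizations. The Python sketch below shows the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments) and a SNPs-per-100-bp density. The function names and the example counts are hypothetical and are not taken from the thesis, whose own workflow may differ in detail.

```python
def fpkm(fragments_on_contig, contig_length_bp, total_mapped_fragments):
    """Fragments per kilobase of contig per million mapped fragments."""
    if contig_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("contig length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments_on_contig * 1e9 / (contig_length_bp * total_mapped_fragments)


def snps_per_100bp(snp_count, surveyed_bp):
    """SNP density expressed as SNPs per 100 bp of surveyed sequence."""
    if surveyed_bp <= 0:
        raise ValueError("surveyed length must be positive")
    return 100.0 * snp_count / surveyed_bp


# Hypothetical numbers, for illustration only (not data from the study).
example_fpkm = fpkm(fragments_on_contig=500, contig_length_bp=2_000,
                    total_mapped_fragments=10_000_000)
example_density = snps_per_100bp(snp_count=534, surveyed_bp=100_000)
print(example_fpkm, example_density)
```

Normalizing by both contig length and library size is what makes values comparable across contigs and across the sampled tissues, which is the property a tissue-specificity comparison relies on.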
|
383 |
The cost of longevity: loss of sexual function in natural clones of Populus tremuloides
Ally, Dilara 05 1900
Most clonal plants exhibit a modular structure at multiple levels. At the level of the organs, they are characterized by functional modules such as internodes, leaves, and branches. At the level of the genetic individual (clone or genet), they possess independent evolutionary and physiological units (ramets). These evolutionary units arise through the widespread phenomenon of clonal reproduction, achieved in a variety of ways including rhizomes, stolons, bulbils, or lateral roots. The focus of this study was Populus tremuloides, trembling aspen, a dioecious tree that reproduces sexually by seed and asexually through lateral roots. Local forest patches in western populations of Populus tremuloides consisted largely of multiple genotypes. Multi-clonal patches were dominated by a single genotype, and in one population (Riske Creek) we found several patches (five out of 17) consisting of a single genotype. A second consequence of modularity is that during the repeated cycle of ramet birth, development and death, somatic mutations have the opportunity to occur. Eventually, the clone becomes a mosaic of mutant and non-mutant cell lineages. We found that neutral somatic mutations accumulated across 14 microsatellite loci at a rate of between 10^-6 and 10^-5 per locus per year. We suggest that neutral genetic divergence, under a star phylogeny model of clonal growth, is an alternative way to estimate clone age. Previous estimates of clone age combine the mean annual growth rate of shoots with the area covered by the clone, which assumes a positive linear relationship between clone age and clone size. We found, however, no repeatable pattern across our populations in the relationship of either clone shape or clone size to the number of somatic changes. A final consequence of modularity is that during clonal growth, natural selection is relaxed for traits involving sexual function.
This means that mutations deleterious to sexual function can accumulate, reducing the overall sexual fitness of a clone. We coupled neutral genetic divergence within clones with pollen fitness data to infer the rate and effect of mildly deleterious mutations. Mutations reduced relative sexual fitness in clonal aspen populations by about 0.12x10^-3 to 1.01x10^-3 per year. Furthermore, the decline in sexual function with clone age is evidence that clonal organisms are vulnerable to the effects of senescence. / Medicine, Faculty of / Medical Genetics, Department of / Graduate
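The idea of dating a clone from neutral divergence can be sketched as follows. This is a minimal illustration under a simplifying assumption that is not stated in the abstract: under a star phylogeny each ramet lineage accumulates mutations independently since the origin of the clone, so the expected number of somatic mutations per ramet is roughly (number of loci) × (rate per locus per year) × (age). The function name and the example value of 0.7 mutations per ramet are hypothetical; only the 14 loci and the rate range of 10^-6 to 10^-5 come from the abstract.

```python
def clone_age_years(mean_mutations_per_ramet, n_loci, rate_per_locus_per_year):
    """Rough clone age from neutral somatic divergence (star phylogeny assumed)."""
    if n_loci <= 0 or rate_per_locus_per_year <= 0:
        raise ValueError("number of loci and mutation rate must be positive")
    return mean_mutations_per_ramet / (n_loci * rate_per_locus_per_year)


N_LOCI = 14                         # microsatellite loci, from the abstract
RATE_LOW, RATE_HIGH = 1e-6, 1e-5    # per locus per year, from the abstract
observed = 0.7                      # hypothetical mean mutations per ramet

# A faster mutation rate implies a younger clone for the same divergence.
youngest = clone_age_years(observed, N_LOCI, RATE_HIGH)
oldest = clone_age_years(observed, N_LOCI, RATE_LOW)
print(f"estimated clone age: {youngest:.0f} to {oldest:.0f} years")
```

Because the estimate scales inversely with the mutation rate, the tenfold uncertainty in the rate carries over directly into a tenfold range in the inferred age.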
|
384 |
Consequences of mitotic loss of heterozygosity on genomic imprinting in mouse embryonic stem cells
Elves, Rachel Leigh 11 1900
Epigenetic differences between maternally inherited and paternally inherited chromosomes, such as CpG methylation, render the maternal and paternal genomes functionally inequivalent, a phenomenon called genomic imprinting. This functional inequivalence is exemplified by imprinted genes, whose expression is parent-of-origin specific. The dosage of imprinted gene expression is disrupted in cells with uniparental disomy (UPD), which is an unequal parental contribution to the genome. I have derived mouse embryonic stem (ES) cell sub-lines with maternal UPD (mUPD) for mouse chromosome 6 (MMU6) to characterize the regulation and maintenance of imprinted gene expression.
The main finding from this study is that maintenance of imprinting in mitotic UPD is extremely variable. Imprint maintenance was shown to vary from gene to gene, and to vary between ES cell lines depending on the mechanism of loss of heterozygosity (LOH) in that cell line. Certain genes analyzed, such as Peg10, Sgce, Peg1, and Mit1, showed abnormal expression in ES cell lines for which they were mUPD. These abnormal expression levels are similar to those observed in ES cells with meiotically derived full-genome mUPD (parthenogenetic ES cells).
Imprinted CpG methylation at the Peg1 promoter was found to be abnormal in all sub-lines with mUPD for Peg1. Two cell sub-lines which incurred LOH through mitotic recombination showed hypermethylation of Peg1, consistent with the presence of two maternal alleles. Surprisingly, a cell sub-line which incurred LOH through full chromosome duplication/loss showed hypomethylation of Peg1. The levels of methylation observed in these sub-lines correlate with expression, as the first two sub-lines showed a near-consistent reduction of Peg1, while the latter showed Peg1 levels close to wild-type.
Altogether these results suggest that certain imprinted genes, like Peg1 and Peg10, have stricter imprinting maintenance, and as a result show abnormal expression in UPD. This strict imprint maintenance is disrupted, however, in UPD incurred through full chromosome duplication/loss, possibly because of the trisomic intermediate stage which occurs in this mechanism. / Medicine, Faculty of / Medical Genetics, Department of / Graduate
|
385 |
Etude de la prédiction génomique chez les caprins : faisabilité et limites de la sélection génomique dans le cadre d'une population multiraciale et à faible effectif / Study on genomic predictions in dairy goats: benefits and limits of genomic selection in a small size multibreed population
Carillier-Jacquin, Céline 16 October 2015
Genomic selection, which has revolutionized genetic selection in dairy cattle in particular, is now being considered for other species such as goats. The key to successful genomic selection is the accuracy of genomic evaluations. In French dairy goats, the expected gain in accuracy was an open question for the industry because of the small size of the available reference population (825 males and 1 945 females genotyped on a 50K SNP chip). The aim of this study was to investigate how to increase the accuracy of genomic evaluations in goats. A study of the genetic structure of the reference population, made up of Saanen and Alpine animals, showed that it is representative of the population raised in France. However, the low levels of linkage disequilibrium (0.17 between two consecutive SNPs), inbreeding and relatedness within the population, similar to those found in Lacaune sheep, are not ideal for obtaining accurate genomic evaluations. Moreover, despite the common origin of the Alpine and Saanen breeds, their genetic structures suggest that they are clearly distinct genetically. Two-step genomic methods (GBLUP or Bayesian), based on pre-corrected performances (DYD, deregressed EBV), did not significantly improve evaluation accuracy over classical evaluations for the routinely evaluated traits (production, type and somatic cell count traits). Taking into account the phenotypes of ungenotyped males increased evaluation accuracy by 3 to 47% depending on the trait. Adding the genotypes of females from a QTL detection design also improved accuracy (by 2 to 14%), both for two-step evaluations and for evaluations based on the females' own performances (single step). Accuracies increased by 10 to 74% with single-step evaluations compared with two-step evaluations, reaching levels higher than those obtained from ancestry in classical evaluations. Accuracies obtained with multi-breed, two-trait and single-breed genomic evaluations were similar, although the accuracy of the genomic estimated breeding values of selection candidates was higher with multi-breed evaluations. Genomic selection is therefore feasible in French dairy goats using a multi-breed single-step genomic model. Accuracies can be slightly increased by including major genes such as αs1 casein, in particular by using a "gene content" model to predict the genotypes of ungenotyped animals.
|
386 |
Inferences On The Function Of Proteins And Protein-Protein Interactions Using Large Scale Sequence And Structure Analysis
Krishnadev, O 05 1900
No description available.
|
387 |
La diversité des espèces du groupe Mycobacterium abscessus et leurs mycobactériophages / Mycobacterium abscessus diversity and their mycobacteriophages
Sassi, Mohamed 25 September 2013
First, we reviewed 14 published M. abscessus genomes, showing that M. abscessus sensu lato comprises at least five different taxa distinguished by microbiological characteristics of medical interest. Second, based on the sequencing of eight intergenic spacers, we developed a Multispacer Sequence Typing (MST) technique for identifying subspecies of the M. abscessus group and genotyping strains. MST clearly differentiates organisms formerly named “M. massiliense” from other M. abscessus subsp. bolletii organisms. We then analyzed a bacteriophage of M. bolletii that we named Araucaria. Its resolved 3D structure shows a capsid and a connector closely similar to those of several phages of Gram-negative and Gram-positive bacteria, and a helical tail decorated with radial spikes, possibly host adhesion devices. The host adsorption device at the tail tip (the baseplate) has features observed in phages that bind to protein receptors. Together, these results suggest that Araucaria may infect its mycobacterial host in two steps, the tail first binding to host saccharides and the baseplate then binding to cell wall proteins, a mechanism that remains to be explored further. We also analyzed 48 available M. abscessus genomes for prophage sequences. Our phylogenetic analyses suggest that M. abscessus species have been infected by different mycobacteriophages, that these prophages have an evolutionary history different from that of their mycobacterial hosts, and that they carry some proteins acquired by horizontal gene transfer, mostly mycobacteriophage proteins and hypothetical proteins. Finally, we sequenced and analyzed two non-tuberculous mycobacteria responsible for opportunistic infections, Mycobacterium simiae and Mycobacterium septicum.
|
388 |
Phylogénomique des bactéries pathogènes
Georgiades, Kalliopi 08 September 2011
The pathogenicity of bacteria has traditionally been attributed to virulence factors, and pathogenic bacteria are considered better armed than bacteria that do not cause disease. According to the first genomic studies, removing a certain number of genes from pathogenic bacteria impairs their capacity to infect hosts. In contrast, recent comparative genomic studies show that the specialization of bacteria in eukaryotic cells is associated with massive gene loss, especially for allopatric endosymbionts that have long been isolated in an intracellular niche. Indeed, sympatric, extracellular bacteria often have larger genomes and exhibit greater resistance and plasticity; they constitute species complexes rather than true species. Some specialists, including pathogenic bacteria, escape these complexes and colonize a niche, thereby gaining a species name. Their specialization allows them to become allopatric, and their gene losses promote reductive evolution. These observations led us to design a study to quantify the rate of gene loss during the evolution from extracellular bacteria to intracellular specialists. Our objective was to verify that genome reduction is what characterizes the evolution of intracellular bacteria, taking into account all possible gene gain events. Furthermore, in a neutral study comparing the 12 pandemic bacterial species most dangerous to humans with their closest non-epidemic relatives, we sought to identify genomic features associated with the virulence of pathogenic bacteria and to show that, apart from toxins and toxin-antitoxin modules, these species are characterized not by more virulence factors but by a loss of regulatory genes. In the end, pathogenic bacteria have a virulence repertoire in which absent genes are as important as present ones.
|
389 |
Le pouvoir pathogène chez Ralstonia solanacearum phylotype II : génomique intégrative et paysages transcriptomiques en relation avec l'adaptation à l'hôte / Pathogenicity of Ralstonia solanacearum phylotype II: integrative genomics and transcriptomic landscapes associated with host specificity
Ailloud, Florent 03 April 2015
Ralstonia solanacearum is a globally distributed plant pathogenic bacterium with an exceptionally broad host range. This organism has a multifaceted biology and has adapted to almost all soil types, to a planktonic lifestyle, and to many host and reservoir plants. This adaptability is reflected in the very strong heterogeneity of the strains that make up this species complex, in genetic diversity, phenotypic diversity and host range alike. Phylogenetic analyses have divided the worldwide population into four phylotypes that broadly correspond to the geographical origin of the strains. This thesis focuses on phylotype II strains, which serve as an experimental model because they are epidemiologically tied to particular hosts: Moko strains pathogenic to banana, ‘Brown rot’ strains adapted to potato, and emerging NPB strains, a pathogenicity variant. The central research question is to understand the mechanisms of host adaptation. To that end, about ten genomes were sequenced in order to (i) revisit the taxonomy of the species complex, (ii) carry out a comparative genomic analysis and (iii) analyze the transcriptomic landscapes produced during infection (in planta). Together, these complementary approaches give a more systemic view of the organism's genetic and phenotypic complexity.
|
390 |
Predicting "Essential" Genes in Microbial Genomes: A Machine Learning Approach to Knowledge Discovery in Microbial Genomic Data
Palaniappan, Krishnaveni 01 January 2010
Essential genes constitute the minimal gene set of an organism that is indispensable for its survival under most favorable conditions. The problem of accurately identifying and predicting genes essential for survival of an organism has both theoretical and practical relevance in genome biology and medicine. From a theoretical perspective it provides insights in the understanding of the minimal requirements for cellular life and plays a key role in the emerging field of synthetic biology; from a practical perspective, it facilitates efficient identification of potential drug targets (e.g., antibiotics) in novel pathogens. However, characterizing essential genes of an organism requires sophisticated experimental studies that are expensive and time consuming. The goal of this research study was to investigate machine learning methods to accurately classify/predict "essential genes" in newly sequenced microbial genomes based solely on their genomic sequence data.
This study formulates the prediction of essential genes as a binary classification problem and systematically investigates the applicability of three supervised classification methods to this task. In particular, Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN) based classifier models were constructed and trained on genomic features derived solely from the gene sequence data of 14 experimentally validated microbial genomes whose essential genes are known. A set of 52 relevant features derived from genomic sequence (including gene and protein sequence features, protein physico-chemical features and protein sub-cellular features) was used as input to the learners. The training and test datasets used in this study reflected the between-class imbalance (i.e. a skewed majority class vs. minority class) that is intrinsic to this data domain and to the essential gene prediction problem. Two imbalance reduction techniques (homology reduction and random under-sampling of 50% of the majority class) were devised that neither artificially balance the datasets nor compromise classifier generalizability. The classifier models were trained and evaluated using a 10-fold stratified cross-validation strategy on both the full multi-genome datasets and their class-imbalance-reduced variants to assess their ability to discriminate essential from non-essential genes. In addition, the classifiers were evaluated using novel blind testing strategies, called LOGO (Leave-One-Genome-Out) and LOTO (Leave-One-Taxon group-Out), on carefully constructed held-out datasets (genome-wise for LOGO and taxonomic group-wise for LOTO) that were not used in training the classifier models. The prediction performance metrics accuracy, sensitivity, specificity, precision and area under the Receiver Operating Characteristic curve (AU-ROC) were assessed for the DT, SVM and ANN derived models.
Empirical results from 10 × 10-fold stratified cross-validation and from the Leave-One-Genome-Out (LOGO) and Leave-One-Taxon group-Out (LOTO) blind testing experiments indicate that SVM and ANN based models perform better than Decision Tree based models. On 10 × 10-fold cross-validation, the SVM based models achieved an AU-ROC score of 0.80, while ANN and DT achieved 0.79 and 0.68 respectively. Both the LOGO (genome-wise) and LOTO (taxon-wise) blind tests revealed the extent to which these classifiers generalize across different genomes and taxonomic orders.
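The Leave-One-Genome-Out (LOGO) blind test described above can be sketched as a group-wise split: every gene from one genome is held out for testing while the classifier is trained on genes from all other genomes. The Python sketch below only generates the index splits. The helper name and the toy genome labels are hypothetical, and the abstract does not describe how the thesis actually implemented the procedure.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    `groups` holds one label per sample, e.g. the source genome of each gene.
    """
    by_group = defaultdict(list)
    for index, label in enumerate(groups):
        by_group[label].append(index)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        # Train on every sample whose group differs from the held-out one.
        train_idx = [i for i, label in enumerate(groups) if label != held_out]
        yield held_out, train_idx, test_idx


# Toy labels, for illustration only: six genes drawn from three genomes.
genome_of_gene = ["genome_A", "genome_A", "genome_B",
                  "genome_C", "genome_C", "genome_C"]
for held_out, train_idx, test_idx in leave_one_group_out(genome_of_gene):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Passing taxonomic-group labels instead of genome labels gives the corresponding LOTO splits. Unlike stratified k-fold cross-validation, no genome contributes genes to both the training and the test side of a split, which is what makes the test blind at the genome level.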
This study empirically demonstrated the merits of applying machine learning methods to predict essential genes in microbial genomes using only gene sequence and features derived from it. It also demonstrated that it is possible to predict essential genes from sequence-derived features without using homology information. The LOGO and LOTO blind test results reveal that the trained classifiers do generalize across genomes and taxonomic boundaries, and provide a first critical estimate of predictive performance on microbial genomes. Overall, this study provides a systematic assessment of applying DT, ANN and SVM to this prediction problem.
An important potential application of this study will be to apply the resultant predictive model/approach and integrate it as a genome annotation pipeline method for comparative microbial genome and metagenome analysis resources such as the Integrated Microbial Genome Systems (IMG and IMG/M).
|