Global ETD Search

11	Predicting "Essential" Genes in Microbial Genomes: A Machine Learning Approach to Knowledge Discovery in Microbial Genomic Data
Palaniappan, Krishnaveni 01 January 2010
Essential genes constitute the minimal gene set of an organism that is indispensable for its survival under the most favorable conditions. The problem of accurately identifying and predicting genes essential for the survival of an organism has both theoretical and practical relevance in genome biology and medicine. From a theoretical perspective, it provides insight into the minimal requirements for cellular life and plays a key role in the emerging field of synthetic biology; from a practical perspective, it facilitates efficient identification of potential drug targets (e.g., for antibiotics) in novel pathogens. However, characterizing the essential genes of an organism requires sophisticated experimental studies that are expensive and time consuming.
The goal of this research study was to investigate machine learning methods to accurately classify/predict "essential genes" in newly sequenced microbial genomes based solely on their genomic sequence data. This study formulates the prediction of essential genes as a binary classification problem and systematically investigates the applicability of three different supervised classification methods for this task. In particular, Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN) based classifier models were constructed and trained on genomic features derived solely from gene sequence data of 14 experimentally validated microbial genomes whose essential genes are known. A set of 52 relevant features derived from genomic sequence (including gene and protein sequence features, protein physicochemical features and protein sub-cellular features) was used as input for the learners to learn the classifier models.
The training and test datasets used in this study reflected the between-class imbalance (i.e., a skewed majority class vs. minority class) that is intrinsic to this data domain and to the essential gene prediction problem. Two imbalance reduction techniques (homology reduction and random under-sampling of 50% of the majority class) were devised that reduce the imbalance without artificially balancing the datasets or compromising classifier generalizability. The classifier models were trained and evaluated using a 10-fold stratified cross-validation strategy on both the full multi-genome dataset and its imbalance-reduced variants to assess their ability to discriminate essential genes from non-essential genes. In addition, the classifiers were evaluated using a novel blind testing strategy consisting of LOGO (Leave-One-Genome-Out) and LOTO (Leave-One-Taxon group-Out) tests on carefully constructed held-out datasets, genome-wise for LOGO and taxonomic group-wise for LOTO, that were not used in training the classifier models. The prediction performance metrics accuracy, sensitivity, specificity, precision and area under the Receiver Operating Characteristic curve (AU-ROC) were assessed for the DT, SVM and ANN derived models.
Empirical results from 10 × 10-fold stratified cross-validation, LOGO and LOTO blind testing experiments indicate that SVM and ANN based models perform better than Decision Tree based models. On 10 × 10-fold cross-validation, the SVM based models achieved an AU-ROC score of 0.80, while ANN and DT achieved 0.79 and 0.68 respectively. Both the LOGO (genome-wise) and LOTO (taxon-wise) blind tests revealed the extent to which these classifiers generalize across different genomes and taxonomic orders. This study empirically demonstrated the merits of applying machine learning methods to predict essential genes in microbial genomes using only the gene sequence and features derived from it.
It also demonstrated that it is possible to predict essential genes based on features derived from gene sequence without using homology information. The LOGO and LOTO blind test results reveal that the trained classifiers do generalize across genomes and taxonomic boundaries, and they provide a first critical estimate of predictive performance on microbial genomes. Overall, this study provides a systematic assessment of applying DT, ANN and SVM to this prediction problem. An important potential application of this study will be to integrate the resulting predictive model/approach as a genome annotation pipeline method for comparative microbial genome and metagenome analysis resources such as the Integrated Microbial Genome Systems (IMG and IMG/M).
Computational biology. Essential Genes. Genomic features. Machine learning. Microbial genomes. Supervised learning. Computer Sciences.
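The genome-wise blind test and the performance metrics named in the abstract above can be illustrated with a small sketch. This is not the thesis's code: the function names `leave_one_genome_out` and `binary_metrics`, the `(genome_id, features, label)` tuple layout, and the convention that label 1 means "essential" are all assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_genome_out(samples):
    """Yield (held_out_genome, train, test) splits, one per genome.

    `samples` is a list of (genome_id, features, label) tuples. This layout
    is a hypothetical one chosen for the sketch, not the thesis's format.
    """
    by_genome = defaultdict(list)
    for sample in samples:
        by_genome[sample[0]].append(sample)
    for held_out in sorted(by_genome):
        # Every gene of the held-out genome goes to the test set, so the
        # classifier never sees that genome during training.
        test = by_genome[held_out]
        train = [s for genome, group in by_genome.items()
                 if genome != held_out for s in group]
        yield held_out, train, test


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and precision.

    Label 1 is treated as the positive (essential) class; this is an
    assumption of the sketch.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }
```

A taxon-wise (LOTO) split follows the same pattern if the grouping key is a taxonomic group label instead of a genome identifier. Because the data are imbalanced, accuracy alone can look good for a classifier that rarely predicts the minority class, which is one reason to report sensitivity and precision alongside it.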
12	Merging metagenomic and microarray technologies to explore bacterial catabolic potential of Arctic soils
Whissell, Gavin. January 2006
No description available.
DNA microarrays. Gene libraries. Microbial genomes.
13	Exploring the fusion of metagenomic library and DNA microarray technologies
Spiegelman, Dan. January 2006
We explored the combination of metagenomic library and DNA microarray technologies into a single platform as a novel way to rapidly screen metagenomic libraries for genetic targets. In the "metagenomic microarray" system, metagenomic library clone DNA is printed on a microarray surface, and clones of interest are detected by hybridization to single-gene probes. This study represents the initial steps in the development of this technology. We constructed two 5,000-clone large-insert metagenomic libraries from two diesel-contaminated Arctic soil samples. We developed and optimized an automated fosmid purification protocol to rapidly extract clone DNA in a high-throughput 96-well format. We then created a series of small prototype arrays to optimize various parameters of microarray printing and hybridization, to identify and resolve technical challenges, and to provide proof of principle for this novel application. Our results suggest that the method shows promise, but more experimentation is needed to establish the feasibility of the approach.
Microbial genomes. Gene libraries. DNA microarrays.
14	Exploring the fusion of metagenomic library and DNA microarray technologies
Spiegelman, Dan. January 2006
No description available.
DNA microarrays. Gene libraries. Microbial genomes.
15	Organisation et expression des gènes de résistance aux métaux lourds chez Cupriavidus metallidurans CH34 [Organization and expression of heavy metal resistance genes in Cupriavidus metallidurans CH34]
Monchy, Sébastien 04 June 2007
Cupriavidus metallidurans CH34 is a heavy-metal-resistant beta-proteobacterium isolated from the sediments of a non-ferrous metallurgy plant in Belgium.
The genome of this bacterium contains a chromosome (3.6 Mb), a megaplasmid (2.6 Mb) and two plasmids, pMOL28 (171 kb) and pMOL30 (234 kb), already known to carry heavy metal resistance genes.
We first catalogued the genes involved in heavy metal resistance and then sought to measure their expression using two transcriptomic approaches: RT-PCR and DNA microarrays.
Genome analysis reveals at least 170 genes related to metal ion resistance, located on the four replicons but mainly on the two plasmids. These genes mostly encode efflux systems such as HME-RND systems (chemiosmotic transport with a counter-flow of protons) and P-type ATPases, or the Cu(II) resistance system. In the C. metallidurans genome we identified 13 operons encoding HME-RND systems; only three, located on the plasmids, are overexpressed in the presence of heavy metals. Eight genes encode P-type ATPases, two of which belong to a class whose substrates are not metals. Two ATPases belong to a family specialized in Cu(II) efflux, and the other four belong to another large family involved in the efflux of Cd(II), Pb(II) and Zn(II) ions. Transcriptomic analyses show overexpression of the first two classes of P-type ATPases in the presence of heavy metals. Mutagenesis of the zntA gene (megaplasmid), which encodes one of the ATPases, reduces viability in the presence of Zn(II) and Cd(II) and, to a lesser extent, Pb(II), Tl(I) and Bi(III).
On pMOL30, copper resistance involves a cluster of 19 cop genes encoding copper resistance at the level of the periplasm and the cytoplasm, and probably a storage form for essential copper. These 19 genes are overexpressed in the presence of copper, but some fifteen neighboring genes also appear to be required for optimal expression of copper resistance.
Annotation of the plasmids revealed the relatedness of plasmid pMOL28 to plasmid pHG1 (hydrogenotrophy, CO2 fixation) of C. eutrophus H16 and to plasmid pSym (nitrogen fixation) of C. taiwanensis and, in pMOL30, the presence of two genomic islands that concentrate most of the heavy metal resistance determinants. The microarrays show overexpression of 83 of 164 genes in pMOL28 and of 143 of 250 genes in pMOL30. They also show that genes on the two plasmids are more strongly overexpressed than those located on the two megareplicons. Among the most interesting overexpressed genes of plasmid pMOL30 are truncated transposases and genes involved in membrane synthesis (glycosyltransferases). Expression analysis of the plasmid-borne heavy metal resistance genes shows overexpression in the presence of several metal ions added independently, and not only in response to the metal substrates of these operons, which suggests the involvement of two types of regulation whose corresponding genes are also located on the chromosome and the megaplasmid.
This work highlights the bacterium's specialization in responding to a broad range of heavy metal concentrations, up to the upper limit of toxicity observed for heterotrophic mesophilic bacteria. This specialization matches the industrial biotopes on various continents in which the bacterium has been found.
Doctorate in sciences, specialization in molecular biology / info:eu-repo/semantics/nonPublished
Exact and natural sciences. Biology. Heavy metals. Ralstonia. Microbial genomics. Microbial genomes. Genetic transcription. pMOL30. pMOL28. HME-RND. ATPase. cop. RND. CH34. metallidurans. Cupriavidus.
16	Promoter Prediction In Microbial Genomes Based On DNA Structural Features
Rangannan, Vetriselvi 04 1900
The promoter region is the key regulatory region that enables a gene to be transcribed or repressed by anchoring RNA polymerase and other transcription factors, but it is difficult to determine experimentally. Hence, in silico identification of promoters is crucial in order to guide experimental work and to pinpoint the key region that controls the transcription initiation of a gene. Analysis of various genome sequences in the vicinity of experimentally identified transcription start sites (TSSs) in prokaryotic as well as eukaryotic genomes had earlier indicated that they have several structural features in common, such as lower stability, higher curvature and less bendability, when compared with their neighboring regions. In this thesis work, the variation observed in these sequence-dependent DNA structural properties has been used to identify and delineate promoter regions from other genomic regions. Since the number of bacterial genomes being sequenced is increasing very rapidly, it is crucial to have procedures for rapid and reliable annotation of their functional elements such as promoter regions, which control the expression of each gene or each transcription unit of the genome. The thesis work addresses this requirement and presents the step-by-step protocols followed to arrive at a generic method for promoter prediction that is applicable across organisms. Each section below gives an overview of one part of the thesis as organized into chapters. An overview of prokaryotic transcriptional regulation, the structural polymorphism adopted by the DNA molecule and its impact on transcriptional regulation is given in the introductory chapter of this thesis (chapter 1).
Standardization of promoter prediction methodology - Part I
Based on the difference in stability between neighboring upstream and downstream regions in the vicinity of experimentally determined transcription start sites, a promoter prediction algorithm has been developed to identify prokaryotic promoter sequences in whole genomes. The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over random sequences generated from the downstream region of known TSSs (REav) are used to search for promoters in genomic sequences. Using these cutoff values to predict promoter regions across the entire E. coli genome, a reliability of 70% was achieved when the predicted promoters were cross-verified against the 960 transcription start sites (TSSs) listed in the EcoCyc database. Reliable promoter prediction is obtained when these genome-specific threshold values are used to search for promoters in the whole E. coli genome sequence. Annotation of the whole E. coli genome for promoter regions has been carried out with 49% accuracy.
Reference: Rangannan, V. and Bansal, M. (2007) Identification and annotation of promoter regions in microbial genome sequences on the basis of DNA stability. J Biosci, 32, 851-862.
Standardization of promoter prediction methodology - Part II
In this chapter, it is demonstrated that while promoter regions are in general less stable than their flanking regions, their average free energy varies depending on the GC composition of the flanking genomic sequence. Therefore, a set of free energy threshold values (TSS based threshold values) has been obtained from genomic DNA with varying GC content in the vicinity of experimentally identified TSSs. These threshold values have been used as generic criteria for predicting promoter regions in the E. coli, B. subtilis and M. tuberculosis genomes, using an in-house developed tool, ‘PromPredict’.
On applying it to predict promoter regions corresponding to the 1144 and 612 experimentally validated TSSs in E. coli (genome GC content: 50.8%) and B. subtilis (genome GC content: 43.5%), sensitivities of 99% and 95% and precision values of 58% and 60%, respectively, were achieved. For the limited data set of 81 TSSs available for M. tuberculosis (65.6% GC), a sensitivity of 100% and a precision of 49% were obtained.
Reference: Rangannan, V. and Bansal, M. (2009) Relative stability of DNA as a generic criterion for promoter prediction: whole genome annotation of microbial genomes with varying nucleotide base composition. Mol Biosyst, 5, 1758-1769.
Standardization of promoter prediction methodology - Part III
In this chapter, the promoter prediction algorithm and the threshold values have been improved to predict promoter regions on a large scale over 913 microbial genome sequences. The average free energy (AFE) values for the promoter regions as well as their downstream regions are found to differ depending on their GC content, even with respect to translation start sites (TLSs) from the 913 microbial genomes. The TSS based cut-off values derived in chapter 3 do not cover both extremes of the GC bins (at 5% intervals). Hence, threshold values have been derived from a subset of translation start sites (TLSs) from all microbial genomes, categorized by GC content. Interestingly, the cut-off values derived from the TSS data set (chapter 3) and the TLS data set are very similar for the in-between GC bins. Therefore, the TSS based cut-off values derived in chapter 2 have been combined with the TLS based cut-off values (denoted as TSS-TLS based cut-off values) to predict promoters over the complete genome sequences. An average recall of 72% (the percentage of protein and RNA coding genes with predicted promoter regions assigned to them) and a precision of 56% are achieved over the 913 microbial genome dataset.
These predicted promoter regions have been given a reliability level (low, medium, high, very high or highest) based on the difference in their relative average free energy, which can help users design their experiments with more confidence by using the predictions with higher reliability levels.
Reference: Rangannan, V. and Bansal, M. (2010) High Quality Annotation of Promoter Regions for 913 Bacterial Genomes. Bioinformatics, 26, 3043-3050.
Web applications
PromBase: The predicted promoter regions for 913 microbial genomes were deposited in a public domain database called PromBase, which can serve as a valuable resource for comparative genomics studies of general genomic features and can also help experimentalists rapidly access the annotation of promoter regions in any given genome. This database is freely accessible via the World Wide Web at http://nucleix.mbu.iisc.ernet.in/prombase/.
EcoProm: EcoProm is a database that can identify and display the potential promoter regions corresponding to EcoCyc annotated TSSs and genes. It also displays predictions for the whole genomic sequence of E. coli. EcoProm is available at http://nucleix.mbu.iisc.ernet.in/ecoprom/index.htm.
PromPredict: The generic promoter prediction methodology described in the previous chapters has been implemented as ‘PromPredict’, which is available at http://nucleix.mbu.iisc.ernet.in/prompredict/prompredict.html.
Analysing the DNA structural characteristics of prokaryotic promoter sequences for their predominance
Sequence-dependent structural properties and their variation in genomic DNA are important in controlling several crucial processes such as transcription, replication, recombination and chromatin compaction. In chapter 6, a quantitative analysis of sequence motifs as well as sequence-dependent structural properties, such as curvature, bendability and stability, in the upstream regions of TSSs and TLSs from E. coli, B. subtilis and M. tuberculosis has been carried out in order to assess their predictive power for promoter regions. The correlation between these structural properties and GC content has also been investigated. Our results show that AFE values (stability) give finer discrimination than %GC in identifying promoter regions, and that stability is the better structural property for delineating promoter regions from non-promoter regions. An analysis of these DNA structural properties has also been carried out on human promoter sequences, and the properties were observed to correlate with the inactivation status of X-linked genes in the human genome. Since this deviates from the main theme of the thesis, that analysis has been included as appendix A.
General conclusion
Stability is the ubiquitous DNA structural property seen in promoter regions. Stability gives finer discrimination for promoter prediction than using %GC content directly. Based on the relative stability of DNA, a generic promoter prediction algorithm has been developed and implemented to predict promoter regions on a large scale over 913 microbial genome sequences. Analysis of the predicted regions across organisms showed highly reliable predictive performance of the algorithm.
Gene Mapping Gene - Promoter Region Microbes Genes DNA Structure Microbial Genomes Promoter Prediction Methodology Microbial Genome Sequences DNA Stability 913 Bacterial Genomes Prokaryotic Genomes Microbial Promoter Sequences Promoter Prediction Algorithm Biochemical Genetics
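The relative-stability idea described in the abstract above (an upstream region that is less stable than its downstream neighbor) can be sketched as a sliding-window scan. This is only an illustration and not PromPredict itself: the per-dinucleotide energies from `toy_energy`, the window length and both cutoffs are hypothetical placeholders, and the sketch compares the upstream window directly with the downstream window instead of with random sequences generated from the downstream region.

```python
def toy_energy(dinucleotide):
    """Hypothetical free energy (kcal/mol) for one dinucleotide step.

    Placeholder values only: each G or C makes the step 0.5 more negative
    (more stable). A real analysis would use a published parameter table.
    """
    gc_count = sum(base in "GC" for base in dinucleotide)
    return -1.0 - 0.5 * gc_count


def average_free_energy(seq, energy=toy_energy):
    """Average the per-step energy over all overlapping dinucleotides."""
    steps = [energy(seq[i:i + 2]) for i in range(len(seq) - 1)]
    return sum(steps) / len(steps)


def predict_promoter_positions(seq, window=100, e_cutoff=-1.3, d_cutoff=0.2):
    """Return positions whose upstream window looks promoter-like.

    A position is reported when the upstream window is unstable in absolute
    terms (E >= e_cutoff, i.e. less negative) and also relative to the
    downstream window (D = E_upstream - E_downstream >= d_cutoff).
    Both cutoffs are illustrative assumptions, not thesis values.
    """
    hits = []
    for pos in range(window, len(seq) - window + 1):
        e_up = average_free_energy(seq[pos - window:pos])
        e_down = average_free_energy(seq[pos:pos + window])
        if e_up >= e_cutoff and (e_up - e_down) >= d_cutoff:
            hits.append(pos)
    return hits
```

In this sketch, an AT-rich stretch followed by a GC-rich stretch produces hits near the boundary, because the upstream window is less stable than the downstream one. Making the cutoffs depend on the local GC content, as the abstract describes for the TSS-TLS based values, would mean looking up `e_cutoff` and `d_cutoff` per GC bin instead of passing fixed numbers.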
