This research investigates the application of machine learning techniques in computational genomics across two distinct domains: (1) predicting the source of bacterial pathogens using whole-genome sequencing data, and (2) the functional annotation of genes using single-cell RNA sequencing data. This work proposes a bioinformatics pipeline tailored for identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms. This methodology is applied to bacterial pathogens such as Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex. Phylogenetic analyses, along with pan-genome and positive selection studies, show that genomic variants and evolutionary patterns of S. Typhimurium vary across sources, which suggests that sources can be accurately attributed from genomic variants using machine learning. We benchmarked seven traditional machine learning algorithms, achieving an accuracy of 94.6% in host prediction for S. Typhimurium with the Random Forest model; SHAP value analyses elucidated the key predictive features. Next, the focus shifts to the prediction of Gene Ontology terms for Arabidopsis genes using single-cell RNA-seq data. This analysis offers a detailed comparison of gene expression in root versus shoot tissues, juxtaposed with insights from bulk RNA-seq data. The integration of regulatory network data from DAP-seq significantly enhances the prediction accuracy of gene functions. / Master of Science / This work applies machine learning techniques to two areas in computational biology: predicting the hosts of bacterial pathogens based on their genome data, and predicting the functions of plant genes using single-cell gene expression data.
The first part develops a method to analyze genome sequences from bacterial pathogens such as Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex, identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms, which are variations in the genetic code. Studying the evolutionary relationships and genetic diversity among different strains establishes the motivation for using machine learning models to predict the sources (e.g., poultry, swine) of the pathogen genomes. Several machine learning models are then trained on these datasets, and the most important factors contributing to the predictions are identified. The second part focuses on predicting the functions of genes in the model plant species Arabidopsis thaliana, using gene expression data measured at the single-cell level to train machine learning models that assign standardized gene function descriptions called Gene Ontology (GO) terms. By comparing results from single-cell and bulk tissue data, the study evaluates whether the higher resolution of single-cell data improves gene function prediction accuracy. Additionally, by incorporating information about gene regulation from a specialized experiment (DAP-seq), the role of gene expression control in determining gene functions is explored.
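To make the notion of gene presence/absence features concrete, the following is a minimal sketch of how such a feature matrix might be assembled before being passed to a classifier. It is not the thesis pipeline: the function names, the isolate labels, and the gene names are all hypothetical, and the input is assumed to be a precomputed set of gene identifiers per genome.

```python
from typing import Dict, List, Set, Tuple


def presence_absence_matrix(
    genomes: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genome_ids, gene_ids, matrix).

    matrix[i][j] is 1 if genome i carries gene j, otherwise 0.
    """
    genome_ids = sorted(genomes)
    gene_ids = sorted(set().union(*genomes.values())) if genomes else []
    matrix = [
        [1 if gene in genomes[gid] else 0 for gene in gene_ids]
        for gid in genome_ids
    ]
    return genome_ids, gene_ids, matrix


def drop_uninformative(
    gene_ids: List[str], matrix: List[List[int]]
) -> Tuple[List[str], List[List[int]]]:
    """Drop genes found in every genome or in none.

    Such columns are constant, so they cannot help separate sources.
    """
    n = len(matrix)
    keep = [
        j for j in range(len(gene_ids))
        if 0 < sum(row[j] for row in matrix) < n
    ]
    return [gene_ids[j] for j in keep], [[row[j] for j in keep] for row in matrix]


# Hypothetical toy input: isolate name -> set of genes detected in it.
toy = {
    "isolate_poultry_1": {"geneA", "geneB", "geneC"},
    "isolate_poultry_2": {"geneA", "geneB"},
    "isolate_swine_1": {"geneA", "geneD"},
}

genome_ids, gene_ids, matrix = presence_absence_matrix(toy)
accessory_ids, accessory_matrix = drop_uninformative(gene_ids, matrix)
```

In this toy case geneA is shared by all three isolates and is filtered out, leaving only the variable genes as candidate features. Each row of the resulting matrix could then be paired with a source label (e.g., poultry, swine) to train a model such as a Random Forest.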
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/119357 |
Date | 07 June 2024 |
Creators | Chinnareddy, Sandeep |
Contributors | Computer Science and Applications, Li, Song, Liao, Jingqiu, Wang, Xuan, Zhang, Liqing |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |