Machine learning enabled bioinformatics tools for analysis of biologically diverse samples

Advanced molecular profiling technologies that cover the entire human genome have opened new avenues to study biological systems. In recent decades, they have generated vast volumes of multi-omics data spanning a broad range of phenotypes. Developing advanced bioinformatics tools to identify informative biomarkers from these data has become increasingly important, especially for understanding the biological pathways responsible for disease development.

The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, the test statistics employed by many prevailing methods do not satisfy the exact definition of a marker gene, which leaves them susceptible to identifying imprecise signatures. The problem is further compounded when identifying marker genes across biologically diverse samples, especially when comparing more than two biological conditions.

Additionally, traditional differential group analysis or co-expression analysis under a single condition often falls short in certain scenarios. For instance, the low expression levels of transcription factors (TFs) make them difficult to detect, despite their pivotal role in regulating gene expression. Mapping the intricate network landscape of complex diseases and isolating core genes for subsequent analysis are challenging tasks, yet these genes are instrumental in identifying potential key pathways.

Multi-omics data, with its inherent complexity and diversity, presents challenges that traditional methods struggle to address effectively. To overcome these challenges, it is vital to develop and adopt new methods tailored to this complexity and diversity.

In response to these challenges, we have pioneered the Cosine-based One-sample Test (COT), a method meticulously crafted for the analysis of biologically diverse samples. Tailored to discern marker genes across a spectrum of subtypes using their expression profiles, COT employs a one-sample test framework. The test statistic within COT utilizes cosine similarity, comparing a molecule's expression profile across various subtypes with the precise mathematical representation of ideal marker genes.
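The cosine-based statistic described above can be illustrated with a minimal sketch. It assumes that an ideal marker gene for subtype k is represented by a one-hot vector (expressed in subtype k only) and that each gene is summarized by its non-negative mean expression per subtype. The function names are hypothetical and are not the API of the COT Python package; the one-sample test that converts the statistic into significance levels is not reproduced here.

```python
import math


def cosines_to_ideal_markers(subtype_means):
    """Cosine similarity between a gene's profile and each one-hot ideal marker.

    For a one-hot reference vector e_k, cos(x, e_k) reduces to x_k / ||x||.
    """
    norm = math.sqrt(sum(v * v for v in subtype_means))
    if norm == 0.0:
        # A gene with no expression anywhere is not a marker of any subtype.
        return [0.0] * len(subtype_means)
    return [v / norm for v in subtype_means]


def cot_statistic(subtype_means):
    """Return (largest cosine, index of the best-matching subtype)."""
    cosines = cosines_to_ideal_markers(subtype_means)
    best = max(range(len(cosines)), key=cosines.__getitem__)
    return cosines[best], best


def rank_marker_candidates(profiles):
    """Sort genes from most to least marker-like (assumed ranking rule)."""
    scored = [(gene, *cot_statistic(means)) for gene, means in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical mean expression of three genes across three subtypes.
profiles = {
    "geneA": [9.0, 0.5, 0.5],   # mostly subtype 0
    "geneB": [5.0, 5.0, 5.0],   # uniformly expressed, not a marker
    "geneC": [0.2, 0.1, 7.0],   # mostly subtype 2
}
ranked = rank_marker_candidates(profiles)
for gene, score, subtype in ranked:
    print(f"{gene}: cosine={score:.3f}, subtype={subtype}")
```

A cosine close to 1 means the gene's profile points almost exactly along one subtype axis, which is why a uniformly expressed gene scores low regardless of its absolute expression level.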

To ensure ease of application and accessibility, we have encapsulated the COT workflow in a Python package. To assess its effectiveness, we conducted a comprehensive evaluation on realistic simulation data, comparing the marker gene detection performance of COT against that of peer methods. We further demonstrated the effectiveness of COT on both gene expression and proteomics data from enriched tissue or cell subtype samples, where it showed superior performance. These applications led to novel findings and hypotheses in several biomedical case studies.

Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between different conditions, where significantly rewired network modes serve as informative biomarkers. Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we accounted for imbalanced sample sizes, integrated several acceleration strategies to handle large datasets, and enhanced the network presentation with more informative displays, including a color-coded differential dependency network and a gradient heatmap. We applied DDN 3.0 to simulated and real data to detect critical changes in molecular network topology. The current tool serves as a valuable blueprint for the development and validation of mechanistic disease models. This foundation supports a coherent interpretation of data, deepens our understanding of disease biology, and generates new hypotheses for subsequent validation and exploration.
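The idea of network rewiring can be illustrated with a simplified sketch. Note that this is not the block-wise Lasso model of DDN: here a separate node-wise Lasso (neighborhood selection, solved by coordinate descent) is fitted for each condition, and the edges present in only one condition are reported. DDN instead fits both conditions jointly. All function names, the penalty value, and the toy data are assumptions made for illustration.

```python
import random


def standardize(values):
    """Center a column and scale it to unit mean square."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    scale = (sum(c * c for c in centered) / n) ** 0.5
    return [c / scale for c in centered]


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(predictors, response, penalty, sweeps=100):
    """Minimize (1/2n)||y - Xb||^2 + penalty * |b|_1 on standardized columns."""
    n = len(response)
    coefs = [0.0] * len(predictors)
    residual = list(response)
    for _ in range(sweeps):
        for j, column in enumerate(predictors):
            rho = sum(c * r for c, r in zip(column, residual)) / n + coefs[j]
            updated = soft_threshold(rho, penalty)
            shift = updated - coefs[j]
            if shift:
                residual = [r - shift * c for r, c in zip(residual, column)]
                coefs[j] = updated
    return coefs


def dependency_edges(columns, penalty):
    """Regress each gene on all others; a nonzero coefficient is an edge."""
    cols = [standardize(c) for c in columns]
    edges = set()
    for target in range(len(cols)):
        others = [k for k in range(len(cols)) if k != target]
        coefs = lasso_coordinate_descent(
            [cols[k] for k in others], cols[target], penalty
        )
        for k, coef in zip(others, coefs):
            if coef != 0.0:
                edges.add((min(target, k), max(target, k)))
    return edges


def rewired_edges(condition_a, condition_b, penalty=0.3):
    """Edges found in exactly one of the two conditions."""
    return dependency_edges(condition_a, penalty) ^ dependency_edges(condition_b, penalty)


def simulate(partner, n_samples, rng):
    """Toy data: gene 0 is coupled to gene `partner`; the third gene is free."""
    g0 = [rng.gauss(0, 1) for _ in range(n_samples)]
    coupled = [v + rng.gauss(0, 0.3) for v in g0]
    free = [rng.gauss(0, 1) for _ in range(n_samples)]
    return [g0, coupled, free] if partner == 1 else [g0, free, coupled]


rng = random.Random(0)
condition_a = simulate(1, 200, rng)  # gene 0 depends on gene 1
condition_b = simulate(2, 200, rng)  # gene 0 depends on gene 2
rewired = rewired_edges(condition_a, condition_b)
print(sorted(rewired))
```

In the toy data, gene 0 is coupled to gene 1 only in the first condition and to gene 2 only in the second, so those two edges are the ones a rewiring analysis is expected to flag. Fitting the conditions separately, as done here, makes the result sensitive to unequal sample sizes, which is one of the issues a joint model is better placed to handle.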

Looking ahead, we plan to expand the scope of tools like COT and DDN 3.0 to a wider range of multi-omics data, including datasets from longitudinal studies and clinical trials, where data complexity is greater still. We believe that these tools can facilitate a more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline for extracting significant biomarkers and actionable insights from multi-omics data. This is a promising step towards precision medicine, where individual genomic information can guide personalized treatment strategies.

Doctor of Philosophy

Recent advances in technology have allowed us to study human biology on a much larger scale than ever before. These technologies have produced a lot of data on many different types of traits. As a result, it's becoming increasingly important to develop tools that can sift through this data and find meaningful biomarkers, that is, indicators that can help us understand what causes diseases.

Two key parts of this process are identifying 'signature genes' and analyzing groups of genes that work together differently depending on the circumstances. However, current methods have drawbacks: they don't always pick out the right genes, and they can struggle when comparing more than two groups at once.

There are also other challenges when it comes to identifying groups of genes that are expressed differently or work together under one set of conditions. For instance, some important genes, known as transcription factors (TFs), control the activity of other genes. Because TFs are often expressed at low levels, they're hard to detect, even though they play a key role in controlling gene activity. It can also be tough to identify 'hub' genes, which are central to gene networks and can help us understand the potential key pathways in diseases.

To address these challenges, we introduced the Cosine-based One-sample Test (COT), a novel approach to identify marker genes across diverse samples. COT measures how closely a gene's expression profile matches the definition of an ideal marker gene. Our evaluations show that COT performs robustly, paving the way for a deeper understanding of disease.

Further enhancing our toolkit, we've refined the Differential Dependency Network (DDN), a method to unravel the dynamic interplay of genes under diverse conditions. DDN 3.0 is a more robust iteration, adept at accommodating varied sample sizes, efficiently processing vast datasets, and offering richer visualizations of gene networks. Its prowess in pinpointing crucial alterations in gene networks is noteworthy.

The Cosine-based One-sample Test (COT) and the Differential Dependency Network (DDN) are tools with the potential to significantly advance genomics research. COT measures how closely a gene's expression pattern matches that of an ideal marker gene, which makes it a valuable aid in the search for marker genes. It acts as a fine sieve, screening vast datasets to reveal these genetic signposts. DDN, in turn, offers a comprehensive framework for deciphering gene interactions under diverse conditions. It analyzes the interplay between genes, spotlighting potential 'hub' genes and highlighting shifts in their relationships.

Together, COT and DDN not only pave the way for the identification of pivotal marker genes but also furnish a richer, more nuanced understanding of the genomic landscape. By leveraging these tools, researchers are empowered to unravel the intricate tapestry of genes, laying the foundation for groundbreaking discoveries in genomics.

Looking to the future, we plan to apply COT and DDN 3.0 to more complex datasets. We believe these tools will give us a better understanding of how diseases develop and progress. By integrating these methods with other advanced tools, we're aiming to create a complete system for extracting important biomarkers and insights from this complex data. This is a big step towards precision medicine, where a person's unique genetic information could guide their treatment strategy.

machine learning

biomarkers

pathway analysis

differential network analysis

multi-omics integration

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/116137
Date	25 August 2023
Creators	Lu, Yingzhou
Contributors	Electrical and Computer Engineering, Wang, Yue J., Lu, Chang Tien, Yu, Guoqiang, Chantem, Thidapat, Abbott, Amos L.
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Language	English
Detected Language	English
Type	Dissertation
Format	ETD, application/pdf
Rights	Creative Commons Attribution-NonCommercial 4.0 International, http://creativecommons.org/licenses/by-nc/4.0/
