1.
Machine learning enabled bioinformatics tools for analysis of biologically diverse samples. Lu, Yingzhou, 25 August 2023.
Advanced molecular profiling technologies that interrogate the entire human genome have opened new avenues for studying biological systems. In recent decades, these technologies have generated vast volumes of multi-omics data spanning a broad range of phenotypes, making the development of advanced bioinformatics tools to identify informative biomarkers increasingly important. Such tools are crucial for extracting meaningful biomarkers from these data, particularly for understanding the biological pathways responsible for disease development.
The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, the test statistics employed by many prevailing methods do not align with the exact definition of a marker gene, leaving them susceptible to identifying imprecise signatures. The problem is further compounded when marker genes must be identified across biologically diverse samples, especially when more than two biological conditions are compared.
Additionally, traditional differential expression or co-expression analysis under a single condition often falls short. For instance, transcription factors (TFs) play a pivotal role in regulating gene expression, yet their low expression levels make them difficult to detect. Mapping the network landscape of complex diseases and isolating core genes for downstream analysis are equally challenging tasks, even though such marker genes are instrumental in identifying potentially pivotal pathways.
Multi-omics data, with its inherent complexity and diversity, presents unique challenges that traditional methods struggle to address effectively. Overcoming these challenges requires the development and adoption of innovative methods tailored to this complexity.
In response to these challenges, we developed the Cosine-based One-sample Test (COT), a method designed for the analysis of biologically diverse samples. COT identifies marker genes across a spectrum of subtypes from their expression profiles using a one-sample test framework. Its test statistic is the cosine similarity between a molecule's expression profile across subtypes and the exact mathematical representation of an ideal marker gene.
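To make the idea concrete, the following minimal sketch (plain NumPy, not the COT package's actual API; the gene names and expression values are hypothetical) scores each gene by the cosine similarity between its cross-subtype expression profile and a one-hot vector representing an ideal marker of its dominant subtype:

    import numpy as np

    def cot_like_score(profile):
        """Cosine similarity between a gene's cross-subtype expression profile
        and the ideal marker vector of its dominant subtype (a one-hot vector).
        A value near 1 means the gene is expressed almost exclusively in one
        subtype, matching the ideal marker-gene definition."""
        profile = np.asarray(profile, dtype=float)
        ideal = np.zeros_like(profile)
        ideal[np.argmax(profile)] = 1.0  # ideal marker of the dominant subtype
        return float(profile @ ideal / (np.linalg.norm(profile) * np.linalg.norm(ideal)))

    # Hypothetical mean expression of three genes across four subtypes
    profiles = {
        "GENE_A": [9.2, 0.3, 0.4, 0.2],  # near-exclusive to subtype 1: score ~0.998
        "GENE_B": [3.1, 2.9, 3.0, 3.2],  # uniformly expressed: score ~0.52
        "GENE_C": [0.1, 0.2, 7.8, 0.3],  # near-exclusive to subtype 3: score ~0.999
    }
    for gene, prof in profiles.items():
        print(gene, round(cot_like_score(prof), 3))

COT builds a formal one-sample test around this type of statistic rather than simply ranking the scores.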
To ensure ease of application and accessibility, we have packaged the COT workflow as a Python package. Using realistic simulation data, we compared COT's marker-gene detection performance against existing methods. We then demonstrated its superior performance on gene expression and proteomics data derived from enriched tissue or cell subtype samples, which led to novel findings and hypotheses in several biomedical case studies.
Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between conditions, where significantly rewired network modules serve as informative biomarkers. Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we account for imbalanced sample sizes, integrate several acceleration strategies so the method can handle large datasets, and enhance network visualization with color-coded differential dependency networks and gradient heatmaps. We applied DDN 3.0 to simulated and real data to detect critical changes in molecular network topology. The tool provides a valuable blueprint for developing and validating mechanistic disease models, supporting a coherent interpretation of data, deepening our understanding of disease biology, and generating new hypotheses for subsequent validation and exploration.
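As a rough illustration of differential dependency analysis (not the DDN 3.0 implementation, which fits a block-wise Lasso model jointly across conditions), one can estimate a sparse neighborhood network per condition and flag edges that appear in only one of them; the data below are synthetic:

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighborhood_edges(X, alpha=0.2, tol=1e-6):
        """Estimate an undirected dependency network by sparse neighborhood
        selection: regress each variable on all the others with the Lasso
        and keep an edge (i, j) if its coefficient is non-negligible."""
        p = X.shape[1]
        edges = set()
        for j in range(p):
            y = X[:, j]
            Z = np.delete(X, j, axis=1)
            coef = Lasso(alpha=alpha).fit(Z, y).coef_
            others = [i for i in range(p) if i != j]
            edges.update(tuple(sorted((j, k))) for k, c in zip(others, coef) if abs(c) > tol)
        return edges

    def rewired_edges(X_cond1, X_cond2, alpha=0.2):
        """Edges present in one condition's network but not the other,
        a crude proxy for differential dependency (network rewiring)."""
        return neighborhood_edges(X_cond1, alpha) ^ neighborhood_edges(X_cond2, alpha)

    # Toy example: a dependency between variables 0 and 1 exists only in condition 2
    rng = np.random.default_rng(0)
    X1 = rng.normal(size=(300, 5))
    X2 = rng.normal(size=(300, 5))
    X2[:, 1] = 0.8 * X2[:, 0] + 0.2 * rng.normal(size=300)
    print(rewired_edges(X1, X2))  # expected to contain the edge (0, 1)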
Looking ahead, our vision is to expand the scope of tools like COT and DDN 3.0 to the broader realm of multi-omics data, including datasets from longitudinal studies and clinical trials, where data complexity scales to new heights. We believe these tools can facilitate a more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline for extracting significant biomarkers and actionable insights from multi-omics data, a promising step toward precision medicine, in which individual genomic information can guide personalized treatment strategies. / Doctor of Philosophy / Recent advances in technology have allowed us to study human biology on a much larger scale than ever before. These technologies have produced vast amounts of data on many different traits, making it increasingly important to develop tools that can sift through the data and find meaningful biomarkers, indicators that can help us understand what causes diseases.
Two key parts of this process are identifying 'signature genes' and analyzing groups of genes that work together differently depending on the circumstances. However, current methods have drawbacks: they do not always pick out the right genes and can struggle when comparing more than two groups at once.
There are also challenges in identifying groups of genes that are expressed differently or work together under a single set of conditions. For instance, transcription factors (TFs) control the activity of other genes, but because they are often expressed at low levels, they are hard to detect despite their key regulatory role. It can also be difficult to identify 'hub' genes, which are central to gene networks and can point to key disease pathways.
To address these challenges, we introduced the Cosine-based One-sample Test (COT), a novel approach to identifying pivotal genes across diverse samples. COT gauges how closely a gene's expression profile matches the definition of an ideal marker gene. Our evaluations show COT's robust performance, paving the way for a deeper understanding of disease.
We have also refined the Differential Dependency Network (DDN), a method for unraveling how gene interactions change under different conditions. DDN 3.0 is a more robust iteration that accommodates imbalanced sample sizes, efficiently processes large datasets, and offers richer visualizations of gene networks, making it effective at pinpointing crucial alterations in network structure.
The Cosine-based One-sample Test (COT) and the Differential Dependency Network (DDN) complement each other. COT precisely gauges how well a gene's expression pattern matches a predefined ideal marker profile, acting as a fine-tuned sieve that screens vast datasets for marker genes. DDN, in turn, provides a framework for deciphering gene interactions under diverse conditions, spotlighting potential 'hub' genes and highlighting shifts in their dynamic relationships.
Together, COT and DDN not only pave the way for the identification of pivotal marker genes but also furnish a richer, more nuanced understanding of the genomic landscape. By leveraging these tools, researchers are empowered to unravel the intricate tapestry of genes, laying the foundation for groundbreaking discoveries in genomics.
Looking to the future, we plan to apply COT and DDN 3.0 to more complex datasets. We believe these tools will give us a better understanding of how diseases develop and progress. By integrating these methods with other advanced tools, we're aiming to create a complete system for extracting important biomarkers and insights from this complex data. This is a big step towards precision medicine, where a person's unique genetic information could guide their treatment strategy.
2.
AI for Omics and Imaging Models in Precision Medicine and Toxicology. Bussola, Nicole, 01 July 2022.
This thesis develops an Artificial Intelligence (AI) approach intended for accurate patient stratification and precise diagnostics/prognostics in clinical and preclinical applications. The rapid advance of high-throughput technologies and bioinformatics tools is still far from precisely linking genome-phenotype interactions with the biological mechanisms that underlie pathophysiological conditions. In practice, incomplete knowledge of individual heterogeneity in complex diseases keeps forcing clinicians to settle for surrogate endpoints and therapies based on a generic one-size-fits-all approach. The working hypothesis is that AI can provide new tools to elaborate and integrate the rich information now available from high-throughput omics and bioimaging data into new features or structures, and that such restructured information can be applied through predictive models for the precision medicine paradigm, thus favoring the creation of safer, tailored treatments for specific patient subgroups. The computational techniques in this thesis combine dimensionality reduction methods with Deep Learning (DL) architectures to learn meaningful transformations between the input space and the predictive endpoint space. The rationale is that such transformations can introduce intermediate spaces offering more succinct representations, in which data from different sources are summarized. The research goal was addressed at increasing levels of complexity, starting from single input modalities (omics and bioimaging of different types and scales) and moving to their multimodal integration. The approach also deals with the key challenges for machine learning (ML) on biomedical data, i.e., reproducibility, stability, and interpretability of the models. Along this path, the thesis contribution is the development of a set of specialized AI models and a core framework of three tools of general applicability:
i. A Data Analysis Plan (DAP) for model selection and evaluation of classifiers on omics and imaging data that avoids selection bias (a toy illustration of the selection-bias pitfall follows this abstract).
ii. The histolab Python package, which standardizes reproducible pre-processing of Whole Slide Images (WSIs), is supported by automated testing, and integrates easily into DL pipelines for Digital Pathology.
iii. Unsupervised and dimensionality reduction techniques based on the UMAP and TDA frameworks for patient subtyping.
The framework has been successfully applied to public as well as original data in precision oncology and predictive toxicology. In the clinical setting, this thesis has developed:
1. (DAPPER) A deep learning framework for evaluation of predictive models in Digital Pathology that controls for selection bias through properly designed data partitioning schemes.
2. (RADLER) A unified deep learning framework that combines radiomics features and imaging on PET-CT images for prognostic biomarker development in head and neck squamous cell carcinoma; the mixed deep learning/radiomics approach is more accurate than using only one feature type.
3. An ML framework for automated quantification of tumor-infiltrating lymphocytes (TILs) in onco-immunology, validated on original Neuroblastoma pathology data from the Bambino Gesù Children's Hospital with high agreement with trained pathologists.
In addition, the network-based INF pipeline applies machine learning models to the combination of multiple omics layers and provides compact biomarker signatures; INF was validated on three TCGA oncogenomic datasets.
In the preclinical setting, the framework has been applied to:
1. Deep and machine learning algorithms to predict DILI status from gene expression (GE) data derived from cancer cell lines on the CMap Drug Safety dataset.
2. (ML4TOX) Deep Learning and Support Vector Machine models to predict potential endocrine disruption of environmental chemicals on the CERAPP dataset.
3. (PathologAI) A deep learning pipeline combining generative and convolutional models for preclinical digital pathology. Developed as an internal project within the FDA/NCTR AIRForce initiative and applied to predict necrosis on images from the TG-GATEs project, PathologAI aims to improve accuracy and reduce labor in the identification of lesions in predictive toxicology. Furthermore, GE microarray data were integrated with histology features in a unified multi-modal scheme combining imaging and omics data.
The solutions were developed in collaboration with domain experts and are considered promising for application.
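As an aside on the selection-bias issue that the DAP and DAPPER address through data partitioning, the following toy scikit-learn sketch (synthetic data, not the thesis's actual protocol) contrasts feature selection performed outside cross-validation, which leaks test information and inflates accuracy, with selection refit inside each training fold:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # Synthetic "omics-like" data: many features, few of them informative
    X, y = make_classification(n_samples=120, n_features=2000, n_informative=10,
                               random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    # Biased protocol: feature selection sees the whole dataset, including test folds
    X_leaky = SelectKBest(f_classif, k=50).fit_transform(X, y)
    biased = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

    # Unbiased protocol: selection is refit inside each training fold only
    honest_pipe = make_pipeline(SelectKBest(f_classif, k=50),
                                LogisticRegression(max_iter=1000))
    unbiased = cross_val_score(honest_pipe, X, y, cv=cv)

    print(f"selection outside CV (optimistic): {biased.mean():.2f}")
    print(f"selection inside CV (honest):      {unbiased.mean():.2f}")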