Multimodal, longitudinal, and mega-analysis of biomedical data

Biomedical data science is a multidisciplinary field concerned with the collection, storage, and interpretation of biomedical data. It uses annotation, algorithms, and analysis to extract knowledge and insights from structured and unstructured data for use in the development and evaluation of diagnostic tests, prognostic predictions, and therapeutic interventions. Biomedical data scientists perform this work using data that arise when samples are subjected to biochemical assays to quantitatively or qualitatively investigate their pathophysiological characteristics. Increasingly, biomedical data are generated at single-cell resolution and have consequently become far more hierarchical and multimodal in nature – that is, levels of organization encapsulate one another (e.g., samples belonging to subjects are made up of cells) and multiple biological modalities are profiled simultaneously. This paradigm shift adds significant complexity to the collection, storage, management, and analysis of biomedical data, but brings with it the promise of unprecedented insights to be gained from integrative analyses. These analyses are the focus of this dissertation, where the challenges of integrating biomedical data across multiple modalities, timepoints, and studies are examined through three research projects.

Challenges related to multimodal analysis of biomedical data will be explored through the development of MultimodalExperiment, a data structure that appropriately and efficiently represents multiomics data that are hierarchical, multimodal, and/or longitudinal in nature. A schematic of the data structure and its methods will be presented along with example usage to demonstrate how current challenges of alternative data structures are overcome, ease of data management is improved, and computational/storage efficiency is optimized.

Challenges related to longitudinal analysis of biomedical data will be explored in the context of a cohort study of cancer patients being treated with anti-programmed cell death protein 1/programmed cell death ligand 1 immunotherapies at Boston Medical Center. The progression-free survival status of study participants will be analyzed using linear mixed-effects models that incorporate longitudinal high-dimensional metabolomics data. Maps of metabolic pathways and a hypothesis will be presented to explain serum metabolites that are associated with progression-free survival status and possibly therapeutic efficacy.

Challenges related to mega-analysis of biomedical data will be explored through the creation of a pipeline to preprocess transcriptomics data from human hosts infected with tuberculosis to support machine learning and other tasks. The details of original software developed to provide more than 10,000 samples of clean, high-quality, machine-learning-ready data from all related and eligible studies in the Gene Expression Omnibus repository will be illustrated. The importance of improving diagnostic testing and therapeutic interventions for tuberculosis disease will be highlighted in the context of these data, and the specifics of why they represent a key ingredient for machine learning that helps overcome current challenges in the field will be explained.

Identifier: oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/47473
Date: 07 November 2023
Creators: Schiffer, Lucas
Contributors: Johnson, W. Evan
Source Sets: Boston University
Language: en_US
Detected Language: English
Type: Thesis/Dissertation
