Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies

Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated the advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in quantitation of biomolecules is owing to the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based omic studies. Purification of mass spectometric data is highly desired prior to subsequent differential analysis.

In this research dissertation, we majorly target at addressing the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to scan-level purification model (SPM) by considering information from extracted ion chromatogram (EIC, scan-level feature). Both IPM and SPM belong to the category of topic modeling approach, which aims to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, denoise deconvolution model (DMM) is proposed to capture the noise signals in samples based on purified profiles. Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Before we come to purification, other research topics in related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation.

Chapter 3 discusses the developed methods in the differential analysis of LC/GC-MS based omic data, specifically for the preprocessing in data of LC-MS profiled glycans. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates these applications in cancer biomarker discovery, where typical single omic and integrative analysis of multi-omic studies are included. / Ph. D. / This dissertation documents the methodology and outputs for computational deconvolution of heterogeneous omics data generated from biospecimens of interest. These omics data convey qualitative and quantitative information of biomolecules (e.g., glycans, proteins, metabolites, etc.) which are profiled by instruments named liquid (or gas) chromatography and mass spectrometer (LC/GC-MS). In the scenarios of biomarker discovery, we aim to find out the significant difference on intensities of biomolecules with respect to two specific phenotype groups so that the biomarkers can be used as clinical indicators for early stage diagnose. However, the purity of collected samples constitutes the fundamental challenge to the process of differential analysis. Instead of experimental methods that are costly and time-consuming, we treat the purification task as one of the topic modeling procedures, where we assume each observed biomolecular profile is a mixture of hidden pure source together with unwanted contaminants.

The developed models output the estimated mixture proportion as well as the underlying “topics”. With different level’s purification applied, improved discrimination power of candidate biomarkers and more biologically meaningful pathways were discovered in LC/GC-MS based multi-omic studies for liver cancer. This research work originates from a broader scope of probabilistic generative modeling, where rational assumptions are made to characterize the generation process of the observations. Therefore, the developed models in this dissertation have great potential in applications other than heterogeneous data purification discussed in this dissertation. A good example is to uncover the relationship of human gut microbiome with the host’s phenotypes of interest (e.g., disease like type-II diabetes). Similar challenges exist in how to infer the underlying intestinal flora distribution and estimate their mixture proportions.

This dissertation also covers topics of related data preprocessing and integration, but with a consistent goal in improving the performance of biomarker discovery. In summary, the research help address sample heterogeneity issue observed in LC/GC-MS based cancer biomarker discovery studies and shed light on computational deconvolution of the mixtures, which can be generalized to other domains of interest.

Identiferoai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/78201
Date14 June 2017
CreatorsWang, Minkun
ContributorsElectrical and Computer Engineering, Yu, Guoqiang, Wong, Kenneth H., Mili, Lamine M., Ressom, Habtom W., Wang, Yue J.
PublisherVirginia Tech
Source SetsVirginia Tech Theses and Dissertation
Detected LanguageEnglish
TypeDissertation
FormatETD, application/pdf
RightsIn Copyright, http://rightsstatements.org/vocab/InC/1.0/

Page generated in 0.0024 seconds