Statistical and computational methods for addressing heterogeneity in genomic data

Heterogeneity describes any variability across different datasets. In genomic studies that profile gene expression levels, heterogeneity is ubiquitous and can complicate the integrative analysis of multiple datasets. Considerable effort is therefore needed to understand and address its impact. In this dissertation, I developed novel statistical models and computational software for this purpose. I derived reference-batch ComBat and ComBat-Seq, two improved models based on the state-of-the-art method ComBat, to address one particular type of heterogeneity known as “batch effects”. I showed their benefits over existing methods in several data types and situations, and implemented these models in publicly available software. Then, I created systematic simulations to explore the impact of common study heterogeneity on the independent validation of genomic prediction models, showing that the most identifiable sources of heterogeneity are not the primary ones affecting the validation of genomic predictors. Finally, to address the unwanted impact of batch effects on prediction, I adapted a solution that uses cross-study ensemble learning to train predictors with generalizable independent performance. I compared this new framework with the traditional approach to batch correction, showing that cross-study learning may provide a more robust model in independent validation. The results in this dissertation provide insights and guidelines for working with heterogeneous gene expression profiling datasets in practice, and encourage further investigation into understanding and addressing heterogeneity in genomic studies.

https://hdl.handle.net/2144/41301

Bioinformatics

Identifier	oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/41301
Date	16 July 2020
Creators	Zhang, Yuqing
Contributors	Johnson, W. Evan
Source Sets	Boston University
Language	en_US
Detected Language	English
Type	Thesis/Dissertation
Rights	Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/
