
A Study of Machine Learning Approaches for Biomedical Signal Processing

The advent of high-throughput molecular profiling technologies makes it possible to study diverse biological systems at the molecular level. However, owing to various limitations of measurement instruments, data preprocessing is often required in biomedical research, and improper preprocessing has a negative impact on downstream analysis tasks. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization.
Missing data is a major issue in quantitative proteomics data analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, comparative assessment of their accuracy remains inconclusive, mainly because the true missing mechanisms are complex and the existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook on current and future development.
We first report an assessment of eight representative methods that collectively target three typical missing mechanisms. The selected methods are compared on both realistic simulation and real proteomics datasets, and their performance is evaluated using three quantitative measures. We then discuss fused regularization matrix factorization (FRMF), a popular low-rank matrix factorization framework with similarity and/or biological regularization, which can be extended to integrate multi-omics data such as gene expression or clinical variables. We further explore the potential application of convex analysis of mixtures (CAM), a biologically inspired latent variable modeling strategy, to missing value imputation. Preliminary results on proteomics data are provided, together with an outlook on future development directions.
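For intuition, the following is a minimal sketch of low-rank matrix factorization imputation with plain L2 regularization, fitted by gradient steps on the observed entries only; the fused similarity/biological penalty that distinguishes FRMF is indicated in a comment rather than implemented, and the function name and hyperparameters are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def mf_impute(X, rank=5, lam=0.1, lr=0.01, n_iter=500, seed=0):
    """Impute NaNs in X (samples x features) by low-rank factorization.

    Minimizes ||M * (X - U V^T)||_F^2 + lam*(||U||_F^2 + ||V||_F^2),
    where M masks the observed entries. A fused-regularization variant
    would add a similarity term such as tr(U^T L U), with L a graph
    Laplacian built from clinical or multi-omics sample similarity.
    """
    rng = np.random.default_rng(seed)
    M = ~np.isnan(X)                      # True where X is observed
    Xf = np.nan_to_num(X)                 # zeros at missing entries
    n, p = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((p, rank))
    for _ in range(n_iter):
        R = M * (Xf - U @ V.T)            # residual on observed entries only
        U, V = U + lr * (R @ V - lam * U), V + lr * (R.T @ U - lam * V)
    return np.where(M, X, U @ V.T)        # keep observed values as-is
```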
While a few winners emerged from our comparative assessment, data-driven evaluation of imputation methods remains imperfect, because performance is measured indirectly on artificially masked values rather than on authentic missing values, and imputation accuracy may vary with signal intensity. Fused regularization matrix factorization offers a way to incorporate external information, while convex analysis of mixtures presents a biologically plausible new approach.
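Since the assessment above hinges on masked-value evaluation, a minimal sketch of that protocol may be useful: hide a fraction of the observed entries, impute, and score against the hidden ground truth. The 10% masking rate and the NRMSE definition below are common choices assumed for illustration, not necessarily those used in the thesis.

```python
import numpy as np

def mask_and_score(X, impute_fn, frac=0.1, seed=0):
    """Score an imputation function on artificially masked entries."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(X))                  # observed positions
    k = int(frac * len(obs))
    hidden = obs[rng.choice(len(obs), size=k, replace=False)]
    X_masked = X.copy()
    X_masked[hidden[:, 0], hidden[:, 1]] = np.nan    # hide ground truth
    X_hat = impute_fn(X_masked)
    truth = X[hidden[:, 0], hidden[:, 1]]
    pred = X_hat[hidden[:, 0], hidden[:, 1]]
    # normalized root-mean-square error over the masked entries
    return np.sqrt(np.mean((pred - truth) ** 2)) / np.std(truth)

# e.g. nrmse = mask_and_score(X, mf_impute)   # lower is better
```

The caveat in the text applies directly: entries masked this way follow the observed-value distribution, whereas authentic missing values in proteomics are often intensity-dependent, so such scores can be optimistic.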
Data normalization is essential to ensure accurate inference and comparability of gene expression across samples or conditions. Ideally, gene expression should be rescaled according to consistently expressed reference genes. However, when normalizing biologically diverse samples, the most commonly used reference genes have exhibited striking expression variability, and distribution-based approaches can be problematic when differentially expressed genes are significantly asymmetric between up- and down-regulation.
We introduce a Cosine-score-based iterative normalization (Cosbin) strategy to normalize biologically diverse samples. Between-sample normalization is based on iteratively identified consistently expressed genes, with differentially expressed genes sequentially eliminated according to scale-invariant Cosine scores.
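As a rough illustration of the idea (not the published Cosbin algorithm), the sketch below scores each gene by the cosine similarity between its vector of group-mean expressions and the all-ones "consistently expressed" direction, a scale-invariant criterion; the lowest-scoring genes are eliminated, samples are rescaled to the surviving genes, and the process repeats. The group structure, elimination fraction, and round count are illustrative assumptions.

```python
import numpy as np

def cosbin_like(X, groups, n_rounds=5, drop_frac=0.2):
    """Iterative normalization sketch.

    X: genes x samples matrix of non-negative expression values.
    groups: 1-D array of group labels, one per sample (column).
    """
    X = X.astype(float).copy()
    labels = np.unique(groups)
    keep = np.ones(X.shape[0], dtype=bool)    # candidate consistently expressed genes
    for _ in range(n_rounds):
        # per-gene vector of group means: genes x groups
        G = np.column_stack([X[:, groups == g].mean(axis=1) for g in labels])
        ones = np.ones(G.shape[1])
        # scale-invariant cosine score against the all-ones direction;
        # consistently expressed genes score close to 1
        cos = (G @ ones) / (np.linalg.norm(G, axis=1) * np.sqrt(len(ones)) + 1e-12)
        # eliminate the least consistent genes among current candidates
        idx = np.where(keep)[0]
        worst = idx[np.argsort(cos[idx])][: int(drop_frac * len(idx))]
        keep[worst] = False
        # rescale every sample so the surviving genes match in mean signal
        X *= X[keep].mean() / X[keep].mean(axis=0, keepdims=True)
    return X, keep                             # normalized data, CE gene mask
```

The surviving genes then play the role of data-driven reference genes for the final between-sample scaling.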
We evaluate the performance of Cosbin and four other representative normalization methods (Total count, TMM/edgeR, DESeq2, DEGES/TCC) on both idealistic and realistic simulation data sets. Cosbin consistently outperforms the other methods across various performance criteria. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel phenotypic groups.

Master of Science

Data preprocessing is often required in biomedical research due to various limitations of measurement instruments. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization.
Missing data is a major issue in quantitative proteomics data analysis. Imputation is the process of substituting estimated values for missing ones. We propose a more realistic assessment workflow that preserves the original data distribution, and use it to assess eight representative general-purpose imputation strategies. We then explore two biologically inspired imputation approaches: fused regularization matrix factorization (FRMF) and convex analysis of mixtures (CAM) imputation. FRMF integrates external information, such as clinical variables and multi-omics data, into the imputation, while CAM imputation incorporates biological assumptions. We show that integrating biological information improves imputation performance.
Data normalization is required to ensure correct comparisons; for gene expression data, between-sample normalization is needed. We propose a Cosine-score-based iterative normalization (Cosbin) strategy to normalize biologically diverse samples, and show that Cosbin significantly outperforms other methods on both ideal and realistic simulations. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel cell types.

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/112783
Date: 10 June 2021
Creators: Shen, Minjie
Contributors: Electrical and Computer Engineering, Wang, Yue J., Yu, Guoqiang, Chantem, Thidapat
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
