
Contributions to Sparse Statistical Methods for Data Integration

Background: Scientists are measuring multiple sources of massive, complex, and diverse data in hopes of better understanding the principles underpinning complex phenomena. Sophisticated statistical and computational methods that reduce data complexity, harness variability, and integrate multiple sources of information are required. The ‘sparse’ class of multivariate statistical methods is emerging as a promising solution to these data-driven challenges, but it lacks broad application, testing, and development.
Methods: The work in this thesis is three-fold. First, sparse principal component analysis (sparse PCA) and sparse canonical correlation analysis (sparse CCA) are applied to a large toxicogenomic database to uncover candidate genes associated with drug toxicity. Second, extensive simulations test and compare the performance of many sparse CCA methods, determining which are most accurate under a variety of realistic, large-data scenarios. Finally, the performance of the non-parametric bootstrap is examined to determine its ability to generate inferential measures for sparse CCA.
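To make the sparse CCA idea concrete, the following is a minimal NumPy sketch of one well-known approach: alternating soft-thresholded power iterations on the cross-covariance matrix, in the style of penalized matrix decomposition. It is purely illustrative; the penalty levels `lam_u` and `lam_v` and the fixed iteration count are assumptions, and this is not any particular method compared in the thesis.

```python
import numpy as np

def soft_threshold(a, lam):
    # Elementwise soft-thresholding: shrinks entries toward zero and
    # sets small ones exactly to zero, which is what induces sparsity.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam_u=0.0, lam_v=0.0, n_iter=100):
    """First pair of sparse canonical vectors (u, v) via alternating
    soft-thresholded power iterations on the cross-covariance matrix.
    A PMD-style sketch; columns of X and Y are assumed centered."""
    C = X.T @ Y                         # cross-covariance matrix
    v = np.linalg.svd(C)[2][0]          # init: leading right singular vector
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam_u)
        norm_u = np.linalg.norm(u)
        if norm_u > 0:
            u /= norm_u
        v = soft_threshold(C.T @ u, lam_v)
        norm_v = np.linalg.norm(v)
        if norm_v > 0:
            v /= norm_v
    return u, v
```

The nonzero entries of `u` and `v` are the selected variables — in the toxicogenomic application, these would correspond to candidate genes driving the cross-correlation.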
Results: The applications yield several groups of candidate genes that point researchers toward promising genetic profiles of drug toxicity. The simulations expose one sparse CCA method that outperforms the rest in the majority of data scenarios, while suggesting a combination of complementary sparse CCA methods for specific data conditions. The bootstrap simulations show the bootstrap to be a suitable means of inference for the canonical correlation coefficient in sparse CCA, but only when the sample size approaches the number of variables. They also show that aggregating sparse CCA results across many bootstrap samples can improve the accuracy of detecting truly cross-correlated features.
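The two bootstrap ideas above — a percentile interval for the canonical correlation and aggregating which variables are selected across resamples — can be sketched as follows. This is an illustrative sketch, not the thesis's procedure: the `fit_fn(X, Y) -> (u, v)` interface, `n_boot`, and the 95% percentile interval are all assumptions.

```python
import numpy as np

def bootstrap_sparse_cca(X, Y, fit_fn, n_boot=200, seed=0):
    """Non-parametric bootstrap around a sparse CCA fitter:
    resample rows with replacement, refit, and collect
    (i) the canonical correlation for a percentile interval, and
    (ii) how often each X-variable is selected (nonzero),
    aggregated across replicates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    corrs = []
    selected = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        u, v = fit_fn(X[idx], Y[idx])
        xs, ys = X[idx] @ u, Y[idx] @ v
        if xs.std() > 0 and ys.std() > 0:       # guard against all-zero vectors
            corrs.append(abs(np.corrcoef(xs, ys)[0, 1]))
        selected += (u != 0)
    lo, hi = np.percentile(corrs, [2.5, 97.5])  # 95% percentile interval
    return (lo, hi), selected / n_boot          # interval, selection proportions
```

Variables with a high selection proportion across replicates are the ones a researcher would treat as stably cross-correlated, which is the aggregation idea the results describe.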
Conclusions: Sparse multivariate methods can flexibly handle challenging integrative analysis tasks. The work in this thesis demonstrates their much-needed utility in the field of toxicogenomics, strengthens our knowledge of how they perform within a complex, massive-data framework, and promotes the use of bootstrapped inferential measures.

Thesis
Doctor of Philosophy (PhD)

Due to rapid advances in technology, many areas of scientific research are measuring multiple sources of massive, complex, and diverse data in hopes of better understanding the principles underpinning puzzling phenomena. Now, more than ever, advancement and discovery rely upon sophisticated and robust statistical and computational methods that reduce data complexity, harness variability, and integrate multiple sources of information. In this thesis, I test and validate the ‘sparse’ class of multivariate statistical methods, which is emerging as a promising, fresh solution to these data-driven challenges. Using publicly available data from genetic toxicology as motivation, I demonstrate the utility of these methods, determine where they work best, and explore the possibility of improving their scientific interpretability. The work in this thesis contributes to both the biostatistics and genomics literature by meshing rigorous statistical methodology with real-world data applications.

Identifier oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/24009
Date January 2018
Creators Bonner, Ashley
Contributors Beyene, Joseph; Hamid, Jemila; Canty, Angelo; Health Research Methodology
Source Sets McMaster University
Language English
Detected Language English
Type Thesis
