Statistical Learning of Proteomics Data and Global Testing for Data with Correlations

<div>This dissertation consists of two parts. The first part is a collaborative project with Dr. Szymanski's group in Agronomy at Purdue to predict protein complex assemblies and interactions. Proteins in the leaf cytosol of Arabidopsis were fractionated using Size Exclusion Chromatography (SEC) and mixed-bed Ion Exchange Chromatography (IEX).</div><div>Protein mass spectrometry data were obtained for the two separation platforms, with two replicates of each. We combine the four data sets and conduct a series of statistical learning steps, including 1) data filtering, 2) a two-round hierarchical clustering to integrate multiple data types, 3) validation of the clustering against known protein complexes, and</div><div>4) mining dendrogram trees to predict protein complexes. Our method is developed for integrative analysis of different data types, and it eliminates the difficulty of choosing an appropriate number of clusters in clustering analysis. It provides a statistical learning tool to globally analyze the oligomerization state of a system of protein complexes.</div><div><br></div><div><br></div><div>The second part examines global hypothesis testing under sparse alternatives and arbitrarily strong dependence. Global tests are used to aggregate information and reduce the burden of multiple testing. A common situation in modern data analysis is that variables with nonzero effects are sparse. The minimum p-value and higher criticism tests are particularly effective under sparse alternatives, where they are more powerful than the F test. This is the common setting in genome-wide association study (GWAS) data. However, arbitrarily strong dependence among variables poses a great challenge to the p-value calculation of these optimal tests. We develop a latent variable adjusted method to correct the minimum p-value test. After adjustment, the test statistics become weakly dependent and the corresponding null distributions are valid.
We show that if the latent variable is not related to the response variable, power can be improved. Simulation studies show that our method is more powerful than other methods in settings with highly sparse signals and correlated marginal tests. We also show its application to a real dataset.</div>
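
For readers unfamiliar with the minimum p-value test mentioned in the abstract, the following is a minimal sketch of its textbook form under the assumption that the marginal p-values are independent. It is not the latent variable adjusted method developed in the dissertation; the function name `min_p_global_test` and the parameter `alpha` are hypothetical, chosen only for illustration. Under the global null with n independent uniform p-values, the smallest one satisfies P(p_min <= t) = 1 - (1 - t)^n, which yields the global p-value computed below. Strong dependence among the marginal tests invalidates this null distribution, which is the problem the abstract describes.

```python
import math


def min_p_global_test(p_values, alpha=0.05):
    """Minimum p-value global test, ASSUMING independent marginal p-values.

    Illustrative sketch only: this is the classical independence-based
    calibration, not the dissertation's latent variable adjusted method.
    Returns (global_p, reject).
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("p_values must not be empty")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")

    p_min = min(p_values)
    if p_min >= 1.0:
        # All p-values equal 1: no evidence against the global null.
        return 1.0, False

    # Global p-value 1 - (1 - p_min)^n, evaluated with log1p/expm1
    # so that very small p_min values do not lose precision.
    global_p = -math.expm1(n * math.log1p(-p_min))
    return global_p, global_p <= alpha


if __name__ == "__main__":
    # One strong signal among ten marginal tests (a sparse alternative).
    sparse = [0.001] + [0.5] * 9
    print(min_p_global_test(sparse))
```

Because the statistic depends only on the single smallest p-value, one strong effect among many nulls is enough to reject, which is why this kind of test suits sparse alternatives.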

10.25394/pgs.7776230.v1

mass spectrometry proteomic methods

Global test

Genome-Wide Association Study

Latent Factor Modeling

Correlation structures

Identifier	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/7776230
Date	15 May 2019
Creators	Donglai Chen (6405944)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/Statistical_Learning_of_Proteomics_Data_and_Global_Testing_for_Data_with_Correlations/7776230
