
Statistical Methods for High Dimensional Data in Environmental Genomics

In this dissertation, we propose methodology for analyzing high-dimensional genomics data, in which the observations have a large number of outcome variables in addition to exposure variables. In Chapter 1, we investigate methods for genetic pathway analysis, where we have a small number of exposure variables. We propose two methods based on canonical correlation analysis (CCA) that select outcomes either sequentially or by screening, and show that the performance of the proposed methods depends on the correlation between the genes in the pathway. We also propose and investigate a criterion for fixing the number of outcomes, and a powerful test for the exposure effect on the pathway. The methodology is applied to show that air pollution exposure affects the methylation of a few genes from the asthma pathway. In Chapter 2, we study penalized multivariate regression as an efficient and flexible method for studying the relationship between a large number of covariates and multiple outcomes. We use penalized likelihood to shrink model parameters to zero and to select only the important effects. We use the Bayesian Information Criterion (BIC) to select tuning parameters for the employed penalty and show that it chooses the right tuning parameter with high probability. These are combined in a “two-stage procedure,” and asymptotic results show that it yields a consistent, sparse, and asymptotically normal estimator of the regression parameters. The method is illustrated on gene expression data from normal and diabetic patients. In Chapter 3, we propose a method for estimating covariate-dependent principal components and covariance matrices. Covariates, such as smoking habits, can affect the variation in a set of gene methylation values. We develop a penalized regression method that incorporates covariates in the estimation of principal components. 
We show that the parameter estimates are consistent and sparse, and that using the BIC to select the tuning parameter for the penalty functions yields good models. We also propose the scree plot residual variance criterion for selecting the number of principal components. The proposed procedure is applied to show that the first three principal components of gene methylation in the asthma pathway differ between people who did not smoke and people who did.

Identifier: oai:union.ndltd.org:harvard.edu/oai:dash.harvard.edu:1/10288451
Date: January 2012
Creators: Sofer, Tamar
Contributors: Lin, Xihong
Publisher: Harvard University
Source Sets: Harvard University
Language: en_US
Detected Language: English
Type: Thesis or Dissertation
Rights: closed access
