Sparse Canonical Correlation Analysis

Large-scale genomic studies of the association of gene expression
with multiple phenotypic or genotypic measures may require the
identification of complex multivariate relationships. In
multivariate analysis, a common way to inspect the relationship
between two sets of variables based on their correlation is
Canonical Correlation Analysis, which determines linear combinations
of all variables of each type with maximal correlation between the
two linear combinations. However, in high dimensional data analysis,
when the number of variables under consideration exceeds tens of
thousands, linear combinations of the entire sets of features may
lack biological plausibility and interpretability. In addition,
insufficient sample size may lead to computational problems,
inaccurate estimates of parameters and non-generalizable results.
These problems may be solved by selecting sparse subsets of
variables, i.e. obtaining sparse loadings in the linear combinations
of variables of each type. However, available methods providing
sparse solutions, such as Sparse Principal Component Analysis,
consider each type of variables separately and focus on the
correlation within each set of measurements rather than between
sets. We introduce a new methodology, Sparse Canonical Correlation
Analysis (SCCA), which examines the relationships of many variables
of different types simultaneously. It solves the problem of
biological interpretability by providing sparse linear combinations
that include only a small subset of variables. SCCA maximizes the
correlation between the subsets of variables of different types
while performing variable selection. In large-scale genomic studies,
sparse solutions are also consistent with the belief that only a small
proportion of genes are expressed under a certain set of conditions.
In this thesis I present methodology for SCCA and evaluate its
properties using simulated data. I illustrate practical use of SCCA
by applying it to the study of natural variation in human gene
expression, using the data provided as Problem 1 of
the fifteenth Genetic Analysis Workshop (GAW15). I also present two
extensions of SCCA: adaptive SCCA and modified adaptive SCCA. Their
performance is evaluated and compared using simulated data, and
adaptive SCCA is applied to the GAW15 data.
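
The idea of sparse loadings that maximize correlation between two sets of variables can be illustrated with a minimal sketch. This is not necessarily the algorithm developed in the thesis: it assumes a generic alternating scheme that soft-thresholds the two loading vectors against the sample cross-correlation matrix. The function names (`soft_threshold`, `sparse_cca_sketch`) and the penalty parameters `lam_u` and `lam_v` are hypothetical and chosen for illustration only.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam; entries below lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _unit(a):
    """Scale a to unit Euclidean length; an all-zero vector is returned as is."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca_sketch(X, Y, lam_u=0.3, lam_v=0.3, n_iter=100, tol=1e-8):
    """Illustrative sparse CCA for one pair of loading vectors.

    X is (n, p) and Y is (n, q), with rows as samples. Columns are assumed
    to have nonzero variance. Returns (u, v, corr): sparse loadings for X
    and Y and the sample correlation between X @ u and Y @ v.
    """
    # Standardize columns so K below is the sample cross-correlation matrix.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    K = Xs.T @ Ys / (Xs.shape[0] - 1)

    # Start from the leading singular vectors of K (the dense solution).
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        # Update each loading vector given the other, then sparsify it.
        u_new = _unit(soft_threshold(K @ v, lam_u))
        v_new = _unit(soft_threshold(K.T @ u_new, lam_v))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break

    if not u.any() or not v.any():
        # Penalties were large enough to remove every variable.
        return u, v, 0.0
    corr = np.corrcoef(Xs @ u, Ys @ v)[0, 1]
    return u, v, corr
```

Larger values of `lam_u` and `lam_v` remove more variables from each linear combination; in practice such penalties would be chosen by a data-driven procedure such as cross-validation rather than fixed in advance.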

Identifier: oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OTU.1807/11243
Date: 01 August 2008
Creators: Parkhomenko, Elena
Contributors: Tritchler, David; Beyene, Joseph
Source Sets: Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language: en_ca
Detected Language: English
Type: Thesis
Format: 898248 bytes, application/pdf