Canonical Correlation and Clustering for High Dimensional Data

Multi-view datasets arise naturally in statistical genetics when the genetic
and trait profile of an individual is portrayed by two feature vectors.
A motivating problem concerning the Skin Intrinsic Fluorescence (SIF)
study on the Diabetes Control and Complications Trial (DCCT) subjects
is presented. A widely applied quantitative method to explore the correlation
structure between two domains of a multi-view dataset is the
Canonical Correlation Analysis (CCA), which seeks the canonical loading
vectors such that the transformed canonical covariates are maximally
correlated. In the high dimensional case, regularization of the dataset is
required before CCA can be applied. Furthermore, the nature of genetic
research suggests that sparse output is more desirable. In this thesis, two
regularized CCA (rCCA) methods and a sparse CCA (sCCA) method
are presented. When correlation sub-structure exists, a stand-alone CCA
method will not perform well. To tackle this limitation, a mixture of
local CCA models can be employed. In this thesis, I review a correlation
clustering algorithm proposed by Fern, Brodley and Friedl (2005),
which seeks to group subjects into clusters such that features are identically
correlated within each cluster. An evaluation study is performed
to assess the effectiveness of CCA and correlation clustering algorithms
using artificial multi-view datasets. Both sCCA and sCCA-based correlation
clustering exhibited superior performance compared to rCCA and
rCCA-based correlation clustering. The sCCA and sCCA-based clustering methods
are applied to the multi-view dataset consisting of PrediXcan-imputed gene
expression and SIF measurements of DCCT subjects. The stand-alone
sparse CCA method identified 193 of the 11538 genes as correlated
with SIF#7. Further investigation of these 193 genes with simple linear
regression and t-tests revealed that only two genes, ENSG00000100281.9
and ENSG00000112787.8, were significantly associated with SIF#7. No
plausible clustering scheme was detected by the sCCA-based correlation
clustering method. / Thesis / Master of Science (MSc)
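
For orientation, the CCA objective described in the abstract can be written in its standard textbook form. The symbols below (X, Y, a, b, Sigma, lambda, c) are notation assumed here and are not taken from the thesis. The ridge-type and lasso-type variants shown are common formulations of regularized and sparse CCA; the rCCA and sCCA methods presented in the thesis may differ in detail.

```latex
% Standard CCA. Assumed notation: X is n x p and Y is n x q (both column-centred),
% with sample covariances \Sigma_{XX}, \Sigma_{YY} and cross-covariance \Sigma_{XY}.
% a and b are the canonical loading vectors; Xa and Yb are the canonical covariates.
(a^{*}, b^{*}) = \arg\max_{a,\, b}
  \frac{a^{\top} \Sigma_{XY}\, b}
       {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}}

% Ridge-type regularization, one common choice when p or q exceeds n
% (\lambda_1, \lambda_2 > 0 are tuning parameters):
\Sigma_{XX} \;\to\; \Sigma_{XX} + \lambda_1 I_p,
\qquad
\Sigma_{YY} \;\to\; \Sigma_{YY} + \lambda_2 I_q

% Lasso-type sparse CCA, one common penalized form that treats the
% within-set covariances as identity (c_1, c_2 control sparsity):
\max_{a,\, b}\; a^{\top} \Sigma_{XY}\, b
\quad \text{s.t.} \quad
\|a\|_2 \le 1,\; \|b\|_2 \le 1,\; \|a\|_1 \le c_1,\; \|b\|_1 \le c_2
```

In the sparse form, smaller values of c_1 and c_2 force more entries of a and b to exactly zero, which is what allows a subset of genes to be singled out as correlated with a trait.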

http://hdl.handle.net/11375/24218

Machine Learning

Correlation Clustering

Sparse Canonical Correlation Analysis

Skin Intrinsic Fluorescence

Multi-view dataset

Lasso

Dimensionality reduction

PrediXcan

High dimensional data

Identifier	oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/24218
Date	January 2019
Creators	Ouyang, Qing
Contributors	Canty, Angelo, Mathematics and Statistics
Source Sets	McMaster University
Language	English
Detected Language	English
Type	Thesis