Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite-dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, the thesis compares the performance of Random Projections with k-NN against MTD Feature Selection with Random Forest for predicting coronary artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections with k-NN. Random Forest obtains an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset when 3335 SNPs are selected by MTD Feature Selection for classification. This area under the curve is considerably higher than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.

SNPs

GWAS

Data Science

Mass Transportation Distance

Dimensionality Reduction

Random Projections

Supervised Learning Theory

Coronary Artery Disease

K-Nearest Neighbour Classifier

Universal Consistency

Identifier	oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/31113
Date	January 2014
Creators	Duan, Haoyang
Contributors	Pestov, Vladimir; Wells, George
Publisher	Université d'Ottawa / University of Ottawa
Source Sets	Université d’Ottawa
Language	English
Detected Language	English
Type	Thesis
