Return to search

Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection and Random Forest for predicting artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections and k-NN. Random Forest is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.

Identiferoai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OOU.#10393/31113
Date15 May 2014
CreatorsDuan, Haoyang
Source SetsLibrary and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
LanguageEnglish
Detected LanguageEnglish
TypeThèse / Thesis

Page generated in 0.0055 seconds