1 |
Investigating the Process of Developing a KDD Model for the Classification of Cases with Cardiovascular Disease Based on a Canadian Database
Liu, Chenyu, January 2012
Medicine and health are information-intensive fields in which data volume has been increasing constantly. To make full use of these data, the technique of Knowledge Discovery in Databases (KDD) has been developed as a comprehensive pathway to discover valid and unsuspected patterns and trends that are both understandable and useful to data analysts.
The present study aimed to investigate, for the first time, the entire KDD process of developing a classification model for cardiovascular disease (CVD) from a Canadian dataset. The research data source was the Canadian Heart Health Database, which contains 265 easily collected variables and 23,129 instances from ten Canadian provinces. Many practical issues involved in the different steps of the integrated process were addressed, and possible solutions were suggested based on the experimental results. Five specific learning schemes representing five distinct KDD approaches were employed, because they had not previously been compared with one another. In addition, two improvement approaches, cost-sensitive learning and ensemble learning, were examined. The performance of the developed models was measured in many aspects.
The dataset was prepared through data cleaning and missing value imputation. Three pairs of experiments demonstrated that dataset balancing and outlier removal had a positive influence on the classifier, but variable normalization was not helpful. Three combinations of subset generation method and evaluation function were tested in the variable subset selection phase; the combination of Best-First search and Correlation-based Feature Selection performed comparably and was retained for its other benefits.
Among the five learning schemes investigated, the C4.5 decision tree achieved the best performance on the classification of CVD, followed by Multilayer Feed-forward Network, K-Nearest Neighbor, Logistic Regression, and Naïve Bayes. Cost-sensitive learning, exemplified by the MetaCost algorithm, failed to outperform the single C4.5 decision tree when the cost matrix was varied from 5:1 to 1:7. In contrast, the models developed through ensemble modeling, especially the AdaBoost M1 algorithm, outperformed the other models.
Although the model with the best performance might be suitable for CVD screening in the general Canadian population, it is not ready for use in practice. I propose some criteria to improve the further evaluation of the model. Finally, I describe some of the limitations of the study and propose potential solutions to address them throughout the KDD process. Such possibilities should be explored in further research.
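The cost-sensitive step mentioned in the abstract can be illustrated with a minimal Python sketch of the minimum-expected-cost decision rule that cost-sensitive methods such as MetaCost build on. This is not the thesis's implementation: the function name, the class ordering (0 = no CVD, 1 = CVD), the probabilities, and the reading of "5:1" as a false negative costing five times a false positive are all assumptions made for illustration.

```python
def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    (Hypothetical helper; names and layout are illustrative assumptions.)
    """
    n_classes = len(cost[0])
    expected = [
        sum(probs[i] * cost[i][j] for i in range(len(probs)))
        for j in range(n_classes)
    ]
    # Pick the prediction whose expected cost is smallest.
    return min(range(n_classes), key=expected.__getitem__)


# Assumed example: class 0 = no CVD, class 1 = CVD.
probs = [0.7, 0.3]                 # hypothetical classifier output
equal_costs = [[0, 1], [1, 0]]     # both error types cost the same
skewed_costs = [[0, 1], [5, 0]]    # assumed 5:1, missing a CVD case costs more

label_equal = min_expected_cost_class(probs, equal_costs)
label_skewed = min_expected_cost_class(probs, skewed_costs)
```

With equal costs the rule simply follows the more probable class; raising the cost of a missed case shifts the decision toward the CVD class even though its estimated probability is lower.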
|
3 |
Example Based Learning for View-Based Human Face Detection
Sung, Kah Kay; Poggio, Tomaso, 24 January 1995
We present an example-based learning approach for locating vertical frontal views of human faces in complex scenes. The technique models the distribution of human face patterns by means of a few view-based "face" and "non-face" prototype clusters. At each image location, the local pattern is matched against the distribution-based model, and a trained classifier determines, based on the local difference measurements, whether or not a human face exists at the current image location. We provide an analysis that helps identify the critical components of our system.
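The matching step described above can be sketched in Python as follows. This is a simplified illustration, not the authors' system: plain Euclidean distance stands in for whatever difference measure the report uses, the prototype vectors are made-up toy values, and a nearest-prototype rule stands in for the trained classifier. All function and variable names are hypothetical.

```python
import math


def difference_vector(pattern, prototypes):
    """Distance from one local pattern to each prototype cluster centre.

    Euclidean distance is an assumed stand-in for the report's measure.
    """
    return [math.dist(pattern, centre) for centre in prototypes]


def looks_like_face(pattern, face_prototypes, nonface_prototypes):
    """Toy decision rule: closer to a "face" prototype than to any "non-face" one.

    A trained classifier would normally consume the full difference vector;
    this nearest-prototype rule is only a placeholder for it.
    """
    face_d = difference_vector(pattern, face_prototypes)
    nonface_d = difference_vector(pattern, nonface_prototypes)
    return min(face_d) < min(nonface_d)


# Made-up 2-D "patterns" purely for illustration.
face_prototypes = [(0.0, 1.0), (1.0, 1.0)]
nonface_prototypes = [(3.0, 4.0), (5.0, 5.0)]

distances = difference_vector((0.0, 0.0), nonface_prototypes[:1] + face_prototypes[:1])
decision = looks_like_face((0.5, 1.0), face_prototypes, nonface_prototypes)
```

In a real detector each pattern would be a normalized image window rather than a 2-D point, and the function would be evaluated at every image location and scale.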
|