1 |
Investigating the Process of Developing a KDD Model for the Classification of Cases with Cardiovascular Disease Based on a Canadian Database
Liu, Chenyu, January 2012
Medicine and health care are information-intensive fields in which data volumes grow constantly. To make full use of these data, Knowledge Discovery in Databases (KDD) has been developed as a comprehensive pathway for discovering valid and unsuspected patterns and trends that are both understandable and useful to data analysts.
The present study investigated, for the first time, the entire KDD process of developing a classification model for cardiovascular disease (CVD) from a Canadian dataset. The data source was the Canadian Heart Health Database, which contains 265 easily collected variables and 23,129 instances from ten Canadian provinces. Many practical issues involved in the different steps of the integrated process were addressed, and possible solutions were suggested based on the experimental results. Five learning schemes representing five distinct KDD approaches were employed, as they had never been compared with one another. In addition, two improvement approaches, cost-sensitive learning and ensemble learning, were examined. The performance of the developed models was measured in several respects.
The dataset was prepared through data cleaning and missing-value imputation. Three pairs of experiments demonstrated that dataset balancing and outlier removal had a positive influence on the classifier, but variable normalization was not helpful. Three combinations of subset generation method and evaluation function were tested in the variable subset selection phase; the combination of Best-First search and Correlation-based Feature Selection performed comparably to the others and was retained for its other benefits.
Among the five learning schemes investigated, the C4.5 decision tree achieved the best performance on the classification of CVD, followed by Multilayer Feed-forward Network, K-Nearest Neighbor, Logistic Regression, and Naïve Bayes. Cost-sensitive learning, exemplified by the MetaCost algorithm, failed to outperform the single C4.5 decision tree as the cost matrix was varied from 5:1 to 1:7. In contrast, the models developed through ensemble modeling, especially the AdaBoost M1 algorithm, outperformed the other models.
Although the best-performing model might be suitable for CVD screening in the general Canadian population, it is not ready for use in practice. I propose some criteria to improve the further evaluation of the model. Finally, I describe some limitations of the study and propose potential solutions to address them throughout the KDD process. These possibilities should be explored in further research.
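The cost ratios mentioned in the abstract (5:1 to 1:7) can be read through the standard minimum-expected-cost decision rule. The sketch below is only an illustration of that general rule, not the thesis's MetaCost setup: the function names are hypothetical, and reading each ratio as false-negative cost to false-positive cost is an assumption, since the abstract does not state the orientation.

```python
def cost_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which predicting 'positive' has lower expected cost.

    cost_fn: assumed cost of missing a true case (false negative).
    cost_fp: assumed cost of a false alarm (false positive).
    """
    return cost_fp / (cost_fp + cost_fn)


def min_cost_label(p_pos: float, cost_fn: float, cost_fp: float) -> int:
    """Return 1 (positive) or 0 (negative), whichever has lower expected cost."""
    expected_cost_if_neg = p_pos * cost_fn          # a true case would be missed
    expected_cost_if_pos = (1.0 - p_pos) * cost_fp  # a healthy case would be flagged
    return 1 if expected_cost_if_pos <= expected_cost_if_neg else 0


if __name__ == "__main__":
    # Sweep a few ratios in the range named in the abstract (orientation assumed).
    for cost_fn, cost_fp in [(5.0, 1.0), (1.0, 1.0), (1.0, 7.0)]:
        t = cost_threshold(cost_fn, cost_fp)
        print(f"FN:FP = {cost_fn:g}:{cost_fp:g} -> predict positive when p >= {t:.3f}")
```

Under this reading, a higher false-negative cost lowers the probability at which a case is flagged, which is the trade-off a cost matrix sweep explores.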
|
3 |
SISTEMA INTEGRADO DE MONITORAMENTO E CONTROLE DA QUALIDADE DE COMBUSTÍVEL / INTEGRATED SYSTEMS OF TRACKING AND QUALITY CONTROL OF FUEL
Marques, Delano Brandes, 27 February 2004
This work presents studies toward the implementation of an Integrated System that, besides allowing better, more efficient, and more practical monitoring, makes it possible to control and optimize problems related to the oil industry. To guarantee fuel quality and standardization, it is indispensable to develop efficient tools that allow fuel to be monitored from any location and for any type of fuel. Given the variety of criteria, decision making should be based on the evaluation of a wide range of spatial and non-spatial data. To this end, the Knowledge Discovery in Databases process is used, with emphasis on the Data Warehouse and Data Mining steps combined with a Geographic Information System. The system is intended to cover several fuel monitoring regions. Based on a survey and analysis of the different information used in the ANP databases, a Data Warehouse model was proposed. Data Mining techniques (Principal Component Analysis, Clustering Analysis, and Multiple Regression) were then applied to the results in order to obtain knowledge (patterns).
|
4 |
Influence of restraint systems during an automobile crash: prediction of injuries for frontal impact sled tests based on biomechanical data mining / Influence des systèmes de retenue lors d'un accident automobile : Prédiction des blessures de l'occupant lors d'essais catapultés frontaux basées sur le data mining
Cridelich, Carine Caroline, 17 December 2015
Safety is one of the most important considerations when buying a new car. Before a car can be sold in a country, it has to pass the crash tests defined by that country's legislation, which drives the development of restraint systems such as airbags and seat belts. Additionally, ratings such as EURO NCAP and US NCAP provide an independent evaluation of car safety. Frontal sled tests are carried out, among other tests, to confirm the protection level of the vehicle, and the results are mainly based on injury assessment reference values derived from physical parameters measured in dummies.
This doctoral thesis presents an approach for processing the input data (i.e., parameters of the restraint systems defined by experts), followed by a classification of frontal sled tests according to those parameters. The study is based only on data from the passenger side, because the data collected for the driver were not complete enough to produce satisfactory results. The main objective is to create a model that evaluates the influence of the input parameters on injury severity and helps engineers estimate sled test results according to the chosen legislation or rating. The dummy biomechanical values (outputs of the model) were grouped into clusters in order to define injury levels. The model and the various algorithms were implemented in a graphical user interface for practical daily use.
|