1

Development of infant feeding algorithms

Zhu, Yeyi 29 November 2012
Dietary factors in early life (infant feeding practices and the timing of introduction of solid foods) are the most potentially modifiable early-life exposures associated with childhood growth, compared with genetic determinants, co-morbidity, and other environmental influences. Yet studies assessing the association of infant feeding with growth may be limited by out-of-date data and by results that cannot be compared across studies because of inconsistent definitions of infant feeding practices. Mixed feeding (i.e., breast and bottle feeding) calls for special attention given the reality of mothers returning to work after childbirth in the US. This report used data from the National Children’s Study Formative Research in Physical Measurements. A discovery set of 300 participants was selected by ethnicity from the sample available when this report was developed. This report emphasized statistical methods as well as data pre-processing, which are critical but typically under-studied, and is intended to contribute towards closing this gap by describing a study from design through data pre-processing to analysis. Results showed that non-Hispanic Black children had the lowest rates of ever and exclusively breastfeeding, compared to Hispanics and non-Hispanic Whites. Mothers aged 30 years and over, married, and educated above the high school level exclusively breastfed more than other mothers. Mixed feeding was categorized into three and five subgroups according to maternal recall of the extent or frequency of breast/formula feeding, and the subgroups were compared by mean durations of breast/formula feeding. These mixed feeding groups may provide unique opportunities to assess the relationship between mixed feeding versus exclusive breast/formula feeding and childhood linear growth in the author’s dissertation. The proportion of children breastfed for less than versus more than 6 months differed by ethnicity, child’s birthweight, gestational age, maternal age at childbirth, education level, and marital status, which suggests 6 months as a reasonable cut-off for breastfeeding categorization. Children of low birthweight and born preterm were introduced to solid foods later than those of normal/high birthweight and those born at term or postterm, even after adjusting for ethnicity. Analyses on a re-test set will be performed and compared to this discovery set in the author’s dissertation.
2

DIMENSIONALITY REDUCTION FOR DATA DRIVEN PROCESS MODELING

DWIVEDI, SAURABH January 2003
No description available.
3

"Pré-processamento de dados em aprendizado de máquina supervisionado" / "Data pre-processing for supervised machine learning"

Batista, Gustavo Enrique de Almeida Prado Alves 16 May 2003
Data quality is a major concern in Machine Learning, which is frequently used to extract knowledge during the Data Mining phase of the relatively new research area called Knowledge Discovery from Databases - KDD. As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. Several aspects of data quality may influence the performance of a learning system. In real-world databases, two of these aspects are related to (i) the presence of missing data, which is handled in a rather naive way by many Machine Learning algorithms, and (ii) the difference between the number of examples, or database records, that belong to different classes: when this difference is large, learning systems may have difficulty learning the concept related to the minority class. The problem of missing data is of great practical and theoretical interest. In many applications it is important to know how to react if the available information is incomplete or if sources of information become unavailable. Missing data treatment should be carefully planned; otherwise bias might be introduced into the knowledge induced. In this work, we propose the use of the k-nearest neighbour algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal missing data treatment strategies used by C4.5 and CN2, as well as mean or mode imputation, a widely used method for treating missing values. The problem of learning from imbalanced data sets is of crucial importance since such data sets are encountered in a large number of domains. Imbalanced class distributions can cause a significant bottleneck in the performance obtained by standard learning methods, which assume a balanced distribution of the classes. One solution to the problem of learning with skewed class distributions is to artificially balance the data set. In this work we propose the use of the one-sided selection method, which performs a careful removal of cases belonging to the majority class while leaving untouched all cases from the minority class. This careful removal consists of detecting and removing cases considered less reliable, using some heuristics. An experimental application confirmed the efficiency of the proposed method. As there is no mathematical analysis able to predict whether one learning system will perform better than another, experimentation plays an important role in evaluating learning systems. In this work we propose and implement a computational environment, the Discover Learning Environment - DLE - which is a framework to develop and evaluate new data pre-processing methods. The DLE is integrated into the Discover project, a major research project under development in our laboratory for planning and executing experiments related to the use of learning systems during the Data Mining phase of the KDD process.
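For readers who want to experiment with the two pre-processing steps this abstract describes, the following minimal Python sketch shows k-nearest-neighbour imputation and one-sided selection using off-the-shelf implementations (scikit-learn's KNNImputer and imbalanced-learn's OneSidedSelection). The libraries, toy data, and parameters are illustrative assumptions; the thesis implements these methods within its own DLE framework.

```python
# Sketch only: k-NN imputation of missing values, then one-sided selection
# to undersample the majority class. Not the thesis's DLE implementation.
import numpy as np
from sklearn.impute import KNNImputer
from imblearn.under_sampling import OneSidedSelection

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X[2, 1] = np.nan                       # introduce some missing values
X[7, 0] = np.nan
y = np.array([0] * 16 + [1] * 4)       # imbalanced class distribution

# Each missing value is replaced using the k nearest complete examples.
X_imputed = KNNImputer(n_neighbors=3).fit_transform(X)

# One-sided selection removes "less reliable" majority-class cases
# (borderline or noisy under a 1-NN rule) and keeps every minority case.
X_bal, y_bal = OneSidedSelection(random_state=0).fit_resample(X_imputed, y)
print(X_bal.shape, np.bincount(y_bal))
```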
4

Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry

Han, Xi January 2011
Tandem mass spectrometry (MS/MS) has been routinely used to identify peptides from protein mixtures in the field of proteomics. However, only about 30% to 40% of current MS/MS spectra can be identified, while the rest remain unassigned even though many are of reasonable quality. The ubiquitous presence of post-translational modifications (PTMs) is one of the reasons for the current low spectral identification rate. In order to identify post-translationally modified peptides, most existing software requires the specification of a few possible modifications; however, such knowledge is not always available. In this thesis, we describe a new algorithm for identifying modified peptides without requiring users to specify the possible modifications before the search; instead, all modifications from the Unimod database are considered. Several new techniques are employed to avoid the exponential growth of the search space, as well as to control the false discoveries that this unrestricted search approach can introduce. A software tool, PeaksPTM, has been developed, and it has already achieved stronger performance than competing tools for unrestricted identification of post-translationally modified peptides. Another important reason for the failure of search tools is inaccurate mass or charge state measurement of the precursor peptide ion. In this thesis, we study the precursor mono-isotopic mass and charge determination problem and propose an algorithm that corrects precursor ion mass errors by assessing the isotopic features in the parent MS spectrum. The algorithm has been tested on two annotated data sets and achieved almost 100 percent accuracy. Furthermore, we have studied a more complicated problem, MS/MS preprocessing, and propose a spectrum deconvolution algorithm, with experiments comparing its performance against existing software.
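The precursor mass correction idea can be illustrated compactly. The sketch below is not PeaksPTM's algorithm; it only shows the common principle of walking down the isotopic envelope of the precursor in the parent MS spectrum, in m/z steps of roughly 1.00335/z (the approximate spacing between isotope peaks), until the mono-isotopic peak is reached. The function name, tolerance, and toy spectrum are assumptions.

```python
# Illustrative principle only: step from a mis-picked isotope peak down the
# isotopic envelope toward the mono-isotopic peak.
ISOTOPE_SPACING = 1.00335   # approx. mass difference between isotopes (Da)

def correct_precursor_mz(peaks, reported_mz, charge, tol=0.01, max_steps=3):
    """peaks: list of (mz, intensity) pairs from the parent MS scan.
    Returns an estimate of the precursor's mono-isotopic m/z."""
    step = ISOTOPE_SPACING / charge
    mz = reported_mz
    for _ in range(max_steps):
        candidate = mz - step
        # Is there a peak one isotope below the current one?
        if any(abs(p_mz - candidate) <= tol for p_mz, _ in peaks):
            mz = candidate          # keep walking toward the mono-isotope
        else:
            break                   # no earlier isotope found: stop here
    return mz

# Example: the instrument reported the third isotope of a 2+ precursor.
spectrum = [(500.00, 80.0), (500.50, 100.0), (501.00, 60.0)]
print(correct_precursor_mz(spectrum, 501.00, charge=2))   # ~500.0
```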
5

Sociální sítě a dobývání znalostí / Social networks and data mining

Zvirinský, Peter January 2014
Recent data mining methods are modern approaches capable of analyzing large amounts of data and extracting meaningful and potentially useful information from them. In this work, we discuss all the essential steps of the data mining process, including data preparation, storage, cleaning, and analysis, as well as visualization of the obtained results. In particular, this work focuses on the publicly available data from the Insolvency Register of the Czech Republic, which comprises all insolvency proceedings commenced in the Czech Republic after 1 January 2008. With regard to this type of data, several data mining methods have been discussed, implemented, tested, and evaluated. Among others, the studied techniques include Market Basket Analysis, Bayesian networks, and social network analysis. The obtained results reveal several social patterns common in current Czech society.
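One of the techniques named above, social network analysis, is straightforward to sketch. The example below builds a toy debtor-creditor graph with networkx and ranks creditors by degree centrality; the node names and data are invented for illustration and are not drawn from the Insolvency Register.

```python
# Toy social-network-analysis sketch: which creditors recur across
# insolvency proceedings? (Hypothetical data, not the registry's.)
import networkx as nx

edges = [("debtor_1", "bank_a"), ("debtor_1", "utility_b"),
         ("debtor_2", "bank_a"), ("debtor_3", "bank_a"),
         ("debtor_3", "lender_c"), ("debtor_4", "lender_c")]
G = nx.Graph(edges)

# Creditors involved in many proceedings get high degree centrality.
centrality = nx.degree_centrality(G)
top = sorted(centrality.items(), key=lambda kv: -kv[1])[:3]
print(top)
```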
6

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Corell, Simon January 2019
This study investigates whether a recurrent long short-term memory (LSTM) neural network can be used to estimate the battery capacity of electric cars. There is enormous interest in finding the underlying reasons why and how Lithium-ion batteries age, and this study is part of that broader question. The research questions answered here are how well an LSTM model estimates battery capacity, how the LSTM model performs compared to a linear model, and which parameters are important when estimating capacity. Other studies have covered similar topics, but only a few have been performed on real data sets from cars driven in the field. Using a data science approach, it was found that the LSTM model is indeed a powerful model for estimating capacity. It had better accuracy than a linear regression model, although the linear regression model still gave good results. The parameters that appeared to be important when estimating capacity were logically related to the properties of a Lithium-ion battery. / A study of how well a recurrent neural network can estimate the capacity of Lithium-ion batteries in electric vehicles when a data science approach is used.
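A minimal PyTorch sketch of the kind of model the study evaluates is given below: an LSTM that reads a sequence of battery signals and regresses a single capacity value. The input features, layer sizes, and use of the last hidden state are assumptions for illustration, not the study's architecture.

```python
# Sketch: LSTM regressor mapping a signal sequence to a capacity estimate.
import torch
import torch.nn as nn

class CapacityLSTM(nn.Module):
    def __init__(self, n_features=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # single capacity output

    def forward(self, x):                    # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # regress from last time step

model = CapacityLSTM()
batch = torch.randn(8, 50, 3)                # e.g. 8 sequences of 50 steps
print(model(batch).shape)                    # torch.Size([8, 1])
```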
7

Polyphenolanalyse in gartenbaulichen Produkten auf der Basis laser-induzierter Fluoreszenzspektroskopie / Polyphenol analysis in horticultural products based on laser-induced fluorescence spectroscopy

Wulf, Janina Saskia 11 April 2007
During recent years, several research groups have focused on the development of non-destructive product monitoring methods to improve process management for horticultural products across the entire supply chain. Optical methods have been applied for fruit monitoring in production and postharvest processes using mobile measuring systems or NIR sorting lines. The aim of the present study was to quantitatively determine health-promoting native fruit polyphenols by means of laser-induced fluorescence spectroscopy. The variance in the fluorescence signal was measured on apples and carrots stored under different conditions. With the help of principal component analysis, the fluorescence spectra were evaluated to visualize senescence effects during storage. Different data pre-processing methods were tested for a descriptive factor analysis treating the wavelength-dependent intensities as variables. In a complex fruit matrix, however, the quantitative determination of fruit compounds is influenced by the fluorescence quantum yield as well as by reabsorption and quenching effects. The influence of these side effects was studied in phenol standards, fruit extracts, and sliced fruit tissue, and the spectral data were corrected using new data pre-processing methods. Calibration models for the polyphenol analyses were built on the fruit fluorescence spectra (apples, strawberries), using the chromatographic analysis of hydroxycinnamic acids as a reference. The uncertainty of the models was evaluated by their root mean square errors of calibration and cross-validation. The feasibility of non-destructive analysis in practice is affected by the high variability of horticultural products; therefore, the models were validated on an independent test set. The mathematical pre-processing method of direct orthogonal signal correction removed the information in the spectral data that was not relevant to the polyphenol content and resulted in the lowest errors. In comparison, the empirical approach often applied in fluorescence spectroscopy of correcting with simultaneously recorded reflectance spectra did not improve the calibration models.
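The first analysis step described above, principal component analysis of the fluorescence spectra, can be sketched in a few lines; the synthetic data and scikit-learn usage below are assumptions, not the thesis's setup.

```python
# Sketch: PCA score plot coordinates for fluorescence spectra
# (rows = samples, columns = wavelength-dependent intensities).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
spectra = rng.random((40, 200))          # 40 fruit samples, 200 wavelengths
scores = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(spectra))
print(scores.shape)                      # (40, 2): points for a score plot
```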
8

Applications of Knowledge Discovery in Quality Registries - Predicting Recurrence of Breast Cancer and Analyzing Non-compliance with a Clinical Guideline

Razavi, Amir Reza January 2007
In medicine, data are produced from different sources and continuously stored in data repositories. Examples of these growing databases are quality registries. In Sweden, there are many cancer registries where data on cancer patients are gathered and recorded, used mainly for reporting survival analyses to high-level health authorities. In this thesis, a breast cancer quality registry operating in south-east Sweden is used as the data source for newer analytical techniques, i.e. data mining as part of the knowledge discovery in databases (KDD) methodology. Analyses are done to sift through these data in order to find interesting information and hidden knowledge. KDD consists of multiple steps, starting with gathering data from different sources and preparing them in data pre-processing stages prior to data mining. Data were cleaned of outliers and noise, and missing values were handled. Then a proper subset of the data was chosen by canonical correlation analysis (CCA) in a dimensionality reduction step. This technique was chosen because there were multiple outcomes and the variables had complex relationships to one another. After the data were prepared, they were analyzed with a data mining method. Decision tree induction, a simple and efficient method, was used to mine the data. To show the benefits of proper data pre-processing, results from data mining with pre-processing were compared with results from data mining without it. The comparison showed that data pre-processing results in a more compact model with better performance in predicting the recurrence of cancer. An important part of knowledge discovery in medicine is to increase the involvement of medical experts in the process. This starts with enquiry about current problems in their field, which leads to finding areas where computer support can be helpful. The experts can suggest potentially important variables and should then approve and validate new patterns or knowledge as predictive or descriptive models. If it can be shown that the performance of a model is comparable to that of domain experts, it is more probable that the model will be used to support physicians in their daily decision-making. In this thesis, we validated the model by comparing predictions made by data mining with those made by domain experts, without finding any significant difference between them. Breast cancer patients who are treated with mastectomy are recommended to receive radiotherapy. This treatment is called postmastectomy radiotherapy (PMRT), and there is a guideline for prescribing it. A history of this treatment is stored in breast cancer registries. We analyzed these datasets using rules from a clinical guideline and identified cases that had not been treated according to the PMRT guideline. Data mining revealed some patterns of non-compliance with the PMRT guideline, and further analysis revealed some reasons for this non-compliance. These patterns were then compared with reasons acquired from manual inspection of patient records. The comparisons showed that patterns resulting from data mining were limited to the variables stored in the registry; a prerequisite for better results is the availability of comprehensive datasets. Medicine can take advantage of the KDD methodology in different ways, the main advantage being the ability to reuse information and explore hidden knowledge using advanced analysis techniques. The results depend on good collaboration between medical informaticians and domain experts and the availability of high-quality data.
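The pipeline this abstract outlines (CCA for dimensionality reduction over multiple outcomes, followed by decision tree induction) can be sketched as follows; the synthetic data and scikit-learn classes are assumptions, not the thesis's implementation.

```python
# Sketch: reduce registry variables with CCA against multiple outcomes,
# then induce a decision tree on the reduced variables.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))           # registry variables
Y = rng.normal(size=(200, 2))            # multiple outcome variables
recurrence = (Y[:, 0] > 0).astype(int)   # binary label for the tree

X_reduced = CCA(n_components=2).fit(X, Y).transform(X)
tree = DecisionTreeClassifier(max_depth=3).fit(X_reduced, recurrence)
print(round(tree.score(X_reduced, recurrence), 2))
```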
