321 |
Business Intelligence: Lösungen im Überblick [Business Intelligence: An Overview of Solutions]. Eggert, Sandy; Meier, Juliane. January 2010 (has links)
No description available.
|
322 |
On Data Mining and Classification Using a Bayesian Confidence Propagation Neural Network. Orre, Roland. January 2003 (has links)
The aim of this thesis is to describe how a statistically based neural network technology, here named BCPNN (Bayesian Confidence Propagation Neural Network), which may be identified by rewriting Bayes' rule, can be used within a few applications: data mining and classification with credibility intervals, as well as unsupervised pattern recognition. BCPNN is a neural network model somewhat reminiscent of the Bayesian decision trees often used within artificial intelligence systems. It has previously been successfully applied to classification tasks such as fault diagnosis, supervised pattern recognition and hierarchical clustering, and has also been used as a model for cortical memory. The learning paradigm used in BCPNN is rather different from that of many other neural network architectures. Learning in, for example, the popular backpropagation (BP) network is a gradient method on an error surface, whereas learning in BCPNN is based upon calculations of marginal and joint probabilities between attributes. This is a quite time-efficient process compared to, for instance, gradient learning. The interpretation of the weight values in BCPNN is also easy compared to many other network architectures. The values of these weights and their uncertainty are also what we focus on in our data mining application. The most important results and findings in this thesis can be summarised in the following points:
- We demonstrate how BCPNN can be extended to model the uncertainties in collected statistics to produce outcomes as distributions, from two different aspects: uncertainties induced by sparse sampling, which is useful for data mining, and uncertainties due to input data distributions, which is useful for process modelling.
- We indicate how classification with BCPNN gives higher certainty than an optimal Bayes classifier and better precision than a naïve Bayes classifier for limited data sets.
- We show how these techniques have been turned into a useful tool for real-world applications, within the drug safety area in particular.
- We present a simple but working method for automatic temporal segmentation of data sequences, and indicate some aspects of temporal tasks for which a Bayesian neural network may be useful.
- We present a method, based on recurrent BCPNN, which performs a task similar to that of an unsupervised clustering method on a large database with noisy, incomplete data, but much more quickly, with an efficiency in finding patterns comparable to a well-known Bayesian clustering method (AutoClass) when their performance is compared on artificial data sets. Apart from BCPNN being able to deal with really large data sets, because it is a global method working on collective statistics, we also get good indications that the outcome from BCPNN has higher clinical relevance than AutoClass in our application on the WHO database of adverse drug reactions, and that it is therefore a relevant data mining tool to use on that database.
Keywords: artificial neural network, Bayesian neural network, data mining, adverse drug reaction signalling, classification, learning.
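As a rough illustration of this learning rule (a sketch assuming the common BCPNN weight formulation, not code from the thesis), weights can be estimated from marginal and joint occurrence counts as the log ratio of joint to product-of-marginal probabilities:

```python
import numpy as np

def bcpnn_weights(X, Y, alpha=1.0):
    """Estimate BCPNN-style weights from a binary input matrix X (samples x inputs)
    and a binary output matrix Y (samples x outputs).

    Weights follow w_ij = log( P(x_i, y_j) / (P(x_i) * P(y_j)) ), estimated from
    counts with a small prior `alpha` for smoothing (the smoothing scheme here is
    illustrative, not the thesis' exact choice).
    """
    n = X.shape[0]
    p_x = (X.sum(axis=0) + alpha) / (n + 2 * alpha)   # marginal P(x_i)
    p_y = (Y.sum(axis=0) + alpha) / (n + 2 * alpha)   # marginal P(y_j)
    p_xy = (X.T @ Y + alpha) / (n + 4 * alpha)        # joint P(x_i, y_j)
    return np.log(p_xy / np.outer(p_x, p_y))          # weight matrix (inputs x outputs)

# Toy example: 5 samples, 3 binary input attributes, 2 binary class units.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]])
Y = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]])
print(bcpnn_weights(X, Y))
```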
|
323 |
Aggregation and Privacy in Multi-Relational Databases. Jafer, Yasser. 11 April 2012 (has links)
Most existing data mining approaches perform data mining tasks on a single data table. Increasingly, however, data repositories such as financial data and medical records are stored in relational databases. The inability to apply traditional data mining techniques directly to such relational databases thus poses a serious challenge. To address this issue, a number of researchers convert a relational database into one or more flat files and then apply traditional data mining algorithms. This transformation of a relational database into one or more flat files usually involves aggregation. Aggregation functions such as maximum, minimum, average, standard deviation, count and sum are commonly used in such a flattening process.
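As a concrete, hypothetical illustration of this flattening step (the table and column names are invented, and pandas is used only as an example tool), a one-to-many transactions table can be aggregated into one flat row per entity:

```python
import pandas as pd

# Hypothetical one-to-many relation: several transactions per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount":      [120.0, 45.5, 80.0, 300.0, 12.5, 55.0],
})

# Flatten into one row per customer using the common aggregation functions.
flat = transactions.groupby("customer_id")["amount"].agg(
    ["max", "min", "mean", "std", "count", "sum"]
).reset_index()

print(flat)  # one flat row per customer, ready to be joined to the target table for mining
```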
Our research aims to address the following question: Is there a link between aggregation and possible privacy violations during relational database mining? In this research we investigate how, and if, applying aggregation functions affects the privacy of a relational database during supervised learning, or classification, where the target concept is known. To this end, we introduce the PBIRD (Privacy Breach Investigation in Relational Databases) methodology. PBIRD combines multi-view learning with feature selection to discover potentially dangerous sets of features hidden within a database. Our approach creates a number of views, consisting of subsets of the data, with and without aggregation. Then, by identifying and investigating the set of selected features in each view, potential privacy breaches are detected. In this way, our PBIRD algorithm is able to discover those features that are correlated with the classification target and that may also lead to the disclosure of sensitive information in the database.
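The following sketch illustrates the general multi-view idea of running feature selection on views with and without aggregation and flagging the union of selected features; it is not the authors' PBIRD implementation, and the feature scorer (mutual information), the view construction, and the data are all placeholders:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_features(X, y, names, k=3):
    """Rank the features of one view by mutual information with the target."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1][:k]
    return {names[i] for i in order}

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                      # classification target (placeholder)

# Two views of the same database: raw features and aggregated features.
X_raw = rng.normal(size=(200, 5))
X_agg = np.column_stack([X_raw.mean(axis=1), X_raw.max(axis=1), rng.normal(size=200)])

raw_names = [f"raw_{i}" for i in range(5)]
agg_names = ["mean_amount", "max_amount", "noise"]

# Union of features selected in each view: candidates for a privacy review.
flagged = top_features(X_raw, y, raw_names) | top_features(X_agg, y, agg_names)
print("Features to review for potential privacy breaches:", flagged)
```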
Our experimental results show that aggregation functions do, indeed, change the correlation between attributes and the classification target. We show that with aggregation, we obtain a set of features which can be accurately linked to the classification target and used to predict the confidential information with high accuracy. Without aggregation, in turn, we obtain a different set of potentially harmful features. By identifying the complete set of potentially dangerous attributes, the PBIRD methodology makes it possible to warn database designers and owners so that they can perform the adjustments necessary to protect the privacy of the relational database.
In our research, we also perform a comparative study to investigate the impact of aggregation on classification accuracy and on the time required to build the models. Our results suggest that when a database consists only of categorical data, aggregation should be used with particular caution, because it decreases the overall accuracy of the resulting models. When the database contains mixed attributes, the results show that the accuracies with and without aggregation are comparable; even in such scenarios, however, schemas without aggregation tend to perform slightly better. With regard to model building time, the results show that, in general, the models constructed with aggregation require shorter building time. However, when the database is small and consists of nominal attributes with high cardinality, aggregation leads to slower model building.
|
324 |
Microarray analysis using pattern discovery. Bainbridge, Matthew Neil. 10 December 2004
Analysis of gene expression microarray data has traditionally been conducted using hierarchical clustering. However, such analysis has many known disadvantages, and pattern discovery (PD) has been proposed as an alternative technique. In this work, three similar but distinct PD algorithms, Teiresias, Splash and Genes@Work, were benchmarked for time and memory efficiency on a small yeast cell-cycle data set. Teiresias was found to be the fastest and best overall program; however, Splash was more memory efficient. This work also investigated the performance of four methods of discretizing microarray data: sign-of-the-derivative, K-means, pre-set value, and Genes@Work stratification. The first three methods were evaluated on their predisposition to group together biologically related genes. On a yeast cell-cycle data set, the sign-of-the-derivative method yielded the most biologically significant patterns, followed by the pre-set value and K-means methods. K-means, pre-set value, and Genes@Work were also compared on their ability to classify tissue samples from diffuse large B-cell lymphoma (DLBCL) into two subtypes determined by standard techniques. The Genes@Work stratification method produced the best patterns for discriminating between the two subtypes of lymphoma. However, the results from the second-best method, K-means, call into question the accuracy of the classification by the standard technique. Finally, a number of recommendations for improvement of pattern discovery algorithms and discretization techniques are made.
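As an illustration of the simplest of these discretization schemes, the following sketch (a toy example constructed here, not code from the thesis) discretizes expression time series by the sign of the change between consecutive time points:

```python
import numpy as np

def sign_of_derivative(expr):
    """Discretize each gene's expression time series (genes x timepoints) into
    U (up), D (down) or S (same) by the sign of consecutive differences."""
    diffs = np.diff(expr, axis=1)
    symbols = np.where(diffs > 0, "U", np.where(diffs < 0, "D", "S"))
    return ["".join(row) for row in symbols]

# Toy expression matrix: 3 genes measured at 5 time points.
expr = np.array([
    [0.1, 0.5, 0.4, 0.4, 0.9],
    [1.2, 1.0, 0.8, 0.8, 0.7],
    [0.3, 0.6, 0.9, 0.5, 0.2],
])
print(sign_of_derivative(expr))   # ['UDSU', 'DDSD', 'UUDD']
```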
|
325 |
Prediction of Protein-protein Interactions and Essential Genes through Data Integration. Kotlyar, Max. 31 August 2011 (has links)
The currently known network of human protein-protein interactions (PPIs) is providing new insights into diseases and helping to identify potential therapies. However, according to several estimates, the known interaction network may represent only 10% of the entire interactome - indicating that more comprehensive knowledge of the interactome could have a major impact on understanding and treating diseases. The primary aim of this thesis was to develop computational methods to provide increased coverage of the interactome. A secondary aim was to gain a better understanding of the link between networks and phenotype, by analyzing essential mouse genes.
Two algorithms were developed to predict PPIs and provide increased coverage of the interactome: FpClass and mixed co-expression. FpClass differs from previous PPI prediction methods in two key ways: it integrates both positive and negative evidence for protein interactions, and it identifies synergies between predictive features. Through these approaches FpClass provides interaction networks with significantly improved reliability and interactome coverage. Compared to previously predicted human PPI networks, FpClass provides a network with over 10 times more interactions, about 2 times more proteins and a lower false discovery rate. This network includes 595 disease-related proteins from OMIM and the Cancer Gene Census which have no previously known interactions. The second method, mixed co-expression, aims to predict transient PPIs, which have proven difficult to detect by computational and experimental methods. Mixed co-expression makes predictions using gene co-expression and performs significantly better (p < 0.05) than the previous method for predicting PPIs from co-expression. It is especially effective for identifying interactions of transferases and signal transduction proteins.
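To illustrate the kind of co-expression signal such predictors build on, the following sketch scores candidate protein pairs by the Pearson correlation of their genes' expression profiles; this is a generic baseline constructed for illustration, not the mixed co-expression method itself, and the data are synthetic:

```python
import numpy as np

def coexpression_scores(expr, pairs):
    """Score candidate protein pairs by Pearson correlation of the
    corresponding genes' expression profiles (genes x samples)."""
    corr = np.corrcoef(expr)                     # gene-by-gene correlation matrix
    return {(i, j): corr[i, j] for i, j in pairs}

rng = np.random.default_rng(1)
expr = rng.normal(size=(4, 30))                  # 4 genes, 30 expression samples
expr[1] = expr[0] + 0.1 * rng.normal(size=30)    # make genes 0 and 1 strongly co-expressed

candidates = [(0, 1), (0, 2), (2, 3)]
for pair, score in coexpression_scores(expr, candidates).items():
    print(pair, round(float(score), 3))          # a high score suggests a possible interaction
```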
For the second aim of the thesis, we investigated the relationship between gene essentiality and diverse gene/protein features based on gene expression, PPI and gene co-expression networks, gene/protein sequence, Gene Ontology, and orthology. We identified non-redundant features closely associated with essentiality, including centrality in PPI and gene co-expression networks. We found that no single predictive feature was effective for all essential genes; most features, including centrality, were less effective for genes associated with postnatal lethality and infertility. These results suggest that understanding phenotype will require integrating measures of network topology with information about the biology of the network’s nodes and edges.
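As a small generic illustration of the network centrality features mentioned above (a toy network invented here, not the thesis data), proteins can be ranked by degree centrality in a PPI graph using networkx:

```python
import networkx as nx

# Toy PPI network; hub-like proteins score higher on degree centrality,
# one of the features the thesis associates with gene essentiality.
ppi = nx.Graph([
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),   # "A" is a hub
    ("B", "C"), ("D", "F"), ("E", "F"),
])

centrality = nx.degree_centrality(ppi)
for protein, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(protein, round(score, 2))
```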
|
328 |
A computational environment for mining association rules and frequent item sets. Hahsler, Michael; Grün, Bettina; Hornik, Kurt. January 2005 (has links) (PDF)
Mining frequent itemsets and association rules is a popular and well researched approach to discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
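For readers unfamiliar with frequent itemset mining, the following minimal pure-Python Apriori sketch illustrates the concept; it is not the arules API nor Borgelt's implementation, and the example transactions are invented:

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return all frequent itemsets (as frozensets) with their support."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}        # start with 1-itemset candidates
    frequent = {}
    while current:
        # Count support of the current candidates and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Generate next-level candidates by joining surviving itemsets.
        keys = list(survivors)
        size = len(next(iter(keys))) + 1 if keys else 0
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
    return frequent

baskets = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "butter"}, {"milk", "butter"}, {"milk", "bread", "butter"},
)]
for itemset, support in sorted(apriori(baskets, 0.6).items(), key=lambda kv: -kv[1]):
    print(set(itemset), support)
```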
|
329 |
Clustering Lab value working with medical data. Davari, Mahtab. January 2007 (has links)
Data mining is a relatively new field of research whose objective is to acquire knowledge from large amounts of data. In medical and health care areas, due to regulations and to the availability of computers, a large amount of data is becoming available [27]. On the one hand, practitioners are expected to use all this data in their work; at the same time, such a large amount of data cannot be processed by humans in a short time to make diagnoses, prognoses and treatment schedules. A major objective of this thesis is to evaluate data mining tools in medical and health care applications in order to develop a tool that can help make rather accurate decisions. The goal of this thesis is to find a pattern among patients who contracted pneumonia by clustering lab values that have been recorded daily. Using this pattern, we can generalize to patients who have not been diagnosed with the disease but whose lab values show the same trend as those of pneumonia patients. Ten tables were extracted for this work from a large database of a hospital in Jena. In the ICU (intensive care unit), the COPRA system, a patient management system, is used. All tables and data are stored in a German-language database.
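As a concrete sketch of the clustering step (with invented lab-value columns rather than the Jena/COPRA data), daily lab values can be standardized and grouped with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-patient lab features, e.g. mean daily CRP, leukocyte count, temperature.
rng = np.random.default_rng(0)
lab_values = np.vstack([
    rng.normal([120.0, 15.0, 38.5], [20.0, 3.0, 0.5], size=(20, 3)),  # pneumonia-like trend
    rng.normal([5.0, 7.0, 36.8],    [2.0, 1.5, 0.3],  size=(20, 3)),  # unremarkable trend
])

X = StandardScaler().fit_transform(lab_values)           # lab values live on very different scales
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                             # cluster membership per patient
```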
|
330 |
Unsupervised learning to cluster the disease stages in Parkinson's disease. Srinivasan, BadriNarayanan. January 2011 (has links)
Parkinson's disease (PD) is the second most common neurodegenerative disorder (after Alzheimer's disease) and directly affects up to 5 million people worldwide. The stages of the disease (Hoehn and Yahr) have been predicted by many methods, which helps doctors adjust the dosage accordingly. These methods were developed on a data set covering about seventy patients at nine clinics in Sweden. The purpose of this work is to compare an unsupervised technique with supervised neural network techniques in order to make sure the collected data sets are reliable for decision making. The available data were preprocessed before their features were calculated. Wavelets, a complex but efficient type of feature, were calculated to present the data set to the network. The dimension of the final feature set was reduced using principal component analysis. For unsupervised learning, k-means gives the closest result, around 76%, when compared with the supervised techniques. Backpropagation and J4 were used as supervised models to classify the stages of Parkinson's disease, where backpropagation gives a variance percentage of 76-82%. The results of both models have been analyzed, showing that the collected data are reliable for predicting the disease stages in Parkinson's disease.
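A minimal sketch of the processing chain described above, i.e. dimensionality reduction with PCA followed by k-means clustering; the feature vectors here are synthetic stand-ins for the wavelet features computed from the patient data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-ins for wavelet feature vectors of 70 patients (50 features each).
features = rng.normal(size=(70, 50))

reduced = PCA(n_components=5).fit_transform(features)         # principal component analysis
stages = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
print(stages)                                                  # predicted stage cluster per patient
```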
|