11 |
Análise de desempenho dos algoritmos Apriori e Fuzzy Apriori na extração de regras de associação aplicados a um Sistema de Detecção de Intrusos. / Performance analysis of algorithms Apriori and Fuzzy Apriori in association rules mining applied to a System for Intrusion Detection.
Ricardo Ferreira Vieira de Castro 20 February 2014
Association rule mining (ARM) on quantitative data has attracted great research interest in data mining. As databases grow, considerable research effort goes into algorithms that improve the number of rules extracted, their relevance, and computational performance. The APRIORI algorithm, traditionally used to extract association rules, was originally designed for categorical attributes. To use it with continuous (quantitative) attributes, these must generally be discretized, so that each category corresponds to a discrete interval. The most traditional discretization methods produce intervals with sharp boundaries, which may underestimate or overestimate elements near the partition limits and therefore lead to an inaccurate semantic representation. One way to address this problem is to create soft partitions with smoothed boundaries. This work uses a fuzzy partition of the continuous variables, based on fuzzy set theory, which transforms the quantitative attributes into partitions of linguistic terms. Fuzzy association rule mining (FARM) algorithms work on this principle, and this work uses one of them, the FUZZYAPRIORI algorithm. The extracted rules are expressed in linguistic terms, which is more natural and more interpretable for human reasoning. The traditional APRIORI and FUZZYAPRIORI algorithms are compared through the classification results of associative classifiers built on the rules each algorithm extracts. These classifiers were applied to a database of TCP/IP connection records intended for building an Intrusion Detection System.
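The contrast between sharp and fuzzy partitions described above can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the attribute (a connection duration in seconds), the three linguistic terms, the triangular membership functions, the crisp cut points at 5 and 15, and the definition of fuzzy support as the mean membership degree over all records.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms for a connection duration in seconds.
TERMS = {
    "short": (0.0, 0.0, 10.0),
    "medium": (0.0, 10.0, 20.0),
    "long": (10.0, 20.0, 20.0),
}


def fuzzify(x):
    """Map a numeric value to its membership degree in every term."""
    return {term: triangular(x, *params) for term, params in TERMS.items()}


def crisp_label(x):
    """Sharp discretization with assumed cut points at 5 and 15."""
    if x < 5.0:
        return "short"
    if x < 15.0:
        return "medium"
    return "long"


def fuzzy_support(values, term):
    """Mean membership of the values in one term (assumed support definition)."""
    return sum(fuzzify(v)[term] for v in values) / len(values)


# 4.9 and 5.0 are nearly identical, yet the sharp partition separates them,
# while their fuzzy memberships in "short" and "medium" stay close to 0.5.
print(crisp_label(4.9), crisp_label(5.0))
print(fuzzify(4.9))
print(fuzzify(5.0))
```

A fuzzy Apriori variant would count such membership degrees instead of 0/1 occurrences when computing the support of an item; how the dissertation's FUZZYAPRIORI combines memberships across several items is not stated in the abstract.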
|
12 |
Enhancing Fuzzy Associative Rule Mining Approaches for Improving Prediction Accuracy. Integration of Fuzzy Clustering, Apriori and Multiple Support Approaches to Develop an Associative Classification Rule Base
Sowan, Bilal I. January 2011
Building an accurate and reliable prediction model for different application domains is one of the most significant challenges in knowledge discovery and data mining. This thesis focuses on building and enhancing a generic predictive model that estimates a future value by extracting association rules (knowledge) from a quantitative database. The model is applied to several data sets obtained from different benchmark problems, and the results are evaluated through extensive experimental tests.
The thesis presents an incremental, three-stage development process for the prediction model. First, a Knowledge Discovery (KD) model is proposed that integrates Fuzzy C-Means (FCM) with the Apriori approach to extract Fuzzy Association Rules (FARs) from a database, building a Knowledge Base (KB) used to predict a future value. The KD model has been tested with two road-traffic data sets.
Second, the initial model is further developed by including a diversification method, in order to obtain more reliable FARs and to find the best and most representative rules. The resulting Diverse Fuzzy Rule Base (DFRB) maintains high-quality and diverse FARs, offering a more reliable and generic model. The model uses FCM to transform quantitative data into fuzzy data, while a Multiple Support Apriori (MSapriori) algorithm is adapted to extract the FARs from the fuzzy data. The correlation values for these FARs are calculated, and the FARs are filtered efficiently as a post-processing step. FAR diversity is maintained by clustering the FARs, based on the sharing function technique used in multi-objective optimization. The best and most diverse FARs form the DFRB, which is used within the Fuzzy Inference System (FIS) for prediction.
The third stage of development proposes a hybrid prediction model called the Fuzzy Associative Classification Rule Mining (FACRM) model. This model integrates the improved Gustafson-Kessel (G-K) algorithm, the proposed Fuzzy Associative Classification Rules (FACR) algorithm, and the proposed diversification method. The improved G-K algorithm transforms quantitative data into fuzzy data, while the FACR algorithm generates significant rules (Fuzzy Classification Association Rules, FCARs) by employing the improved multiple support threshold, associative classification, and vertical scanning format approaches. These FCARs are then filtered by calculating the correlation value and the distance between them. The advantage of the proposed FACRM model is that it builds a generalized prediction model able to deal with different application domains. The FACRM model is validated using benchmark data sets from the University of California, Irvine (UCI) machine learning repository and the KEEL (Knowledge Extraction based on Evolutionary Learning) repository, and its results are compared with those of other existing prediction models. The experimental results show that the error rate and generalization performance of the proposed model are better than those of commonly used models on the majority of data sets.
A new feature selection method, Weighting Feature Selection (WFS), is also proposed. WFS aims to improve the performance of the FACRM model by minimizing the prediction error and reducing the number of generated rules. The prediction results of FACRM with WFS have been compared with those of the FACRM and Stepwise Regression (SR) models on different data sets. The performance analysis and comparative study show that the proposed prediction model provides an effective approach that can be used within a decision support system. / Applied Science University (ASU) of Jordan
|
13 |
Efficient Frequent Closed Itemset Algorithms With Applications To Stream Mining And Classification
Ranganath, B N 09 1900
Data mining aims to find valid, novel, potentially useful, and ultimately understandable abstractions in data. Frequent itemset mining is one of the important data mining approaches for finding those abstractions in the form of patterns. Frequent closed itemsets provide complete and condensed information for generating non-redundant association rules. For many applications, mining all frequent itemsets is unnecessary and mining frequent closed itemsets is adequate. Compared to frequent itemset mining, frequent closed itemset mining generates fewer itemsets, and therefore improves the efficiency and effectiveness of these tasks.
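To make the difference between frequent and frequent closed itemsets concrete, here is a small brute-force Python sketch. It is only an illustration of the standard definition (an itemset is closed if no proper superset has the same support) and is not the thesis's algorithm; the toy transactions and the names `frequent_itemsets` and `closed_itemsets` are assumptions chosen for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Enumerate every itemset contained in at least min_count transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                result[itemset] = count
    return result


def closed_itemsets(frequent):
    """Keep itemsets with no proper superset of equal support.

    Checking only frequent supersets is enough: a superset with the same
    support as a frequent itemset is itself frequent.
    """
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(
            itemset < other and count == other_count
            for other, other_count in frequent.items()
        )
    }


# Toy data: four transactions over the items a, b, c.
transactions = [
    frozenset("abc"),
    frozenset("abc"),
    frozenset("ab"),
    frozenset("a"),
]
frequent = frequent_itemsets(transactions, min_count=2)
closed = closed_itemsets(frequent)
print(len(frequent), len(closed))  # prints "7 3"
```

On this toy data all seven non-empty itemsets are frequent, but only {a}, {a, b}, and {a, b, c} are closed, and the support of every other frequent itemset can be recovered from its smallest closed superset. The exhaustive enumeration is exponential in the number of items and is meant only to show the definition.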
Recently, much research has been done on closed itemset mining, but mainly for traditional databases, where multiple scans are needed and additional scans must be performed on the updated transaction database whenever new transactions arrive. These methods are therefore not suitable for data stream mining.
Mining frequent itemsets from data streams has many potential and broad applications. Emerging applications of data streams that require association rule mining include network traffic monitoring and web click stream analysis. Unlike data in traditional static databases, data streams typically arrive continuously, at high speed, in huge volumes, and with a changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data.
Recent work on data stream mining based on the sliding window method slides the window by one transaction at a time. When the window size is large and the support threshold is low, however, the existing methods consume significant time and greatly increase user response time.
In our first work, we propose a novel algorithm, Stream-Close, based on the sliding window model, to mine frequent closed itemsets from the data stream within the current sliding window. We enhance the scalability of the algorithm by introducing several optimization techniques, such as sliding the window by multiple transactions at a time and novel pruning techniques that considerably reduce the number of candidate itemsets to be examined for closure checking. Our experimental studies show that the proposed algorithm scales well with large data sets.
Still, in some applications the notion of frequent closed itemsets generates a huge number of closed itemsets. This drawback makes frequent closed itemset mining infeasible in many applications, since users cannot interpret the large volume of output (which is sometimes larger than the data itself when the support threshold is low), and it may create the overhead of developing extra applications that post-process the output of the original algorithm to reduce its size.
Recent work on clustering of itemsets considers strictly either the expression (the items present in an itemset) or the support of the itemsets, or partially both, to reduce the number of itemsets. The drawback of these approaches is that in some situations the number of itemsets does not decrease, because of their restricted view of considering either expressions or support.
We therefore propose a new notion of frequent itemsets, called clustered itemsets, which considers both the expressions and the support of the itemsets when summarizing the output. We introduce a new distance measure with respect to expressions and prove that the problem of mining clustered itemsets is NP-hard.
In our second work, we propose a deterministic locality sensitive hashing based classifier using clustered itemsets. Locality sensitive hashing (LSH) is a technique for efficiently finding a nearest neighbour in high-dimensional data sets. The idea of LSH is to hash the points using several hash functions so that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. We propose an LSH-based approximate nearest neighbour classification strategy. The problem with LSH is that it chooses hash functions randomly, and evaluating a large number of hash functions can increase query time. From a classification point of view, since LSH chooses randomly from a family of hash functions, the buckets may contain points belonging to other classes, which may affect classification accuracy. To overcome these problems, we propose hash functions based on class association rules, which ensure that the buckets corresponding to the class association rules contain points from the same class. However, associative classification involves generating and examining a large number of candidate class association rules, so we use the clustered itemsets, which reduce the number of class association rules to be examined. We also establish a formal connection between the clustering parameter (the delta used in generating clustered frequent itemsets) and discriminative measures such as information gain. Our experimental studies show that the proposed method achieves higher accuracy than the LSH-based near neighbour classification strategy.
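For orientation, the randomized LSH baseline that the abstract contrasts itself with can be sketched in a few lines of Python. This is not the thesis's deterministic, rule-based method. It assumes one common hash family (random hyperplanes, where each bit records on which side of a random hyperplane a point falls), and the data, labels, seeds, and function names are all hypothetical.

```python
import random


def make_hyperplane_hash(dim, n_bits, seed):
    """Build one hash function from n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def signature(point):
        # One bit per hyperplane: 1 if the point lies on its non-negative side.
        return tuple(
            1 if sum(w * x for w, x in zip(plane, point)) >= 0 else 0
            for plane in planes
        )

    return signature


def build_tables(points, labels, hashers):
    """Create one bucket table per hash function."""
    tables = []
    for hasher in hashers:
        table = {}
        for point, label in zip(points, labels):
            table.setdefault(hasher(point), []).append((point, label))
        tables.append(table)
    return tables


def classify(query, tables, hashers):
    """Label of the nearest point among those sharing a bucket with the query."""
    candidates = []
    for hasher, table in zip(hashers, tables):
        candidates.extend(table.get(hasher(query), []))
    if not candidates:
        return None  # no collision in any table

    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(entry[0], query))

    return min(candidates, key=squared_distance)[1]


# Hypothetical two-dimensional training data with two classes.
points = [(1.0, 0.2), (0.9, 0.1), (-1.0, 0.3), (-0.8, -0.2)]
labels = ["normal", "normal", "attack", "attack"]
hashers = [make_hyperplane_hash(dim=2, n_bits=3, seed=s) for s in range(4)]
tables = build_tables(points, labels, hashers)
print(classify((0.95, 0.15), tables, hashers))
```

Because the hyperplanes are random, a bucket can mix points from different classes, and each additional table adds hashing work per query. These are the two weaknesses the abstract says its class association rule based hash functions are meant to address.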
|