  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

A Set-Checking Algorithm for Mining Maximal Frequent Itemsets from Data Streams

Lin, Pei-Ying 15 July 2011 (has links)
Online mining of maximal frequent itemsets over data streams is an important problem in data mining. A maximal frequent itemset is an itemset whose support is greater than or equal to the minimum support and which is not a subset of any other frequent itemset. Previous algorithms for mining maximal frequent itemsets in traditional databases are not suitable for data streams. Because data streams are (1) continuous, (2) fast, (3) unbounded, (4) real-time, and (5) scanned only once, mining them poses new challenges. First, it is unrealistic to keep the entire stream in main memory or even in secondary storage, since a data stream arrives continuously and the amount of data is unbounded. Second, traditional methods that mine stored datasets with multiple scans are infeasible, since streaming data passes only once. Third, mining streams requires fast, real-time processing to keep up with the high data arrival rate, and mining results are expected within a short response time. To mine maximal frequent itemsets from data streams under the landmark window model, Mao et al. proposed the INSTANT algorithm. In the landmark window model, knowledge discovery is performed on the values between a starting time point and the present. The advantage of the landmark window model is that its results are correct as compared with the other models. The structure of the INSTANT algorithm is simple and saves memory space, but it takes a long time to mine the maximal frequent itemsets: when a new transaction arrives, the INSTANT algorithm performs too many comparisons against the old transactions. In this thesis, we propose the Set-Checking algorithm to mine maximal frequent itemsets from data streams using the landmark window model. We use a lattice structure to store our information.
The lattice records the subset relationship between each child node and its parent node. Each node records an itemset and its support. When a new transaction arrives, we consider five relations between it and the stored itemsets: (1) equivalent, (2) superset, (3) subset, (4) intersection, and (5) empty. Based on these five relations in the lattice structure, we can add the transaction and update the supports efficiently. Our simulation results show that the processing time of the Set-Checking algorithm is shorter than that of the INSTANT algorithm.
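The five relations named in the abstract can be illustrated with ordinary set comparisons. The following is a minimal sketch, not the thesis's implementation: the function name `relation` and the convention that the relation is stated from the stored node's point of view are assumptions made here for illustration.

```python
def relation(node_items, transaction):
    """Classify how a stored itemset relates to an incoming transaction.

    Hypothetical helper: the relation is reported from the point of view
    of the stored node (e.g. "superset" means the node contains the
    whole transaction).
    """
    a, b = frozenset(node_items), frozenset(transaction)
    if a == b:
        return "equivalent"
    if a > b:            # node strictly contains the transaction
        return "superset"
    if a < b:            # node is strictly contained in the transaction
        return "subset"
    if a & b:            # they share some items, but neither contains the other
        return "intersection"
    return "empty"       # no item in common
```

For example, a node holding {a, b, c} is a superset of the transaction {a}, while a node holding {a, b} and the transaction {b, c} only intersect.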

A Subset-Lattice Algorithm for Mining Maximal Frequent Itemsets over a Data Stream Sliding Window

Wang, Syuan-Yun 09 July 2012 (has links)
Online mining of association rules in data streams is an important field in data mining, and within it, mining maximal frequent itemsets is an important issue. A frequent itemset is called maximal if it is not a subset of any other frequent itemset. Because data streams are continuous, high-speed, unbounded, and real-time, we can scan the data only once. Therefore, previous algorithms for mining maximal frequent itemsets in traditional databases are not suitable for data streams. Furthermore, many applications are interested in the most recent data, and the sliding window model, which requires a window size, deals with the most recent part of the stream. One of the algorithms for mining maximal frequent itemsets under the sliding window model is the MFIoSSW algorithm. The MFIoSSW algorithm uses a compact, array-based structure A to store the maximal frequent itemsets and other helpful itemsets. However, it takes a long time to mine the maximal frequent itemsets: when a new transaction arrives, too many comparisons are made between the new transaction and the old transactions. Therefore, in this project, we propose a sliding window approach, the Subset-Lattice algorithm. We use a lattice structure to store the information of the transactions. The lattice stores the relationship between each child node and its parent node, and each node records an itemset and its support. When a new transaction arrives, we consider five relations: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. With these five relations, we can add new transactions and update the supports efficiently.
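To make the definitions concrete, the sketch below recomputes the maximal frequent itemsets of a sliding window by brute force each time a transaction arrives. It is a reference for what the result should be on tiny inputs, not the Subset-Lattice or MFIoSSW algorithm; the names `SlidingWindow`, `frequent_itemsets`, and `maximal`, and the use of an absolute count threshold, are assumptions for illustration.

```python
from collections import deque
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Enumerate every itemset contained in at least min_count transactions."""
    items = sorted(set().union(*transactions)) if transactions else []
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
                found = True
        if not found:
            # No frequent k-itemset means no larger frequent itemset exists.
            break
    return result


def maximal(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


class SlidingWindow:
    """Fixed-size window: the oldest transaction drops out when full."""

    def __init__(self, size, min_count):
        self.window = deque(maxlen=size)
        self.min_count = min_count

    def add(self, transaction):
        self.window.append(frozenset(transaction))
        return maximal(frequent_itemsets(list(self.window), self.min_count))
```

The cost of this brute-force recomputation grows exponentially with the number of distinct items, which is the kind of overhead an incremental structure is meant to avoid.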

An Efficient Subset-Lattice Algorithm for Mining Closed Frequent Itemsets in Data Streams

Peng, Wei-hau 25 June 2009 (has links)
Online mining of association rules over data streams is an important issue in data mining, where an association rule means that the presence of some items in a transaction implies the presence of other items in the same transaction. Association rules over data streams have many applications, such as market analysis, network security, sensor networks, and web tracking. Mining closed frequent itemsets extends association rule mining: it aims to find a subset of the frequent itemsets from which all frequent itemsets can be derived. Formally, a closed frequent itemset is a frequent itemset that has no superset with the same support. Since data streams are continuous, high-speed, and unbounded, archiving everything from a data stream is impossible. That is, we can scan the stream only once, and the data must be processed in main memory. Therefore, previous algorithms for mining closed frequent itemsets in traditional databases are not suitable for data streams. On the other hand, many applications are interested in the most recent data, and the sliding window model, which retains the recent data within a given window size, meets this need. One well-known algorithm for mining closed frequent itemsets under the sliding window model is the NewMoment algorithm. However, the NewMoment algorithm cannot mine closed frequent itemsets in data streams efficiently, since it generates many unclosed frequent itemsets along with the closed ones. Moreover, when the data in the sliding window is incrementally updated, the NewMoment algorithm needs to reconstruct the whole tree structure. Therefore, in this thesis, we propose a sliding window approach, the Subset-Lattice algorithm, which embeds the subset property into the lattice structure to mine closed frequent itemsets efficiently.
Our proposed algorithm considers five kinds of set relations when data items are inserted: (1) equivalent, (2) superset, (3) subset, (4) intersection, and (5) empty. With these five relations, we identify closed frequent itemsets without generating unclosed frequent itemsets. Moreover, when the data in the sliding window is incrementally updated, our Subset-Lattice algorithm does not reconstruct the whole lattice structure. Therefore, our Subset-Lattice algorithm is more efficient than the NewMoment algorithm. Furthermore, we represent itemsets as bit patterns and use bit operations to speed up set checking. Our simulation results show that the Subset-Lattice algorithm needs less memory and less processing time than the NewMoment algorithm. When the window slides, up to 50% of the execution time can be saved.
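The bit-pattern idea mentioned above can be sketched as follows: each item is assigned a bit position, an itemset becomes an integer mask, and the five set relations reduce to a single AND plus comparisons. This is an illustrative sketch only; the function names `encode` and `relation_bits` and the item-to-bit mapping `index` are assumptions, and the thesis's actual encoding may differ.

```python
def encode(itemset, index):
    """Turn an itemset into an integer bit mask using a fixed item -> bit map."""
    mask = 0
    for item in itemset:
        mask |= 1 << index[item]
    return mask


def relation_bits(a, b):
    """Classify two bit-mask itemsets, from the point of view of a."""
    common = a & b
    if a == b:
        return "equivalent"
    if common == b:      # every bit of b is also set in a
        return "superset"
    if common == a:      # every bit of a is also set in b
        return "subset"
    if common:           # some shared bits, neither contains the other
        return "intersection"
    return "empty"
```

With `index = {"a": 0, "b": 1, "c": 2}`, the itemset {a, c} encodes to 0b101, so a containment test that would otherwise walk both sets becomes one machine-level AND.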

Fouille de représentations concises des motifs fréquents à travers les espaces de recherche conjonctif et disjonctif

Hamrouni, Tarek 04 August 2009 (has links) (PDF)
In recent years, the quantities of data collected in various application domains of computer science have grown ever larger. This creates a need to analyze and interpret the data in order to extract useful knowledge. In this setting, Knowledge Discovery in Databases is a complete process that aims to extract hidden, new, and potentially useful knowledge from large volumes of data. Among its steps, data mining offers the tools and techniques for such extraction. Much research in data mining concerns the discovery of association rules, which identify links between sets of descriptors (or attributes, or items) describing a set of objects (or individuals, or transactions). Association rules have proved useful in several application domains, such as customer relationship management in retail (market basket analysis to determine which products are often bought together, and to arrange shelves and organize promotions accordingly) and molecular biology (analysis of associations between genes). In general, association rules are built in two steps: extracting the frequent sets of items (itemsets), then generating association rules from these frequent itemsets. In practice, the number of patterns (frequent itemsets or association rules) extracted or generated can be very high, which makes it hard for users to exploit them effectively. To address this problem, some research proposes using a core of patterns, called concise representations, from which the redundant patterns can be regenerated.
The goal of such representations is to condense the extracted patterns while preserving as much as possible of the hidden and interesting information about the data. Many concise representations of frequent patterns have been proposed in the literature, mainly exploring the conjunctive search space. In this space, itemsets are characterized by the frequency of their co-occurrence. This is the subject of the first part of this work. A detailed study proposed in this thesis shows that closed itemsets and minimal generators are a means of concisely representing frequent itemsets and association rules. Closed itemsets structure the search space into equivalence classes such that each class groups the itemsets appearing in the same subset of the data (also called objects or transactions). A closed itemset contains the most specific expression describing the associated transactions, whereas a minimal generator contains one of the most general expressions. However, an intra-class combinatorial redundancy follows logically from the inherent absence of a unique minimal generator associated with a given closed itemset. This motivated us to carry out an in-depth study aimed at retaining only the irreducible minimal generators in each equivalence class and pruning the others. To this end, a lossless reduction of the set of minimal generators is proposed, through a new substitution-based process. A complete study of the properties of the resulting families is presented. The theoretical results are then extended to the association rule framework in order to reduce as much as possible the number of retained rules without information loss. A complete formal study is then presented of the inference mechanism that derives all redundant association rules from the retained ones.
To validate the proposed approach, the algorithms for building these concise representations of patterns are presented, together with the results of experiments measuring conciseness and computation time. The second part of this work is devoted to a complete exploration of the disjunctive search space of itemsets, where itemsets are characterized by their disjunctive supports. In the disjunctive space, an itemset is satisfied by a transaction if at least one of its items is present in it. Disjunctive itemsets thus convey knowledge about complementary occurrences of items in a dataset. This exploration is motivated by the fact that, in some applications, such information can be useful to users. When analyzing a genetic sequence, for example, producing information such as "presence of gene X or presence of gene Y or ..." is of interest to the biologist. To obtain a concise representation of the disjunctive search space, an interesting solution is to choose a single element to represent the itemsets covering the same set of data. Two itemsets are equivalent if their respective items cover the same set of data. To this end, a new operator dedicated to this task was introduced. In each induced equivalence class, the minimal elements are called essential itemsets, while the largest element is called the disjunctive closed itemset. The operator is then the basis of new concise representations of frequent itemsets. The disjunctive search space is then exploited to derive generalized association rules. These rules generalize classical rules by also offering disjunction and negation connectors over items, in addition to the conjunctive one.
Dedicated tools (an algorithm and a program) were then designed and implemented to extract disjunctive itemsets and generalized association rules. The experimental results showed the usefulness of our exploration and highlighted the conciseness of the proposed representations.
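The difference between the conjunctive and the disjunctive search space comes down to how support is counted. The sketch below contrasts the two under the definitions given in the abstract (a transaction counts conjunctively if it contains all items, disjunctively if it contains at least one); the function names are assumptions for illustration and are not taken from the thesis.

```python
def conjunctive_support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def disjunctive_support(itemset, transactions):
    """Number of transactions containing at least one item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted & set(t))
```

On the four transactions {x, y}, {x}, {y, z}, {z}, the itemset {x, y} has conjunctive support 1 but disjunctive support 3, which is the "gene X or gene Y" style of information the abstract refers to.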

Fouille de représentations concises des motifs fréquents à travers les espaces de recherche conjonctif et disjonctif / Mining concise representations of frequent patterns through conjunctive and disjunctive search spaces

Hamrouni, Tarek 04 August 2009 (has links)
Recent years have witnessed explosive progress in networking, storage, and processing technologies, resulting in an unprecedented amount of digitized data. There is hence a considerable need for tools and techniques to delve into large databases and efficiently discover valuable, non-obvious information. In this situation, Knowledge Discovery in Databases offers a complete process for the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data. Amongst its steps, data mining offers tools and techniques for such an extraction. Much research in data mining from large databases has focused on the discovery of association rules, which are used to identify relationships between sets of items in a database. The discovered association rules can be used in various tasks, such as depicting purchase dependencies, classification, and medical data analysis. In practice, however, the number of frequently occurring itemsets, used as a basis for rule derivation, is very large, hampering their effective exploitation by end-users. In this situation, a determined effort has focused on defining manageably sized sets of patterns, called concise representations, from which redundant patterns can be regenerated. The purpose of such representations is to reduce the number of mined patterns to make them manageable by end-users while preserving as much as possible of the hidden and interesting information about the data. Many concise representations for frequent patterns have so far been proposed in the literature, mainly exploring the conjunctive search space. In this space, itemsets are characterized by the frequency of their co-occurrence. A detailed study proposed in this thesis shows that closed itemsets and minimal generators play a key role in concisely representing both frequent itemsets and association rules.
These itemsets structure the search space into equivalence classes such that each class gathers the itemsets appearing in the same subset (aka objects or transactions) of the given data. A closed itemset includes the most specific expression describing the associated transactions, while a minimal generator includes one of the most general expressions. However, an intra-class combinatorial redundancy logically results from the inherent absence of a unique minimal generator associated with a given closed itemset. This motivated us to carry out an in-depth study aiming at retaining only the irreducible minimal generators in each equivalence class and pruning the remaining ones. In this respect, we propose lossless reductions of the minimal generator set thanks to a new substitution-based process. We then carry out a thorough study of the associated properties of the obtained families. Our theoretical results are then extended to the association rule framework in order to reduce as much as possible the number of retained rules without information loss. We then give a thorough formal study of the related inference mechanism, which allows all redundant association rules to be derived from the retained ones. In order to validate our approach, computing means for the new pattern families are presented together with empirical evidence about their relative sizes with respect to the entire sets of patterns. We also carry out a thorough exploration of the disjunctive search space, where itemsets are characterized by their respective disjunctive supports instead of the conjunctive ones. Thus, an itemset verifies a portion of data if at least one of its items belongs to it. Disjunctive itemsets thus convey knowledge about complementary occurrences of items in a dataset. This exploration is motivated by the fact that, in some applications, such information, conveyed through disjunctive support, brings richer knowledge to end-users.
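The notions of equivalence class, closed itemset, and minimal generator can be illustrated on a toy dataset. The sketch below uses the standard definitions (the closure of an itemset is the intersection of all transactions containing it; a minimal generator has no proper subset with the same cover); the function names are assumptions, and the substitution-based reduction proposed in the thesis is not reproduced here.

```python
def cover(itemset, transactions):
    """Indices of the transactions that contain every item of the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def closure(itemset, transactions):
    """Largest itemset with the same cover: intersection of covering transactions."""
    containing = [set(t) for t in transactions if itemset <= t]
    if not containing:
        # Convention assumed here: an itemset that never occurs closes to all items.
        return frozenset().union(*transactions)
    return frozenset(set.intersection(*containing))


def is_minimal_generator(itemset, transactions):
    """True if removing any single item strictly enlarges the cover.

    Checking one-item removals suffices because covers can only grow
    when items are removed.
    """
    own = cover(itemset, transactions)
    return all(cover(itemset - {i}, transactions) != own for i in itemset)
```

With the transactions {a, b, c}, {a, b, c}, {a, d}, {b, d}, both {c} and {a, b} are minimal generators of the same closed itemset {a, b, c}, which shows why a class need not have a unique minimal generator.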

An Efficient Parameter-Relationship-Based Approach for Projected Clustering

Huang, Tsun-Kuei 16 June 2008 (has links)
The clustering problem has been discussed extensively in the database literature as a tool for many applications, for example, bioinformatics. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high-dimensional data, however, many of the dimensions are often irrelevant. Projected clustering was proposed for this reason. A projected cluster is a subset C of data points together with a subset D of dimensions such that the points in C are closely clustered in the subspace of dimensions D. Many algorithms have been proposed to find projected clusters. Most of them fall into three categories: partitioning, density-based, and hierarchical. The DOC algorithm is one of the well-known density-based algorithms for projected clustering. It uses a Monte Carlo algorithm to compute projected clusters iteratively, and proposes a formula to calculate the quality of a cluster. The FPC algorithm is an extended version of the DOC algorithm that uses large itemset mining to find the dimensions of a projected cluster. Finding large itemsets is the main goal of mining association rules, where a large itemset is a combination of items whose number of occurrences in the dataset is greater than a given threshold. Although the FPC algorithm uses large itemset mining to speed up the search for projected clusters, it still needs many user-specified parameters. Moreover, in its first step, the FPC algorithm applies a random approach several times to choose the medoid, which takes a long time and may still find a bad medoid. Furthermore, the quality of a cluster can be calculated in more detail if the weight of each dimension is taken into consideration. Therefore, in this thesis, we propose an algorithm that addresses these disadvantages.
First, we observe the relationship between the parameters and propose a parameter-relationship-based algorithm that needs only two parameters, instead of the three needed by most projected clustering algorithms. Next, our algorithm chooses the medoid using the median; it chooses the medoid only once, and the quality of our cluster is better than that of the FPC algorithm. Finally, our quality measure considers the weight of each dimension of the cluster and gives different values according to the number of occurrences of each dimension. This formula makes the quality of projected clustering based on our algorithm better than that of the FPC algorithm, and it prevents a cluster from containing too many irrelevant dimensions. Our simulation results show that our algorithm is better than the FPC algorithm in terms of execution time and clustering quality.
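The definition of a projected cluster (points C that are close in a subset D of dimensions) can be sketched in the style commonly described for DOC: pick a medoid, keep the dimensions on which a small sample stays within a width w of the medoid, then collect every point within w of the medoid on those dimensions. This is a hedged illustration, not the thesis's algorithm; the function names, the width parameter `w`, and the quality formula n * (1/beta)^d (recalled from the DOC literature, not stated in the abstract) are assumptions.

```python
def relevant_dimensions(medoid, sample, w):
    """Dimensions on which every sampled point lies within w of the medoid."""
    return [
        d for d in range(len(medoid))
        if all(abs(x[d] - medoid[d]) <= w for x in sample)
    ]


def projected_cluster(points, medoid, dims, w):
    """Points lying within w of the medoid on every relevant dimension."""
    return [p for p in points if all(abs(p[d] - medoid[d]) <= w for d in dims)]


def quality(n_points, n_dims, beta):
    """Assumed DOC-style trade-off between cluster size and dimensionality.

    beta in (0, 0.5] controls how much one extra dimension is worth.
    """
    return n_points * (1.0 / beta) ** n_dims
```

A weighted variant, as the abstract proposes, would replace the plain dimension count with per-dimension weights; that refinement is not sketched here because the abstract does not give its formula.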

A Large Itemset-Based Approach to Mining Subspace Clusters from DNA Microarray Data

Tsai, Yueh-Chi 20 June 2008 (has links)
DNA microarrays are one of the latest breakthroughs in experimental molecular biology and have opened the possibility of creating datasets of molecular information to represent many systems of biological or clinical interest. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Investigations show that, more often than not, several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels are similar under a subset of conditions. Most subspace clustering models define similarity among objects by distances over either all or only a subset of the dimensions. However, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance functions. Many techniques, such as pCluster and zCluster, have been proposed to find subspace clusters with coherent expression of a subset of genes on a subset of conditions. However, both contain time-consuming steps: constructing gene-pair MDSs and distributing the gene information in each node of a prefix tree. Therefore, in this thesis, we propose a Large Itemset-Based Clustering (LISC) algorithm to overcome the disadvantages of the pCluster and zCluster algorithms. First, we avoid constructing the gene-pair MDSs and construct only the condition-pair MDSs, which reduces the processing time. Second, we transform the task of mining the possible maximal gene sets into the problem of mining large itemsets from the condition-pair MDSs. We use the concept of the large itemset from association rule mining, where a large itemset is a set of items appearing in a sufficient number of transactions.
Since we are only interested in subspace clusters with gene sets as large as possible, it is desirable to focus on the gene sets that have reasonably large support from the condition-pair MDSs. In other words, we want to find the large itemsets from the condition-pair MDSs, and thereby obtain the gene sets supported by enough condition pairs. In this step, we use a revised version of the FP-tree structure, which has been shown to be one of the most efficient data structures for mining large itemsets, to find the large itemsets of gene sets from the condition-pair MDSs. Thus, by using the FP-tree structure, we avoid the complex distributing operation and reduce the search space dramatically. Finally, we develop an algorithm to construct the final clusters from the gene sets and the condition pairs after searching the FP-tree. Since we are interested in clusters that are large enough and not contained in any other cluster, we alternately combine or extend the gene sets and the condition sets to construct interesting subspace clusters that are as large as possible. Our simulation results show that our proposed algorithm needs less processing time than the previously proposed algorithms, since they need to construct gene-pair MDSs.

A distributed approach to Frequent Itemset Mining at low support levels

Clark, Neal 22 December 2014 (has links)
Frequent Itemset Mining, the process of finding frequently co-occurring sets of items in a dataset, has been at the core of the field of data mining for the past 25 years. During this time, datasets have grown much faster than the algorithms' capacity to process them. Great progress was made in optimizing this task on a single computer; however, despite years of research, very little progress has been made on parallelizing it. FPGrowth-based algorithms have proven notoriously difficult to parallelize, and Apriori has largely fallen out of favor with the research community. In this thesis we introduce a parallel, Apriori-based Frequent Itemset Mining algorithm capable of distributing computation across large commodity clusters. Our case study demonstrates that our algorithm can efficiently scale to hundreds of cores on a standard Hadoop MapReduce cluster, and can improve execution times by at least an order of magnitude at the lowest support levels. / Graduate / 0984 / 0800 / nclark@uvic.ca
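The reason Apriori lends itself to distribution is that candidate counting is independent per data partition: each partition counts the candidates locally (a map step) and the partial counts are summed (a reduce step). The sketch below shows that structure in plain, single-process Python; it is a generic level-wise Apriori under those assumptions, not the thesis's Hadoop implementation, and the names `map_count` and `apriori` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map step: count each candidate itemset within one data partition."""
    counts = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:
                counts[cand] += 1
    return counts


def apriori(partitions, min_count):
    """Level-wise mining; partial counts from all partitions are summed."""
    items = {i for part in partitions for t in part for i in t}
    candidates = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while candidates:
        total = Counter()
        for part in partitions:          # reduce step: merge partial counts
            total.update(map_count(part, candidates))
        level = {c: n for c, n in total.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that have
        # an infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

In a real cluster the loop over `partitions` would run on separate workers; here it runs sequentially only to keep the example self-contained.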


Tirupattur, Naveen 16 August 2011 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Text mining is the process of extracting high-quality knowledge from the analysis of textual data. Rapidly growing interest in and focus on research in many fields is producing an overwhelming amount of research literature. This literature is a vast source of knowledge, but because of its huge volume, it is practically impossible for researchers to extract the knowledge manually. Hence, there is a need for an automated approach to extracting knowledge from unstructured data, and text mining is the right approach for automated extraction of knowledge from textual data. The objective of this thesis is to mine documents from the research literature to find novel associations among the entities appearing in that literature, using incremental mining. Traditional text mining approaches provide binary associations, but it is important to understand the context in which these associations occur: for example, entity A is associated with entity B in the context of entity C. These contexts can be visualized as multi-way associations among the entities, which are represented by a hypergraph. This thesis discusses extracting such multi-way associations among entities using Frequent Itemset Mining, and applies a new concept called output space sampling to extract them in a space- and time-efficient manner. We incorporated personalization into output space sampling so that users can specify their interests as the frequent hyper-associations are extracted from the text.

Knowledge Accelerated Algorithms and the Knowledge Cache

Goyder, Matthew 19 July 2012 (has links)
No description available.
