Global ETD Search

1	A Sliding-Window Approach to Mining Maximal Large Itemsets for Large Databases Chang, Yuan-feng 28 July 2004 (has links) Mining association rules, means a process of nontrivial extraction of implicit, previously and potentially useful information from data in databases. Mining maximal large itemsets is a further work of mining association rules, which aims to find the set of all subsets of large (frequent) itemsets that could be representative of all large itemsets. Previous algorithms to mining maximal large itemsets can be classified into two approaches: exhausted and shortcut. The shortcut approach could generate smaller number of candidate itemsets than the exhausted approach, resulting in better performance in terms of time and storage space. On the other hand, when updates to the transaction databases occur, one possible approach is to re-run the mining algorithm on the whole database. The other approach is incremental mining, which aims for efficient maintenance of discovered association rules without re-running the mining algorithms. However, previous algorithms for mining maximal large itemsets based on the shortcut approach can not support incremental mining for mining maximal large itemsets. While the algorithms for incremental mining, {it e.g.}, the SWF algorithm, could not efficiently support mining maximal large itemsets, since it is based on the exhausted approach. Therefore, in this thesis, we focus on the design of an algorithm which could provide good performance for both mining maximal itemsets and incremental mining. Based on some observations, for example, ``{it if an itemset is large, all its subsets must be large; therefore, those subsets need not to be examined further}", we propose a Sliding-Window approach, the SWMax algorithm, for efficiently mining maximal large itemsets and incremental mining. Our SWMax algorithm is a two-passes partition-based approach. We will find all candidate 1-itemsets ($C_1$), candidate 3-itemsets ($C_3$), large 1-itemsets ($L_1$), and large 3-itemsets ($L_3$) in the first pass. We generate the virtual maximal large itemsets after the first pass. Then, we use $L_1$ to generate $C_2$, use $L_3$ to generate $C_4$, use $C_4$ to generate $C_5$, until there is no $C_k$ generated. In the second pass, we use the virtual maximal large itemsets to prune $C_k$, and decide the maximal large itemsets. For incremental mining, we consider two cases: (1) data insertion, (2) data deletion. Both in Case 1 and Case 2, if an itemset with size equal to 1 is not large in the original database, it could not be found in the updated database based on the SWF algorithm. That is, a missing case could occur in the incremental mining process of the SWF algorithm, because the SWF algorithm only keeps the $C_2$ information. While our SWMax algorithm could support incremental mining correctly, since $C_1$ and $C_3$ are maintained in our algorithm. We generate some synthetic databases to simulate the real transaction databases in our simulation. From our simulation, the results show that our SWMax algorithm could generate fewer number of candidates and needs less time than the SWF algorithm. partition Incremental Mining Maximal Large Itemsets Data Mining Association Rules
2	An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray Datasets Lee, Li-Wen 05 July 2006 (has links) A DNA microarray is a very good tool to study the gene expression level in different situations. Mining association rules in DNA microarray datasets can help us know how genes affect each other, and what genes are usually co-expressed. Mining closed large itemsets can be useful for reducing the size of the mining result from the DNA microarray datasets, where a closed itemset is an itemset that there is no superset whose support value is the same as the support value of this itemset. Since the number of genes stored in columns is much larger than the number of samples stored in rows in a DNA microarray dataset, traditional mining methods which use column enumeration face a great challenge, where the column enumeration means that enumerating itemsets from the combinations of items stored in columns. Therefore, several row enumeration methods, e.g., RERII, have been proposed to solve this problem, where row enumeration means that enumerating itemsets from the combinations of items stored in rows. Although the RERII method saves more memory space and has better performance than the other row enumeration methods, it needs complex intersection operations at each node of the row enumeration tree to generate the closed itemsets. In this thesis, we propose a new method, UMiner, which is based on the union operations to mine the closed large itemsets in the DNA microarray datasets. Our approach is a row enumeration method. First, we add all tuples in the transposed table to a prefix tree, where a transposed table records the information about where an item appears in the original table. Next, we traverse this prefix tree to create a row-node table which records the information about a node and the related range of its child nodes in the prefix tree created from the transposed table. Then we generate the closed itemset by using the union operations on the itemsets in the item groups stored in the row-node table. Since we do not use the intersection operations to generate the closed itemset for each enumeration node, we can reduce the time complexity that is needed at each node of the row enumeration tree. Moreover, we develop four pruning techniques to reduce the number of candidate closed itemsets in the row enumeration tree. By replacing the complex intersection operations with the union operations and designing four pruning techniques to reduce the number of branches in the row enumeration tree, our method can find closed large itemsets very efficiently. In our performance study, we use three real datasets which are the clinical data on breast cancer, lung cancer, and AML-ALL. From the experiment results, we show that our UMiner method is always faster than the RERII method in all support values, no matter what the average length of the closed large itemsets is. Moreover, in our simulation result, we also show that the processing time of our method increases much more slowly than that of the RERII method as the average number of items in the rows of a dataset increases. row enumeration microarray large itemsets gene expression closed itemsets

Search results

A Sliding-Window Approach to Mining Maximal Large Itemsets for Large Databases

An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray Datasets