1.
An Efficient Parameter-Relationship-Based Approach for Projected Clustering. Huang, Tsun-Kuei, 16 June 2008
The clustering problem has been discussed extensively in the database literature as a tool for many applications, for example, bioinformatics. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object. In high-dimensional data, however, many of the dimensions are often irrelevant; projected clustering has therefore been proposed. A projected cluster is a subset C of data points together with a subset D of dimensions such that the points in C are closely clustered in the subspace of dimensions D. Many algorithms have been proposed to find projected clusters, and most of them fall into one of three classes: partitioning, density-based, and hierarchical. The DOC algorithm is a well-known density-based algorithm for projected clustering. It uses a Monte Carlo procedure to iteratively compute projected clusters and proposes a formula to measure the quality of a cluster. The FPC algorithm is an extended version of the DOC algorithm; it applies large-itemset mining to find the dimensions of a projected cluster. Finding large itemsets is the main goal of mining association rules, where a large itemset is a combination of items whose number of occurrences in the dataset exceeds a given threshold. Although the FPC algorithm uses large-itemset mining to speed up the discovery of projected clusters, it still requires many user-specified parameters. Moreover, in its first step, the FPC algorithm chooses the medoid by repeated random sampling, which takes a long time and may still yield a bad medoid. Furthermore, the cluster-quality measure can be refined by taking the weight of each dimension into consideration. Therefore, in this thesis, we propose an algorithm that remedies these disadvantages. First, we observe the relationship between parameters and propose a parameter-relationship-based algorithm that needs only two parameters, instead of the three required by most projected clustering algorithms. Next, our algorithm chooses the medoid using the median; we choose the medoid only once, and the quality of the resulting cluster is better than that of the FPC algorithm. Finally, our quality measure considers the weight of each dimension of the cluster, assigning different values according to the number of occurrences of each dimension. This makes the quality of the projected clustering produced by our algorithm better than that of the FPC algorithm, since it avoids clusters that contain too many irrelevant dimensions. Our simulation results show that our algorithm outperforms the FPC algorithm in terms of both execution time and clustering quality.
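The DOC quality measure referred to above is commonly stated as mu(|C|, |D|) = |C| * (1/beta)^|D|. The abstract does not give the thesis's exact dimension-weighted formula, so the weighted variant below (weights derived from dimension occurrence counts) is an illustrative assumption, a minimal sketch rather than the actual proposed measure:

```python
def doc_quality(num_points: int, num_dims: int, beta: float = 0.25) -> float:
    """DOC-style cluster quality: mu(|C|, |D|) = |C| * (1/beta)**|D|.

    beta in (0, 1) trades off cluster size |C| against the number of
    relevant dimensions |D|; a smaller beta favors more dimensions.
    """
    return num_points * (1.0 / beta) ** num_dims


def weighted_quality(num_points: int, dim_occurrences: dict,
                     beta: float = 0.25) -> float:
    """Hypothetical dimension-weighted variant: each dimension contributes
    an exponent proportional to how often it is supported among the sampled
    points, so weakly supported (likely irrelevant) dimensions add little.
    The exact weighting in the thesis may differ.
    """
    if not dim_occurrences:
        return float(num_points)
    max_occ = max(dim_occurrences.values())
    # A fully supported dimension counts as 1; a half-supported one as 0.5.
    effective_dims = sum(occ / max_occ for occ in dim_occurrences.values())
    return num_points * (1.0 / beta) ** effective_dims
```

Under this sketch, a cluster whose dimensions are all strongly supported scores the same as in plain DOC, while clusters padded with weakly supported dimensions gain far less quality, which matches the stated goal of penalizing irrelevant dimensions.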
2.
A Large Itemset-Based Approach to Mining Subspace Clusters from DNA Microarray Data. Tsai, Yueh-Chi, 20 June 2008
DNA microarrays are one of the latest breakthroughs in experimental molecular biology and have opened the possibility of creating datasets of molecular information that represent many systems of biological or clinical interest. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Investigations show that, more often than not, several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels are similar under a subset of conditions. Most subspace clustering models define similarity among objects by distances over either all or only a subset of the dimensions. However, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by distance functions. Many techniques, such as pCluster and zCluster, have been proposed to find subspace clusters in which a subset of genes exhibits coherent expression on a subset of conditions. However, both of them contain time-consuming steps: constructing gene-pair MDSs and distributing the gene information across the nodes of a prefix tree. Therefore, in this thesis, we propose a Large Itemset-Based Clustering (LISC) algorithm that avoids these disadvantages of the pCluster and zCluster algorithms. First, we do not construct gene-pair MDSs; we construct only the condition-pair MDSs, which reduces the processing time. Second, we transform the task of mining the possible maximal gene sets into the problem of mining large itemsets from the condition-pair MDSs. We make use of the concept of a large itemset from association-rule mining, where a large itemset is a set of items appearing in a sufficient number of transactions.
Since we are interested only in subspace clusters with gene sets that are as large as possible, it is desirable to focus on gene sets that have reasonably large support from the condition-pair MDSs. In other words, we want to find the large itemsets from the condition-pair MDSs, so that each gene set we obtain is supported by enough condition pairs. In this step, we use a revised version of the FP-tree structure, which has been shown to be one of the most efficient data structures for mining large itemsets, to find the large gene sets from the condition-pair MDSs. Using the FP-tree, we avoid the complex distribution operation and reduce the search space dramatically. Finally, we develop an algorithm that constructs the final clusters from the gene sets and condition pairs found by searching the FP-tree. Since we are interested in clusters that are as large as possible and not contained in any other cluster, we alternately combine or extend the gene sets and condition sets to construct maximal subspace clusters. Our simulation results show that the proposed algorithm needs less processing time than previously proposed algorithms, since those algorithms must construct gene-pair MDSs.
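The large-itemset notion that LISC builds on can be illustrated with a brute-force miner over condition-pair transactions. This is a sketch of the concept only: the thesis uses a revised FP-tree rather than the enumeration below, and the encoding of each condition-pair MDS as a set of gene identifiers is an assumption made for illustration.

```python
from itertools import combinations


def large_itemsets(transactions, min_support, max_size=3):
    """Return all itemsets (up to max_size items) appearing in at least
    min_support transactions, mapped to their support counts.

    Brute-force enumeration to illustrate the 'large itemset' concept;
    an FP-tree-based miner reaches the same result far more efficiently.
    """
    sets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*sets))
    result = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            s = frozenset(cand)
            support = sum(1 for t in sets if s <= t)
            if support >= min_support:
                result[s] = support
    return result


# Each "transaction" stands for one condition-pair MDS listing the genes
# (hypothetical IDs g1..g3) coherent on that condition pair.
mds = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g3"}, {"g2", "g3"}]
frequent = large_itemsets(mds, min_support=3)
```

Here a gene set surviving the support threshold corresponds to a candidate cluster gene set supported by enough condition pairs.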
3.
A Study on Improving Efficiency of Privacy-Preserving Utility Mining. Wong, Jia-Wei, 11 September 2012
Utility mining algorithms have recently been proposed to discover high utility itemsets from a quantitative database. Factors such as profits or prices are taken into account when measuring the utility values of purchased items, revealing more useful knowledge to managers. Nearly all existing algorithms extract high utility itemsets in a batch manner. In real-world applications, however, transactions may be inserted into, deleted from, or modified in a database, and the batch mining procedure then requires considerable computational time to rescan the whole updated database in order to keep the knowledge up to date. In the first part of this thesis, two algorithms, one for data insertion and one for data deletion, are proposed for efficiently updating the discovered high utility itemsets based on the pre-large concept. The proposed algorithms first partition itemsets into three parts, yielding nine cases, according to whether their transaction-weighted utilization in the original database is large (high), pre-large, or small. Each part is then handled by its own procedure to maintain and update the discovered high utility itemsets. Thanks to the pre-large concept, the original database needs to be rescanned for far fewer itemsets during the maintenance of high utility itemsets.
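The pre-large concept rests on two thresholds, an upper and a lower one, so an itemset falls into one of three classes in the original database and one of three in the inserted or deleted transactions, giving the nine cases mentioned above. The following is a minimal sketch of that classification; the threshold values and the case numbering are illustrative assumptions, not the thesis's exact scheme:

```python
def classify(ratio: float, upper: float, lower: float) -> str:
    """Classify an itemset's transaction-weighted utilization ratio as
    large, pre-large, or small under the two pre-large thresholds."""
    if ratio >= upper:
        return "large"
    if ratio >= lower:
        return "pre-large"
    return "small"


def case_number(orig_ratio: float, new_ratio: float,
                upper: float, lower: float) -> int:
    """Nine cases: 3 classes in the original database x 3 classes in the
    newly inserted (or deleted) transactions. The 1..9 ordering here is
    arbitrary; what matters is which (original, new) pair an itemset hits,
    since only some pairs force a rescan of the original database."""
    classes = ["large", "pre-large", "small"]
    i = classes.index(classify(orig_ratio, upper, lower))
    j = classes.index(classify(new_ratio, upper, lower))
    return 3 * i + j + 1
```

Itemsets that stay within the large or pre-large bands can be maintained from the kept counts alone; only the few that cross from small toward large trigger a rescan, which is why the pre-large approach reduces maintenance cost.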
Besides, the risk of privacy threats usually exists in the processes of data collection and data dissemination. Sensitive or personal information must be kept private before the data are shared or published. Privacy-preserving utility mining (PPUM) has thus become an important issue in recent years. In the second part of this thesis, two evolutionary privacy-preserving utility mining algorithms are proposed to hide sensitive high utility itemsets during data sanitization, one by inserting dummy transactions and one by deleting transactions. The two algorithms find appropriate transactions for insertion or deletion in the data-sanitization process. They adopt a flexible evaluation function with three factors, assigning different weights to the factors according to the users' preferences. The maintenance algorithms proposed in the first part of this thesis are also used in the GA-based approach to reduce the cost of rescanning databases, thus speeding up the evaluation of chromosomes. Experiments are also conducted to evaluate the performance of the proposed algorithms.
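The three-factor evaluation function is described only abstractly here. A minimal sketch of such a weighted fitness for a sanitization chromosome follows; the factor names (hiding failures, missing itemsets, artificial itemsets, common side-effect measures in the sanitization literature) and the default weights are assumptions, not the thesis's actual design:

```python
def fitness(hiding_failures: int, missing_itemsets: int,
            artificial_itemsets: int,
            w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """Weighted sum of three sanitization side effects (lower is better):
    sensitive itemsets still discoverable, non-sensitive itemsets lost,
    and spurious itemsets introduced by the sanitization. The weights
    reflect user preference among the three factors."""
    return (w1 * hiding_failures
            + w2 * missing_itemsets
            + w3 * artificial_itemsets)
```

In a GA of this kind, each chromosome encodes a candidate set of transactions to insert or delete; evaluating it requires re-mining the sanitized database, which is exactly the step the pre-large maintenance algorithms from the first part are used to accelerate.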