
An Efficient Parameter-Relationship-Based Approach for Projected Clustering

The clustering problem has been discussed extensively in the database literature as a tool for many applications, for example, bioinformatics. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high-dimensional data, however, many of the dimensions are often irrelevant. Projected clustering has been proposed to address this. A projected cluster is a subset C of data points together with a subset D of dimensions such that the points in C are closely clustered in the subspace spanned by the dimensions in D. Many algorithms have been proposed to find projected clusters, and most of them fall into three categories: partitioning, density-based, and hierarchical. The DOC algorithm is one of the well-known density-based algorithms for projected clustering. It uses a Monte Carlo algorithm to compute projected clusters iteratively, and proposes a formula to calculate the quality of a cluster. The FPC algorithm is an extended version of the DOC algorithm that uses a large-itemset mining approach to find the dimensions of a projected cluster. Finding large itemsets is the main goal of mining association rules, where a large itemset is a combination of items whose number of occurrences in the dataset is greater than a given threshold. Although the FPC algorithm uses large-itemset mining to speed up the search for projected clusters, it still needs many user-specified parameters to work. Moreover, in its first step, the FPC algorithm chooses the medoid by repeating a random selection several times, which takes a long time and may still yield a bad medoid. Furthermore, the quality of a cluster can be calculated in more detail if the weights of the dimensions are taken into consideration. Therefore, in this thesis, we propose an algorithm that addresses these disadvantages. First, we observe the relationship between the parameters and propose a parameter-relationship-based algorithm that needs only two parameters, instead of the three required by most projected clustering algorithms. Next, our algorithm chooses the medoid by using the median; the medoid is chosen only once, and the quality of our cluster is better than that obtained by the FPC algorithm. Finally, our quality measure formula considers the weight of each dimension of the cluster, and assigns different values according to the number of occurrences of each dimension. This formula makes the quality of the projected clustering produced by our algorithm better than that of the FPC algorithm, and it keeps a cluster from containing too many irrelevant dimensions. Our simulation results show that our algorithm is better than the FPC algorithm in terms of both execution time and clustering quality.
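
The ideas named in the abstract (a medoid chosen with the median, and dimensions found as a large itemset) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, whose details the abstract does not give. The function names, the L1 distance, the `width` parameter, the brute-force itemset enumeration, and the sample data are all assumptions made for illustration only.

```python
from itertools import combinations
from statistics import median


def median_medoid(points):
    """Return the data point closest (L1 distance) to the coordinate-wise median.

    Assumption: this is one plausible reading of "chooses the medoid with the median".
    """
    dims = range(len(points[0]))
    center = [median(p[d] for p in points) for d in dims]
    return min(points, key=lambda p: sum(abs(p[d] - center[d]) for d in dims))


def dimension_itemsets(points, medoid, width):
    """For each point, collect the dimensions in which it lies within `width` of the medoid."""
    return [
        frozenset(d for d in range(len(medoid)) if abs(p[d] - medoid[d]) <= width)
        for p in points
    ]


def large_itemsets(transactions, threshold):
    """Brute-force search for itemsets occurring in more than `threshold` transactions.

    Exponential in the number of items; a real miner would prune candidates.
    """
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count > threshold:
                found[candidate] = count
    return found


# Hypothetical data: four points agree on dimensions 0 and 1, dimension 2 is noise.
points = [
    (1.0, 1.0, 9.0),
    (1.2, 0.9, 2.0),
    (0.9, 1.1, 5.0),
    (1.1, 1.0, 7.0),
    (8.0, 6.0, 1.0),
]
medoid = median_medoid(points)
transactions = dimension_itemsets(points, medoid, width=0.5)
frequent = large_itemsets(transactions, threshold=3)

# D: the largest frequent set of dimensions; C: the points that support it.
D = max(frequent, key=len)
C = [p for p, t in zip(points, transactions) if D <= t]
print(medoid, sorted(D), C)
```

In this toy run the subset D contains dimensions 0 and 1, and C contains the four points that stay close to the medoid in both of them, which matches the definition of a projected cluster given above.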

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0616108-212404
Date: 16 June 2008
Creators: Huang, Tsun-Kuei
Contributors: Shian-Hua Lin, Gen-huey Chen, Ye-In Chang, Chien-I Lee, San-Yi Huang
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0616108-212404
Rights: not_available, Copyright information available at source archive
