DNA microarrays are one of the latest breakthroughs in experimental molecular biology and have made it possible to create datasets of molecular information that represent many systems of biological or clinical interest. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Investigations show that, more often than not, several genes contribute to a disease, which motivates researchers to identify subsets of genes whose expression levels are similar under a subset of conditions. Most subspace clustering models define similarity among objects by distances over either all or only a subset of the dimensions. However, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance functions. Many techniques, such as pCluster and zCluster, have been proposed to find subspace clusters in which a subset of genes shows coherent expression on a subset of conditions. However, both contain time-consuming steps: constructing gene-pair MDSs and distributing the gene information to each node of a prefix tree. Therefore, in this thesis, we propose a Large Itemset-Based Clustering (LISC) algorithm to overcome these disadvantages of the pCluster and zCluster algorithms. First, we avoid constructing the gene-pair MDSs and construct only the condition-pair MDSs, which reduces the processing time. Second, we transform the task of mining the possible maximal gene sets into the problem of mining large itemsets from the condition-pair MDSs. We make use of the concept of the large itemset from association rule mining, where a large itemset is a set of items that appears in a sufficient number of transactions.
Since we are only interested in subspace clusters whose gene sets are as large as possible, it is desirable to pay attention to those gene sets that have reasonably large support from the condition-pair MDSs. In other words, we want to find the large itemsets from the condition-pair MDSs, and thereby obtain the gene sets that are supported by enough condition pairs. In this step, we use a revised version of the FP-tree structure, which has been shown to be one of the most efficient data structures for mining large itemsets, to find the large itemsets of gene sets from the condition-pair MDSs. By using the FP-tree structure, we avoid the complex distributing operation and reduce the search space dramatically. Finally, we develop an algorithm that constructs the final clusters from the gene sets and the condition pairs obtained by searching the FP-tree. Since we are interested in clusters that are large enough and do not belong to any other cluster, we alternately combine or extend the gene sets and the condition sets to construct interesting subspace clusters that are as large as possible. Our simulation results show that the proposed algorithm needs less processing time than the previously proposed algorithms, since those algorithms need to construct gene-pair MDSs.
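The large-itemset step described in the abstract can be illustrated with a small sketch. It is not the LISC algorithm itself: the thesis uses a revised FP-tree, while the code below swaps in a plain level-wise (Apriori-style) search because it is shorter to show. It also assumes, for illustration only, that each condition pair contributes a single set of coherent genes, treated as one transaction. The function name `large_itemsets`, the gene and condition labels, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Return {gene set: support} for every gene set that is contained
    in at least `min_support` transactions (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single genes and keep the frequent ones.
    counts = Counter(g for t in transactions for g in t)
    current = {frozenset([g]): c for g, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        items = sorted({g for s in current for g in s})
        next_level = {}
        for combo in combinations(items, k):
            # Pruning: every (k-1)-subset must already be large.
            if not all(frozenset(sub) in current
                       for sub in combinations(combo, k - 1)):
                continue
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                next_level[candidate] = support
        current = next_level
        result.update(current)
        k += 1
    return result


# Hypothetical condition-pair MDSs: condition pair -> genes coherent on it.
condition_pair_mds = {
    ("c1", "c2"): {"g1", "g2", "g3"},
    ("c1", "c3"): {"g1", "g2", "g4"},
    ("c2", "c3"): {"g1", "g2", "g3"},
    ("c2", "c4"): {"g2", "g3"},
}

found = large_itemsets(condition_pair_mds.values(), min_support=3)
for genes, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(genes), support)
```

In this toy input, the gene sets {g1, g2} and {g2, g3} are each supported by three condition pairs, whereas {g1, g3} is supported by only two and is dropped. An FP-tree-based miner would find the same large itemsets while avoiding repeated scans over the transactions.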
Identifier | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0620108-172233 |
Date | 20 June 2008 |
Creators | Tsai, Yueh-Chi |
Contributors | Gen-huey Chen, Chien-I Lee, Shian-Hua Lin, San-Yi Huang, Ye-In Chang |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0620108-172233 |
Rights | not_available, Copyright information available at source archive |