Global ETD Search

1	Advances in categorical data clustering Zhang, Yiqun 29 August 2019 (has links) Categorical data are common in various research areas, and clustering is a prevalent technique used for analyse them. However, two challenging problems are encountered in categorical data clustering analysis. The first is that most categorical data distance metrics were actually proposed for nominal data (i.e., a categorical data set that comprises only nominal attributes), ignoring the fact that ordinal attributes are also common in various categorical data sets. As a result, these nominal data distance metrics cannot account for the order information of ordinal attributes and may thus inappropriately measure the distances for ordinal data (i.e., a categorical data set that comprises only ordinal attributes) and mixed categorical data (i.e., a categorical data set that comprises both ordinal and nominal attributes). The second problem is that most hierarchical clustering approaches were actually designed for numerical data and have very high computation costs; that is, with time complexity O(N2) for a data set with N data objects. These issues have presented huge obstacles to the clustering analysis of categorical data. To address the ordinal data distance measurement problem, we studied the characteristics of ordered possible values (also called 'categories' interchangeably in this thesis) of ordinal attributes and propose a novel ordinal data distance metric, which we call the Entropy-Based Distance Metric (EBDM), to quantify the distances between ordinal categories. The EBDM adopts cumulative entropy as a measure to indicate the amount of information in the ordinal categories and simulates the thinking process of changing one's mind between two ordered choices to quantify the distances according to the amount of information in the ordinal categories. The order relationship and the statistical information of the ordinal categories are both considered by the EBDM for more appropriate distance measurement. Experimental results illustrate the superiority of the proposed EBDM in ordinal data clustering. In addition to designing an ordinal data distance metric, we further propose a unified categorical data distance metric that is suitable for distance measurement of all three types of categorical data (i.e., ordinal data, nominal data, and mixed categorical data). The extended version uniformly defines distances and attribute weights for both ordinal and nominal attributes, by which the distances measured for the two types of attributes of a mixed categorical data can be directly combined to obtain the overall distances between data objects with no information loss. Extensive experiments on all three types of categorical data sets demonstrate the effectiveness of the unified distance metric in clustering analysis of categorical data. To address the hierarchical clustering problem of large-scale categorical data, we propose a fast hierarchical clustering framework called the Growing Multi-layer Topology Training (GMTT). The most significant merit of this framework is its ability to reduce the time complexity of most existing hierarchical clustering frameworks (i.e., O(N2)) to O(N1.5) without sacrificing the quality (i.e., clustering accuracy and hierarchical details) of the constructed hierarchy. According to our design, the GMTT framework is applicable to categorical data clustering simply by adopting a categorical data distance metric. To make the GMTT framework suitable for the processing of streaming categorical data, we also provide an incremental version of GMTT that can dynamically adopt new inputs into the hierarchy via local updating. Theoretical analysis proves that the GMTT frameworks have time complexity O(N1.5). Extensive experiments show the efficacy of the GMTT frameworks and demonstrate that they achieve more competitive categorical data clustering performance by adopting the proposed unified distance metric. Cluster analysis ; Data processing
2	Co-clustering algorithms : extensions and applications Cho, Hyuk 07 September 2012 (has links) Co-clustering is rather a recent paradigm for unsupervised data analysis, but it has become increasingly popular because of its potential to discover latent local patterns, otherwise unapparent by usual unsupervised algorithms such as k-means. Wide deployment of co-clustering, however, requires addressing a number of practical challenges such as data transformation, cluster initialization, scalability, and so on. Therefore, this thesis focuses on developing sophisticated co-clustering methodologies to maturity and its ultimate goal is to promote co-clustering as an invaluable and indispensable unsupervised analysis tool for varied practical applications. To achieve this goal, we explore the three specific tasks: (1) development of co-clustering algorithms to be functional, adaptable, and scalable (co-clustering algorithms); (2) extension of co-clustering algorithms to incorporate application-specific requirements (extensions); and (3) application of co-clustering algorithms broadly to existing and emerging problems in practical application domains (applications). As for co-clustering algorithms, we develop two fast Minimum Sum-Squared Residue Co-clustering (MSSRCC) algorithms [CDGS04], which simultaneously cluster data points and features via an alternating minimization scheme and generate co-clusters in a “checkerboard” structure. The first captures co-clusters with constant values, while the other discovers co-clusters with coherent “trends” as well as constant values. We note that the proposed algorithms are two special cases (bases 2 and 6 with Euclidean distance, respectively) of the general co-clustering framework, Bregman Co-clustering (BCC) [BDG+07], which contains six Euclidean BCC and six I-divergence BCC algorithms. Then, we substantially enhance the performance of the two MSSRCC algorithms by escaping from poor local minima and resolving the degeneracy problem of generating empty clusters in partitional clustering algorithms through the three specific strategies: (1) data transformation; (2) deterministic spectral initialization; and (3) local search strategy. Concerning co-clustering extensions, we investigate general algorithmic strategies for the general BCC framework, since it is applicable to a large class of distance measures and data types. We first formalize various data transformations for datasets with varied scaling and shifting factors, mathematically justify their effects on the six Euclidean BCC algorithms, and empirically validate the analysis results. We also adapt the local search strategy, initially developed for the two MSSRCC algorithms, to all the twelve BCC algorithms. Moreover, we consider variations of cluster assignments and cluster updates, including greedy vs. non-greedy cluster assignment, online vs. batch cluster update, and so on. Furthermore, in order to provide better scalability and usability, we parallelize all the twelve BCC algorithms, which are capable of co-clustering large-scaled datasets over multiple processors. Regarding co-clustering applications, we extend the functionality of BCC to incorporate application-specific requirements: (1) discovery of inverted patterns, whose goal is to find anti-correlation; (2) discovery of coherent co-clusters from noisy data, whose purpose is to do dimensional reduction and feature selection; and (3) discovery of patterns from time-series data, whose motive is to guarantee critical time-locality. Furthermore, we employ co-clustering to pervasive computing for mobile devices, where the task is to extract latent patterns from usage logs as well as to recognize specific situations of mobile-device users. Finally, we demonstrate the applicability of our proposed algorithms for aforementioned applications through empirical results on various synthetic and real-world datasets. In summary, we present co-clustering algorithms to discover latent local patterns, propose their algorithmic extensions to incorporate specific requirements, and provide their applications to a wide range of practical domains. / text Cluster analysis--Data processing Algorithms
3	A study on privacy-preserving clustering Cui, Yingjie., 崔英杰. January 2009 (has links) published_or_final_version / Computer Science / Master / Master of Philosophy Database security. Cluster analysis - Data processing.
4	Clustering uncertain data using Voronoi diagram Lee, King-for, Foris., 李敬科. January 2009 (has links) published_or_final_version / Computer Science / Master / Master of Philosophy Cluster analysis - Data processing. Voronoi polygons. Algorithms.
5	Entropy-based subspace clustering for mining numerical data. January 1999 (has links) by Cheng, Chun-hung. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. / Includes bibliographical references (leaves 72-76). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgments --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Six Tasks of Data Mining --- p.1 / Chapter 1.1.1 --- Classification --- p.2 / Chapter 1.1.2 --- Estimation --- p.2 / Chapter 1.1.3 --- Prediction --- p.2 / Chapter 1.1.4 --- Market Basket Analysis --- p.3 / Chapter 1.1.5 --- Clustering --- p.3 / Chapter 1.1.6 --- Description --- p.3 / Chapter 1.2 --- Problem Description --- p.4 / Chapter 1.3 --- Motivation --- p.5 / Chapter 1.4 --- Terminology --- p.7 / Chapter 1.5 --- Outline of the Thesis --- p.7 / Chapter 2 --- Survey on Previous Work --- p.8 / Chapter 2.1 --- Data Mining --- p.8 / Chapter 2.1.1 --- Association Rules and its Variations --- p.9 / Chapter 2.1.2 --- Rules Containing Numerical Attributes --- p.15 / Chapter 2.2 --- Clustering --- p.17 / Chapter 2.2.1 --- The CLIQUE Algorithm --- p.20 / Chapter 3 --- Entropy and Subspace Clustering --- p.24 / Chapter 3.1 --- Criteria of Subspace Clustering --- p.24 / Chapter 3.1.1 --- Criterion of High Density --- p.25 / Chapter 3.1.2 --- Correlation of Dimensions --- p.25 / Chapter 3.2 --- Entropy in a Numerical Database --- p.27 / Chapter 3.2.1 --- Calculation of Entropy --- p.27 / Chapter 3.3 --- Entropy and the Clustering Criteria --- p.29 / Chapter 3.3.1 --- Entropy and the Coverage Criterion --- p.29 / Chapter 3.3.2 --- Entropy and the Density Criterion --- p.31 / Chapter 3.3.3 --- Entropy and Dimensional Correlation --- p.33 / Chapter 4 --- The ENCLUS Algorithms --- p.35 / Chapter 4.1 --- Framework of the Algorithms --- p.35 / Chapter 4.2 --- Closure Properties --- p.37 / Chapter 4.3 --- Complexity Analysis --- p.39 / Chapter 4.4 --- Mining Significant Subspaces --- p.40 / Chapter 4.5 --- Mining Interesting Subspaces --- p.42 / Chapter 4.6 --- Example --- p.44 / Chapter 5 --- Experiments --- p.49 / Chapter 5.1 --- Synthetic Data --- p.49 / Chapter 5.1.1 --- Data Generation ´ؤ Hyper-rectangular Data --- p.49 / Chapter 5.1.2 --- Data Generation ´ؤ Linearly Dependent Data --- p.50 / Chapter 5.1.3 --- Effect of Changing the Thresholds --- p.51 / Chapter 5.1.4 --- Effectiveness of the Pruning Strategies --- p.53 / Chapter 5.1.5 --- Scalability Test --- p.53 / Chapter 5.1.6 --- Accuracy --- p.55 / Chapter 5.2 --- Real-life Data --- p.55 / Chapter 5.2.1 --- Census Data --- p.55 / Chapter 5.2.2 --- Stock Data --- p.56 / Chapter 5.3 --- Comparison with CLIQUE --- p.58 / Chapter 5.3.1 --- Subspaces with Uniform Projections --- p.60 / Chapter 5.4 --- Problems with Hyper-rectangular Data --- p.62 / Chapter 6 --- Miscellaneous Enhancements --- p.64 / Chapter 6.1 --- Extra Pruning --- p.64 / Chapter 6.2 --- Multi-resolution Approach --- p.65 / Chapter 6.3 --- Multi-threshold Approach --- p.68 / Chapter 7 --- Conclusion --- p.70 / Bibliography --- p.71 / Appendix --- p.77 / Chapter A --- Differential Entropy vs Discrete Entropy --- p.77 / Chapter A.1 --- Relation of Differential Entropy to Discrete Entropy --- p.78 / Chapter B --- Mining Quantitative Association Rules --- p.80 / Chapter B.1 --- Approaches --- p.81 / Chapter B.2 --- Performance --- p.82 / Chapter B.3 --- Final Remarks --- p.83 Data mining Cluster analysis--Data processing
6	A new approach of classification of time series database. January 2011 (has links) Chan, Hon Kit. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (p. 57-59). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Cluster Analysis in Time Series --- p.1 / Chapter 1.2 --- Dissimilarity Measure --- p.2 / Chapter 1.2.1 --- Euclidean Distance --- p.3 / Chapter 1.2.2 --- Pearson's Correlation Coefficient --- p.3 / Chapter 1.2.3 --- Other Measure --- p.4 / Chapter 1.3 --- Summary --- p.5 / Chapter 2 --- Algorithm and Methodology --- p.8 / Chapter 2.1 --- Algorithm and Methodology --- p.8 / Chapter 2.2 --- Illustrative Examples --- p.14 / Chapter 3 --- Simulation Study --- p.20 / Chapter 3.1 --- Simulation Plan --- p.20 / Chapter 3.2 --- Measure of Performance --- p.24 / Chapter 3.3 --- Simulation Results --- p.27 / Chapter 3.4 --- Results of k-means Clustering --- p.33 / Chapter 4 --- Application on Gene Expression --- p.37 / Chapter 4.1 --- Dataset --- p.37 / Chapter 4.2 --- Parameter Settings --- p.38 / Chapter 4.3 --- Results --- p.38 / Chapter 5 --- Conclusion and Further Research --- p.55 Cluster analysis--Data processing Time-series analysis
7	Robust methods for locating multiple dense regions in complex datasets Gupta, Gunjan Kumar 28 August 2008 (has links) Not available / text Cluster analysis--Data processing Computer algorithms
8	Aggregate programming in large scale linear systems Taylor, Richard Winthrop 08 1900 (has links) No description available. Cluster analysis Data processing Linear programming
9	Clustering with genetic algorithms Cole, Rowena Marie January 1998 (has links) Clustering is the search for those partitions that reflect the structure of an object set. Traditional clustering algorithms search only a small sub-set of all possible clusterings (the solution space) and consequently, there is no guarantee that the solution found will be optimal. We report here on the application of Genetic Algorithms (GAs) -- stochastic search algorithms touted as effective search methods for large and complex spaces -- to the problem of clustering. GAs which have been made applicable to the problem of clustering (by adapting the representation, fitness function, and developing suitable evolutionary operators) are known as Genetic Clustering Algorithms (GCAs). There are two parts to our investigation of GCAs: first we look at clustering into a given number of clusters. The performance of GCAs on three generated data sets, analysed using 4320 differing combinations of adaptions, establishes their efficacy. Choice of adaptions and parameter settings is data set dependent, but comparison between results using generated and real data sets indicate that performance is consistent for similar data sets with the same number of objects, clusters, attributes, and a similar distribution of objects. Generally, group-number representations are better suited to the clustering problem, as are dynamic scaling, elite selection and high mutation rates. Independent generalised models fitted to the correctness and timing results for each of the generated data sets produced accurate predictions of the performance of GCAs on similar real data sets. While GCAs can be successfully adapted to clustering, and the method produces results as accurate and correct as traditional methods, our findings indicate that, given a criterion based on simple distance metrics, GCAs provide no advantages over traditional methods. Second, we investigate the potential of genetic algorithms for the more general clustering problem, where the number of clusters is unknown. We show that only simple modifications to the adapted GCAs are needed. We have developed a merging operator, which with elite selection, is employed to evolve an initial population with a large number of clusters toward better clusterings. With regards to accuracy and correctness, these GCAs are more successful than optimisation methods such as simulated annealing. However, such GCAs can become trapped in local minima in the same manner as traditional hierarchical methods. Such trapping is characterised by the situation where good (k-1)-clusterings do not result from our merge operator acting on good k-clusterings. A marked improvement in the algorithm is observed with the addition of a local heuristic. Cluster analysis -- Data processing Genetic algorithms
10	Bayesian decision theoretical framework for clustering. / CUHK electronic theses & dissertations collection January 2011 (has links) By the Bayesian decision theoretical view, we propose several extensions of current popular graph based methods. Several data-dependent graph construction approaches are proposed by adopting more flexible density estimators. The advantage of these approaches is that the parameters for constructing the graph can be estimated from the data. The constructed graph explores the intrinsic distribution of the data. As a result, the algorithm is more robust. It can obtain good performance constantly across different data sets. Using the flexible density models can result in directed graphs which cannot be handled by traditional graph partitioning algorithms. To tackle this problem, we propose general algorithms for graph partitioning, which can deal with both undirected and directed graphs in a unified way. / In this thesis, we establish a novel probabilistic framework for the data clustering problem from the perspective of Bayesian decision theory. The Bayesian decision theory view justifies the important questions: what is a cluster and what a clustering algorithm should optimize. / We prove that the spectral clustering (to be specific, the normalized cut) algorithm can be derived from this framework. Especially, it can be shown that the normalized cut is a nonparametric clustering method which adopts a kernel density estimator as its density model and tries to minimize the expected classification error or Bayes risk. / Chen, Mo. / Adviser: Xiaoou Tang. / Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 96-104). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. Bayesian statistical decision theory Cluster analysis--Data processing

Search results