Global ETD Search

1	Clustering High-dimensional Noisy Categorical and Mixed Data Zhiyi Tian (10925280) 27 July 2021 (has links) Clustering is an unsupervised learning technique widely used to group data into homogeneous clusters. For many real-world data containing categorical values, existing algorithms are often computationally costly in high dimensions, do not work well on noisy data with missing values, and rarely provide theoretical guarantees on clustering accuracy. In this thesis, we propose a general categorical data encoding method and a computationally efficient spectral based algorithm to cluster high-dimensional noisy categorical (nominal or ordinal) data. Under a statistical model for data on m attributes from n subjects in r clusters with missing probability epsilon, we show that our algorithm exactly recovers the true clusters with high probability when mn(1-epsilon) >= CMr<sup>2</sup> log<sup>3</sup>M, with M=max(n,m) and a fixed constant C. Moreover, we show that mn(1- epsilon)<sup>2</sup> >= r *delta/2 with 0< delta <1 is necessary for any algorithm to succeed with probability at least (1+delta)/2. In case, where m=n and r is fixed, for example, the sufficient condition matches with the necessary condition up to a polylog(n) factor, showing that our proposed algorithm is nearly optimal. We also show our algorithm outperforms several existing algorithms in both clustering accuracy and computational efficiency, both theoretically and numerically. In addition, we propose a spectral algorithm with standardization to cluster mixed data. This algorithm is computationally efficient and its clustering accuracy has been evaluated numerically on both real world data and synthetic data. Statistics Clustering Categorical values Noisy and sparse data High dimensions Spectral algorithm Clustering accuracy

Search results

Clustering High-dimensional Noisy Categorical and Mixed Data