
High-dimensional data mining: subspace clustering, outlier detection and applications to classification

Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering.
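The loss of differentiation between points described above can be illustrated with a small experiment. The sketch below is not taken from the thesis; it assumes uniformly random data and uses a hypothetical helper, `relative_contrast`, that measures how much the farthest neighbour of a query differs from the nearest one. As the dimensionality grows, this contrast should shrink towards zero.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (max - min) / min over distances from one random query
    point to n_points uniformly random points in the unit hypercube.

    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast is large in low dimensions and small in high ones.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(500, dim), 3))
```

When nearest and farthest neighbours are almost equally distant, distance-based clustering and outlier scores computed in the full space carry little information, which is one motivation for working in low-dimensional subspaces.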
In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework, MAXCLUS, for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation, GCLUS, that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either "hard" or "fuzzy" clustering results as desired. In order to handle extremely high-dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier, GSEP, which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications.
In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers and on other data, the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining: classification through differences in variance rather than spatial location.
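The idea that classes can be separated by variance alone, given enough dimensions, can be sketched with synthetic data. The example below is not T*ENT, T*ROF or FASTOUT; it is a deliberately simple stand-in that assumes two Gaussian classes sharing the same centre and differing only in standard deviation (`sigma_a`, `sigma_b`, both hypothetical parameters). Points are ranked, without labels, by their distance to the overall centroid, and the ranking is scored with the area under the ROC curve (AUC).

```python
import math
import random


def auc_by_norm(dim, n_per_class=200, sigma_a=1.0, sigma_b=1.5, seed=0):
    """AUC obtained when points are ranked by distance to the data centroid.

    Both classes are centred at the origin, so they overlap completely in
    location and differ only in spread. Illustrative assumption, not the
    thesis's method."""
    rng = random.Random(seed)
    class_a = [[rng.gauss(0.0, sigma_a) for _ in range(dim)]
               for _ in range(n_per_class)]
    class_b = [[rng.gauss(0.0, sigma_b) for _ in range(dim)]
               for _ in range(n_per_class)]
    points = class_a + class_b
    # Unsupervised score: Euclidean distance to the centroid of all points.
    centroid = [sum(column) / len(points) for column in zip(*points)]
    scores_a = [math.dist(p, centroid) for p in class_a]
    scores_b = [math.dist(p, centroid) for p in class_b]
    # AUC = probability that a random class-B point outscores a class-A point.
    wins = sum(1 for sb in scores_b for sa in scores_a if sb > sa)
    return wins / (len(scores_a) * len(scores_b))


if __name__ == "__main__":
    for dim in (2, 20, 200):
        print(dim, round(auc_by_norm(dim), 3))
```

Under these assumptions the squared distance of a point from the centre concentrates around the dimensionality times the class variance, so the two distance distributions pull apart as dimensions are added even though the class centres coincide.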
As an appendix, we provide an algorithm for estimating false negative and positive rates so these can be compensated for.

Identifier: oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:AEU.10048/970
Date: 06 1900
Creators: Foss, Andrew
Contributors: Osmar Zaiane (Computing Science); Raymond Ng (Computer Science, University of British Columbia); Dale Schuurmans (Computing Science); Joerg Sander (Computing Science); Mauricio Sacchi (Physics)
Source Sets: Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language: English
Detected Language: English
Type: Thesis
Format: 1557343 bytes, application/pdf
Relation:
Andrew Foss, Sandra Zilles and Osmar R. Zaiane. Unsupervised Class Separation of Multivariate Data through Cumulative Variance-based Ranking. In Proceedings of IEEE International Conference on Data Mining ICDM09, pp. 139-148, 2009.
Andrew Foss and Osmar R. Zaiane. The Estimation of True and False Positive Rates in Higher Dimensional Problems and its Data Mining Applications. In Proc. of Foundations of Data Mining Workshop, in conjunction with IEEE International Conference on Data Mining ICDM'08, pp. 673-681, 2008.
Hongqin Fan, Osmar R. Zaiane, Andrew Foss and Junfeng Wu. Resolution-Based Outlier Factor: Detecting the Top-n Most Outlying Data Points in Engineering Data. Knowledge and Information Systems, An International Journal, pp. 31-51, 2009.
Hongqin Fan, Osmar R. Zaiane, Andrew Foss and Junfeng Wu. A Nonparametric Outlier Detection for Effectively Discovering Top-N outliers from Engineering Data. In Proc. of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD06), 2006.
Andrew Foss and Osmar R. Zaiane. Effective Subspace Clustering with Dimension Pairing. Technical Report TR06-23, University of Alberta, Oct 2006.
