
Extending the Growing Hierarchical Self Organizing Maps for a Large Mixed-Attribute Dataset Using Spark MapReduce

In this thesis, we propose a Map-Reduce variant of the Growing Hierarchical Self Organizing Map (GHSOM), called MR-GHSOM, which is capable of handling massive mixed-attribute datasets. The Self Organizing Map (SOM) has proved to be a useful unsupervised data analysis algorithm. It projects high-dimensional data onto a lower-dimensional grid of neurons. However, the SOM is limited by its static structure and its inability to mirror hierarchical relations in the data. The GHSOM overcomes these shortcomings by providing a dynamic structure that adapts its shape to the input data. It can grow in two ways: in the size of the individual neuron layers, to represent the data at the desired granularity, and in depth, to model the hierarchical relations in the data.

However, training the GHSOM requires multiple passes over the input dataset, which makes it difficult to apply to massive datasets. The MR-GHSOM addresses this limitation: it is implemented on the Apache Spark cluster computing engine and leverages the popular Map-Reduce programming model. This enables us to exploit the dynamic capabilities of the GHSOM even on large datasets.
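
As a rough illustration of how one training pass over a single SOM layer can be phrased as a map step followed by a reduce step, the sketch below uses the standard batch SOM update in plain Python. It is not the thesis's implementation and does not use Spark: the function names (`bmu`, `map_partition`, `combine`, `batch_epoch`), the Gaussian neighbourhood, and the list-of-partitions data layout are all assumptions made for this example.

```python
import math
from functools import reduce

def bmu(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def map_partition(weights, coords, sigma, partition):
    """Map step: neighbourhood-weighted sums for the records of one partition."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]   # per-neuron weighted sum of inputs
    den = [0.0] * len(weights)             # per-neuron sum of neighbourhood values
    for x in partition:
        c = coords[bmu(weights, x)]
        for j, pos in enumerate(coords):
            d2 = (pos[0] - c[0]) ** 2 + (pos[1] - c[1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood (assumed)
            den[j] += h
            for k in range(dim):
                num[j][k] += h * x[k]
    return num, den

def combine(a, b):
    """Reduce step: element-wise sum of two partial accumulators."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den

def batch_epoch(weights, coords, sigma, partitions):
    """One batch SOM epoch: map over partitions, reduce, then update every neuron."""
    partials = [map_partition(weights, coords, sigma, p) for p in partitions]
    num, den = reduce(combine, partials)
    return [[n / d for n in row] if d > 0 else list(w)
            for row, d, w in zip(num, den, weights)]
```

Because the partial sums are simply added together, the updated weights do not depend on how the records are split across partitions (up to floating-point rounding), which is the property that makes this formulation suitable for a distributed engine.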

Moreover, the conventional GHSOM algorithm can handle only datasets with numeric attributes, because it relies heavily on Euclidean-space dissimilarity measures between attribute vectors. The MR-GHSOM further extends the GHSOM to handle mixed-attribute datasets, that is, datasets containing both numeric and categorical attributes. It accomplishes this by adopting the distance hierarchy approach to managing mixed-attribute data.
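
To make the idea of a distance hierarchy concrete, the sketch below measures the distance between two categorical values as the weighted path length through their closest common ancestor in a concept tree, and combines it with range-normalised numeric differences. Everything specific here is an assumption for illustration: the `PARENT` tree, its link weights, the `schema` format, and the root-sum-of-squares combination are hypothetical and are not taken from the thesis. The sketch also covers only leaf values, not points lying partway along a link.

```python
import math

# Hypothetical concept hierarchy for a categorical attribute:
# child -> (parent, link weight). "Any" is the root and has no entry.
PARENT = {
    "Juice": ("Any", 1.0),
    "Carbonated": ("Any", 1.0),
    "Orange": ("Juice", 1.0),
    "Apple": ("Juice", 1.0),
    "Coke": ("Carbonated", 1.0),
    "Pepsi": ("Carbonated", 1.0),
}

def ancestors(node):
    """Map each ancestor of node (including node itself) to its distance from node."""
    dist, out = 0.0, {node: 0.0}
    while node in PARENT:
        node, weight = PARENT[node][0], PARENT[node][1]
        dist += weight
        out[node] = dist
    return out

def hierarchy_distance(a, b):
    """Weighted path length between two values via their closest common ancestor."""
    up_a = ancestors(a)
    node, dist_b = b, 0.0
    while node not in up_a:          # climb from b until the paths meet
        parent, weight = PARENT[node]
        dist_b += weight
        node = parent
    return up_a[node] + dist_b

def mixed_distance(x, y, schema):
    """Distance between two mixed records; each attribute is scaled to [0, 1].

    schema entries (assumed format): ("num", lo, hi) or ("cat", max_path_length).
    """
    total = 0.0
    for xv, yv, spec in zip(x, y, schema):
        if spec[0] == "num":
            d = abs(xv - yv) / (spec[2] - spec[1])
        else:
            d = hierarchy_distance(xv, yv) / spec[1]
        total += d * d
    return math.sqrt(total)
```

With this tree, "Orange" and "Apple" are closer to each other than "Orange" and "Coke", which a plain equal-or-not-equal comparison of category labels could not express.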

The proposed MR-GHSOM is thus capable of handling massive datasets containing mixed attributes. To demonstrate its effectiveness at clustering mixed-attribute datasets, we present the results it produces on some popular datasets. We further train the MR-GHSOM on a Census dataset containing mixed attributes and provide an analysis of the results.

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/33385
Date: January 2015
Creators: Malondkar, Ameya Mohan
Contributors: Japkowicz, Nathalie; Kiringa, Iluju
Publisher: Université d'Ottawa / University of Ottawa
Source Sets: Université d’Ottawa
Language: English
Detected Language: English
Type: Thesis
