Return to search

Similarity Measures for Nominal Data in Hierarchical Clustering / Míry podobnosti pro nominální data v hierarchickém shlukování

This dissertation thesis deals with similarity measures for nominal data in hierarchical clustering, which can cope with variables with more than two categories, and which aspire to replace the simple matching approach standardly used in this area. These similarity measures take into account additional characteristics of a dataset, such as frequency distribution of categories or number of categories of a given variable. The thesis recognizes three main aims. The first one is an examination and clustering performance evaluation of selected similarity measures for nominal data in hierarchical clustering of objects and variables. To achieve this goal, four experiments dealing both with the object and variable clustering were performed. They examine the clustering quality of the examined similarity measures for nominal data in comparison with the commonly used similarity measures using a binary transformation, and moreover, with several alternative methods for nominal data clustering. The comparison and evaluation are performed on real and generated datasets. Outputs of these experiments lead to knowledge, which similarity measures can generally be used, which ones perform well in a particular situation, and which ones are not recommended to use for an object or variable clustering. The second aim is to propose a theory-based similarity measure, evaluate its properties, and compare it with the other examined similarity measures. Based on this aim, two novel similarity measures, Variable Entropy and Variable Mutability are proposed; especially, the former one performs very well in datasets with a lower number of variables. The third aim of this thesis is to provide a convenient software implementation based on the examined similarity measures for nominal data, which covers the whole clustering process from a computation of a proximity matrix to evaluation of resulting clusters. This goal was also achieved by creating the nomclust package for the software R, which covers this issue, and which is freely available.

Identiferoai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:261939
Date January 2013
CreatorsŠulc, Zdeněk
ContributorsŘezanková, Hana, Šimůnek, Milan, Žambochová, Marta
PublisherVysoká škola ekonomická v Praze
Source SetsCzech ETDs
LanguageEnglish
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/doctoralThesis
Rightsinfo:eu-repo/semantics/restrictedAccess

Page generated in 0.0019 seconds