Similarity Measures for Nominal Data in Hierarchical Clustering / Míry podobnosti pro nominální data v hierarchickém shlukování

This dissertation deals with similarity measures for nominal data in hierarchical clustering. The measures can handle variables with more than two categories and are intended to replace the simple matching approach commonly used in this area. They take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The thesis has three main aims. The first is to examine selected similarity measures for nominal data and evaluate their clustering performance in hierarchical clustering of objects and variables. To this end, four experiments covering both object and variable clustering were performed. They compare the clustering quality of the examined measures with that of commonly used similarity measures based on a binary transformation and with several alternative methods for nominal data clustering. The comparison and evaluation are performed on real and generated datasets. The results show which similarity measures can be used generally, which perform well in particular situations, and which are not recommended for object or variable clustering. The second aim is to propose a theory-based similarity measure, evaluate its properties, and compare it with the other examined measures. Two novel similarity measures, Variable Entropy and Variable Mutability, are proposed; the former performs especially well on datasets with a lower number of variables. The third aim is to provide a convenient software implementation of the examined similarity measures that covers the whole clustering process, from computation of a proximity matrix to evaluation of the resulting clusters. This aim was achieved by creating the nomclust package for R, which is freely available.
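To make the contrast in the abstract concrete, the following is a minimal Python sketch, not taken from the thesis or from the nomclust package. It compares plain simple matching with a hypothetical entropy-weighted variant in which a match on a variable counts more when that variable's categories are more evenly distributed. The function names, the weighting scheme, and the toy dataset are all assumptions made for illustration; the thesis's own Variable Entropy and Variable Mutability measures may be defined differently.

```python
import math
from collections import Counter


def simple_matching(x, y):
    """Share of variables on which two objects have the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def normalized_entropy(column):
    """Entropy of a variable's category frequencies, scaled to [0, 1]."""
    counts = Counter(column)
    k = len(counts)
    if k < 2:
        return 0.0  # a constant variable carries no information
    n = len(column)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


def entropy_weighted(x, y, weights):
    """Hypothetical measure: simple matching with per-variable weights."""
    total = sum(weights)
    if total == 0:
        return 1.0  # every variable is constant, so all objects coincide
    return sum(w for a, b, w in zip(x, y, weights) if a == b) / total


def dissimilarity_matrix(data, similarity):
    """Proximity matrix as 1 - similarity for every pair of objects."""
    return [[1.0 - similarity(r, s) for s in data] for r in data]


# Toy dataset (invented): four objects described by three nominal variables.
data = [
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("blue", "small", "yes"),
    ("blue", "large", "no"),
]

# One weight per variable, derived from its category frequency distribution.
weights = [normalized_entropy(col) for col in zip(*data)]

sm = dissimilarity_matrix(data, simple_matching)
ew = dissimilarity_matrix(data, lambda r, s: entropy_weighted(r, s, weights))
```

Either matrix could then be passed to any hierarchical clustering routine that accepts precomputed dissimilarities. In this sketch the third variable is unevenly distributed, so it receives a weight below 1 and a match on it contributes less than a match on the other two variables.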

http://www.nusl.cz/ntk/nusl-261939

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:261939
Date	January 2013
Creators	Šulc, Zdeněk
Contributors	Řezanková, Hana, Šimůnek, Milan, Žambochová, Marta
Publisher	Vysoká škola ekonomická v Praze
Source Sets	Czech ETDs
Language	English
Detected Language	English
Type	info:eu-repo/semantics/doctoralThesis
Rights	info:eu-repo/semantics/restrictedAccess
