Global ETD Search

Klasifikace elektronických dokumentů s využitím shlukové analýzy / Classification of electronic documents using cluster analysis

The current age is characterised by unprecedented growth of information, in both volume and complexity. Most of it is available in digital form, so it can be analysed using cluster analysis. We tried to classify the documents of the 20 Newsgroups collection by their content alone. The aim was to assess the available clustering methods in a variety of applications. After transforming the documents into a binary vector representation, we performed several experiments in the CLUTO application and measured entropy, purity and execution time. For a small number of clusters the direct method (generally hierarchical method) gave the best results, while for a larger number of clusters repeated bisection (a divisive method) performed best. The agglomerative method proved unsuitable. Using simulation, we estimated the optimal number of clusters to be 10. For this solution we described the features of each cluster in detail, using the repeated bisection method and the i2 criterion function. Future work should focus on implementing binary clustering in programming languages such as Perl or C++. The results of this work might be of interest to web search engine developers and electronic catalogue administrators.
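
To make the evaluation measures named in the abstract concrete, the following is a minimal Python sketch of a binary vector representation and of the purity and entropy of a clustering. It is not the thesis's code and does not call CLUTO. The function names, the whitespace tokenisation and the toy data are hypothetical. The formulas follow the commonly used size-weighted definitions, with entropy normalised by the logarithm of the number of classes, and it is an assumption that the thesis computed them in exactly this form.

```python
import math
from collections import Counter


def binary_vectors(docs):
    """Map each document to a 0/1 vector over the sorted vocabulary.

    Tokenisation by whitespace is an illustrative assumption.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*token_sets)) if token_sets else []
    vectors = [[1 if w in tokens else 0 for w in vocab] for tokens in token_sets]
    return vocab, vectors


def _group_labels(clusters, labels):
    """Collect the true class labels of the documents in each cluster."""
    groups = {}
    for cluster_id, label in zip(clusters, labels):
        groups.setdefault(cluster_id, []).append(label)
    return groups


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    groups = _group_labels(clusters, labels)
    return sum(max(Counter(g).values()) for g in groups.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted class entropy per cluster, normalised to the range [0, 1]."""
    groups = _group_labels(clusters, labels)
    n_classes = len(set(labels))
    norm = math.log(n_classes) if n_classes > 1 else 1.0
    total = 0.0
    for g in groups.values():
        h = 0.0
        for count in Counter(g).values():
            p = count / len(g)
            h -= p * math.log(p)
        total += (len(g) / len(labels)) * (h / norm)
    return total


# Toy data: six documents, two clusters, two true classes (hypothetical).
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
toy_purity = purity(toy_clusters, toy_labels)
toy_entropy = entropy(toy_clusters, toy_labels)
```

Higher purity and lower entropy both indicate clusters that agree better with the known newsgroup labels; a clustering that reproduces the classes exactly has purity 1 and entropy 0.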

http://www.nusl.cz/ntk/nusl-17157

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:17157
Date	January 2009
Creators	Ševčík, Radim
Contributors	Řezanková, Hana; Svátek, Vojtěch
Publisher	Vysoká škola ekonomická v Praze
Source Sets	Czech ETDs
Language	Czech
Detected Language	English
Type	info:eu-repo/semantics/masterThesis
Rights	info:eu-repo/semantics/restrictedAccess
