
Individualiai klasifikuotų dokumentų klasterizavimo metodas / Clustering Method for Personally Classified Documents

Traditional clustering methods, in which documents are represented by term frequency vectors, are poorly suited to clustering Lithuanian documents because no freely available morphological analyzer or stemmer exists for building compact term dictionaries. Lithuanian documents can still be clustered using loose term dictionaries, but because Lithuanian is a highly synthetic language, this comes at the cost of a significant increase in resource use and possibly inaccurate or distorted results. This master thesis develops a clustering method for personally classified documents that overcomes these shortcomings. In the new method, documents are represented by tag frequency vectors, pairwise similarities are measured by the cosine coefficient, and clustering is performed with the bisecting K-means algorithm, which was selected experimentally. Experiments comparing the developed method with traditional document clustering based on a loose term dictionary showed that the former copes better with large document collections and/or large numbers of clusters. At the same time, a subjective evaluation of the clustering showed that even when the new method yields higher entropy and lower purity values, it still outperforms the traditional method in how meaningful the resulting clusters are.
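
The pipeline named in the abstract (tag frequency vectors, cosine coefficient, bisecting K-means) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the deterministic seeding, the choice to always split the largest cluster, and the fixed iteration count are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine coefficient between two sparse tag frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def bisect(cluster, vectors, iterations=10):
    """Split one cluster (a list of document indices) in two with 2-means."""
    # Assumed seeding: the first document and the one least similar to it.
    first = cluster[0]
    second = min(cluster[1:], key=lambda i: cosine(vectors[first], vectors[i]))
    centers = [vectors[first], vectors[second]]
    parts = [list(cluster), []]
    for _ in range(iterations):
        parts = [[], []]
        for i in cluster:
            sims = [cosine(vectors[i], c) for c in centers]
            parts[0 if sims[0] >= sims[1] else 1].append(i)
        if not parts[0] or not parts[1]:
            break
        centers = [centroid([vectors[i] for i in p]) for p in parts]
    return parts


def bisecting_kmeans(vectors, k):
    """Repeatedly split the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = bisect(largest, vectors)
        if not left or not right:
            # The cluster could not be separated; stop splitting.
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters


# Hypothetical personally assigned tags, one list per document.
tagged_docs = [
    ["python", "code"],
    ["python", "code", "tutorial"],
    ["recipe", "food"],
    ["food", "recipe", "dinner"],
]
tag_vectors = [Counter(tags) for tags in tagged_docs]
clusters = bisecting_kmeans(tag_vectors, k=2)
```

Because the vectors are built from tags rather than from word forms, no stemming or morphological analysis is involved in this sketch, which is the property the abstract relies on for Lithuanian text.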

Identifier: oai:union.ndltd.org:LABT_ETD/oai:elaba.lt:LT-eLABa-0001:E.02~2006~D_20060522_143851-15319
Date: 22 May 2006
Creators: Žalinauskas, Marius
Contributors: Šeinauskas, Rimantas, Motiejūnas, Kęstutis, Kazanavičius, Egidijus, Butleris, Rimantas, Karčiauskas, Eimutis, Bareiša, Eduardas, Tomkevičius, Arūnas, Štuikys, Vytautas, Stulpinas, Raimundas, Kaunas University of Technology
Publisher: Lithuanian Academic Libraries Network (LABT), Kaunas University of Technology
Source Sets: Lithuanian ETD submission system
Language: Lithuanian
Detected Language: English
Type: Master thesis
Format: application/pdf
Source: http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2006~D_20060522_143851-15319
Rights: Unrestricted
