Improving Search Result Clustering By Integrating Semantic Information From Wikipedia

Suffix Tree Clustering (STC) is a search result clustering (SRC) algorithm designed to generate overlapping clusters with meaningful labels in linear time. It demonstrated the feasibility of SRC, but subsequent studies introduced description-first algorithms that generate better labels and achieve higher precision. Even so, STC has remained the fastest SRC algorithm, and several studies have addressed its various shortcomings. In this thesis, semantic relations between cluster labels and documents are exploited to filter out noisy labels and to improve the merging phase of STC. Wikipedia is used to identify these relations, and methods for integrating semantic information into STC are proposed. Semantic features are shown to be effective for the SRC task when used together with term frequency vectors. Furthermore, no SRC studies had previously been conducted on Turkish; this thesis introduces a Turkish dataset and tests a number of methods on it.

Identifier: oai:union.ndltd.org:METU/oai:etd.lib.metu.edu.tr:http://etd.lib.metu.edu.tr/upload/12612554/index.pdf
Date: 01 September 2010
Creators: Calli, Cagatay
Contributors: Ucoluk, Gokturk
Publisher: METU
Source Sets: Middle East Technical Univ.
Language: English
Detected Language: English
Type: M.S. Thesis
Format: text/pdf
Rights: To liberate the content for public access
