
A Document Similarity Measure and Its Applications

In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure distinguishes three cases: (a) the feature considered appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. In the first case, the contribution has a lower bound, and the similarity decreases as the difference between the two documents' feature values grows. In the second case, the contribution is a fixed value, regardless of the magnitude of the feature value. In the last case, the feature makes no contribution to the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN. We also extend these properties to measure the similarity between a document and a set of documents for document clustering with a k-means-like algorithm, and compare its effectiveness with that of other measures. Experimental results show that the proposed measure works more effectively than the others.
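
The three cases can be illustrated with a minimal sketch. The abstract does not give the formula, so everything specific here is an assumption made for illustration and is not the thesis's actual definition: the function name `three_case_similarity`, the parameters `lower_bound`, `penalty`, and `scale`, the Gaussian-shaped decay, and the averaging over contributing features.

```python
import math


def three_case_similarity(d1, d2, lower_bound=0.5, penalty=-1.0, scale=1.0):
    """Illustrative three-case similarity between two feature vectors.

    Hypothetical sketch only: the decay shape, the parameter names and
    the normalisation are assumptions, not the measure from the thesis.
    """
    if len(d1) != len(d2):
        raise ValueError("documents must share one feature space")

    total = 0.0
    counted = 0
    for a, b in zip(d1, d2):
        if a != 0 and b != 0:
            # Case (a): feature in both documents. The contribution starts
            # at 1.0 for equal values and decays towards lower_bound as
            # the difference between the values grows.
            gap = (a - b) / scale
            total += lower_bound + (1.0 - lower_bound) * math.exp(-gap * gap)
            counted += 1
        elif a != 0 or b != 0:
            # Case (b): feature in only one document. A fixed value is
            # used, whatever the magnitude of the non-zero feature.
            total += penalty
            counted += 1
        # Case (c): feature in neither document. It is skipped entirely.

    # Average over the features that contributed (assumed normalisation).
    return total / counted if counted else 0.0
```

Under these assumptions the score lies between `penalty` and 1.0. For instance, `three_case_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])` evaluates to 1.0, and adding further features that are absent from both documents leaves the score unchanged. A function of this shape could be plugged into a k-NN classifier or a k-means-like clustering loop wherever a pairwise similarity is needed.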

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0907111-062138
Date: 07 September 2011
Creators: Gan, Zih-Dian
Contributors: Shie-Jue Lee, Chun-Liang Hou, Chen-Sen Ouyang, Hsien-Liang Tsai, Chih-Hung Wu
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: Cholon
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0907111-062138
Rights: user_define, Copyright information available at source archive
