Global ETD Search

A Document Similarity Measure and Its Applications

In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure considers three cases for each feature: (a) the feature appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. In the first case, the similarity has a lower bound and decreases as the difference between the two documents' feature values grows. In the second case, the contribution is a fixed value, regardless of the magnitude of the feature value. In the last case, the feature has no effect on the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN. We also extend it to measure the similarity between a document and a set of documents, which lets us use it in a k-means-like clustering algorithm, and we compare its effectiveness with that of other measures. Experimental results show that the proposed measure performs better than the other measures compared.
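The three cases described above can be illustrated with a minimal Python sketch. The abstract does not give the formula, so everything specific here is an assumption made for illustration: the Gaussian-shaped decay, the parameter names `lower`, `sigma`, and `penalty`, their default values, and the averaging over features that appear in at least one document. It is not the thesis's actual definition.

```python
import math


def feature_score(x, y, lower=0.5, sigma=1.0, penalty=1.0):
    """Score one feature; returns (score, counted).

    All parameters are hypothetical choices for this sketch:
    lower   - assumed lower bound when the feature is in both documents
    sigma   - assumed scale of the decay with the value difference
    penalty - assumed fixed cost when the feature is in only one document
    """
    if x > 0 and y > 0:
        # Case (a): present in both. Equals 1 when the values match and
        # decays toward `lower` as the difference grows.
        decay = math.exp(-((x - y) / sigma) ** 2)
        return lower + (1.0 - lower) * decay, True
    if x > 0 or y > 0:
        # Case (b): present in only one. Fixed value, magnitude ignored.
        return -penalty, True
    # Case (c): present in neither. The feature is not counted at all.
    return 0.0, False


def similarity(doc_a, doc_b, **params):
    """Average the per-feature scores over the features that are counted."""
    if len(doc_a) != len(doc_b):
        raise ValueError("documents must have the same number of features")
    total, counted = 0.0, 0
    for x, y in zip(doc_a, doc_b):
        score, used = feature_score(x, y, **params)
        if used:
            total += score
            counted += 1
    # Two documents with no features at all: treated as 0.0 in this sketch.
    return total / counted if counted else 0.0
```

With these assumed defaults the score lies between `-penalty` and 1. A value like this could be passed to a k-NN classifier as the neighbor-ranking function in place of cosine similarity, for example.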

http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0907111-062138

Identifier	oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0907111-062138
Date	07 September 2011
Creators	Gan, Zih-Dian
Contributors	Shie-Jue Lee, Chun-Liang Hou, Chen-Sen Ouyang, Hsien-Liang Tsai, Chih-Hung Wu
Publisher	NSYSU
Source Sets	NSYSU Electronic Thesis and Dissertation Archive
Language	Cholon
Detected Language	English
Type	text
Format	application/pdf
Source	http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0907111-062138
Rights	user_define, Copyright information available at source archive
