In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure takes three cases into account: (a) The feature considered appears in both documents, (b) the feature considered appears in only one document, and (c) the feature considered appears in none of the documents. For the first case, we give a lower bound and decrease the similarity according to the difference between the feature values of the two documents. For the second case, we give a fixed value disregarding the magnitude of the feature value. For the last case, we ignore its effectiveness. We apply it to the similarity based single-label classifier k-NN and multi-label classifier ML-KNN, and adopt these properties to measure the similarity between a document and a specific set for document clustering, i.e., k-means like algorithm, to compare the effectiveness with other measures. Experimental results show that our proposed method can work more effectively than others.
Identifer | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0907111-062138 |
Date | 07 September 2011 |
Creators | Gan, Zih-Dian |
Contributors | Shie-Jue Lee, Chun-Liang Hou, Chen-Sen Ouyang, Hsien-Liang Tsai, Chih-Hung Wu |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | Cholon |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0907111-062138 |
Rights | user_define, Copyright information available at source archive |
Page generated in 0.0048 seconds