Efficiently finding near-duplicate documents in a large document collection has become an important problem. In this paper, we propose a new method for detecting near-duplicate documents in large-scale datasets. The method is divided into three parts: feature selection, similarity measurement, and discriminant derivation. In feature selection, each document is first preprocessed to remove symbols, stop words, and so on. We then measure the weight of each term within its sentence and choose the terms with the highest weights in that sentence. These terms are collected as features of the document, and together they make up the document's feature set. Similarity measurement uses a similarity function to compute the similarity value between two feature sets. Discriminant derivation uses a support vector machine (SVM), a supervised learning method, to train a classifier from training patterns; the classifier identifies whether a document is a near-duplicate or not. Given the characteristics of documents, sentence-level features are more effective than term-level features. In addition, learning the discriminant with an SVM avoids the trial-and-error effort that conventional methods require to find a threshold, that is, a discriminant value that defines the relation between documents. Our experimental results show that the proposed method detects near-duplicate documents more effectively than other methods.
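The first two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not name the term-weighting scheme, the similarity function, or the stop-word list, so everything below is an assumption made for illustration only: a TF-IDF-style weight computed over sentences, Jaccard similarity between feature sets, a tiny hypothetical `STOP_WORDS` set, and the hypothetical helper names `document_features` and `jaccard`.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not specify one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}


def split_sentences(text):
    """Split text into sentences on ., ! and ? (a deliberately crude rule)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase, drop punctuation symbols, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def document_features(text, top_k=2):
    """Collect the top_k highest-weighted terms of every sentence.

    Assumed weight: term frequency in the sentence times a smoothed
    inverse sentence frequency. The thesis may use a different weight.
    """
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    n = len(sentences)
    # Number of sentences in which each term occurs.
    sentence_freq = Counter(t for s in sentences for t in set(s))
    features = set()
    for s in sentences:
        tf = Counter(s)
        weight = {
            t: tf[t] * (math.log((1 + n) / (1 + sentence_freq[t])) + 1.0)
            for t in tf
        }
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weight, key=lambda t: (-weight[t], t))
        features.update(ranked[:top_k])
    return features


def jaccard(a, b):
    """Jaccard similarity of two feature sets (an assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "The quick brown fox jumps over the lazy dog. The dog sleeps in the sun."
doc_b = "The quick brown fox jumped over the lazy dog. The dog sleeps in the sun."
doc_c = "Support vector machines learn a separating hyperplane. Training needs labeled examples."

features_a = document_features(doc_a)
features_b = document_features(doc_b)
features_c = document_features(doc_c)

print(jaccard(features_a, features_b))  # near-duplicate pair
print(jaccard(features_a, features_c))  # unrelated pair
```

In the method described above, similarity values such as these would be given to an SVM classifier trained on labeled document pairs, rather than compared against a hand-tuned threshold. That training step is not shown here.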
Identifier | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-1023112-100138 |
Date | 23 October 2012 |
Creators | Liao, Ting-Yi |
Contributors | Chen-Sen Ouyang, Chun-Liang Hou, Shie-Jue Lee, Tsung-Chuan Huang |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | Cholon |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-1023112-100138 |
Rights | user_define, Copyright information available at source archive |