Detecting Near-Duplicate Documents using Sentence-Level Features and Machine Learning

Finding near-duplicate documents efficiently in large document collections is an important problem. In this paper, we propose a new method for detecting near-duplicate documents in large-scale datasets. The method has three parts: feature selection, similarity measurement, and discriminant derivation. In feature selection, each document is first preprocessed to remove symbols, stop words, and so on. We then compute the weight of each term within its sentence and select the terms with the highest weights in that sentence. The selected terms serve as features of the document, and together they form the document's feature set. Similarity measurement uses a similarity function to compute the similarity value between two feature sets. Discriminant derivation uses a support vector machine (SVM), a supervised learning method, to train a classifier from training patterns; the classifier identifies whether a document is a near-duplicate or not. For characterizing documents, sentence-level features are more effective than term-level features. In addition, learning the discriminant with an SVM avoids the trial-and-error effort that conventional methods require to find a threshold, that is, a discriminant value that defines the relation between documents. Our experiments show that the proposed method detects near-duplicate documents more effectively than other methods.
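The first two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the term-weighting formula, the similarity function, or the stop-word list, so the sketch assumes an in-sentence frequency times a smoothed inverse sentence frequency as the weight, Jaccard similarity as the similarity function, and a tiny hypothetical stop-word list. The names `STOP_WORDS`, `split_sentences`, `document_features`, `jaccard`, and the parameter `top_k` are all illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "on", "over", "for"}


def split_sentences(text):
    """Split text on '.', '!' or '?' and return one token list per sentence.

    Tokens are lowercased alphanumeric runs with stop words removed.
    """
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"[a-z0-9]+", raw.lower())
                  if t not in STOP_WORDS]
        if tokens:
            sentences.append(tokens)
    return sentences


def document_features(text, top_k=2):
    """Collect the top_k highest-weighted terms of every sentence into one set."""
    sentences = split_sentences(text)
    n = len(sentences)
    # Number of sentences in which each term occurs.
    sent_freq = Counter(term for s in sentences for term in set(s))
    features = set()
    for s in sentences:
        counts = Counter(s)
        # Assumed weight: in-sentence frequency * smoothed inverse sentence frequency.
        weights = {
            t: c * (math.log((1 + n) / (1 + sent_freq[t])) + 1.0)
            for t, c in counts.items()
        }
        # Highest weight first; ties are broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        features.update(ranked[:top_k])
    return features


def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = ("The quick brown fox jumps over the lazy dog. "
         "Near-duplicate detection matters for web crawlers.")
doc_b = ("The quick brown fox jumps over the lazy cat. "
         "Near-duplicate detection matters for web crawlers.")
doc_c = "Support vector machines learn a separating hyperplane."

print(jaccard(document_features(doc_a), document_features(doc_b)))
print(jaccard(document_features(doc_a), document_features(doc_c)))
```

In the method described above, similarity values like these would not be compared against a hand-tuned threshold. They would be used as training patterns for an SVM classifier, which learns the decision boundary from labeled near-duplicate and non-duplicate pairs. How the similarity values are arranged into SVM input vectors is not stated in the abstract.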

http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-1023112-100138

Near-duplicate

threshold

trial-and-error

support vector machine

feature selection

stop words

similarity function

Identifier	oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-1023112-100138
Date	23 October 2012
Creators	Liao, Ting-Yi
Contributors	Chen-Sen Ouyang, Chun-Liang Hou, Shie-Jue Lee, Tsung-Chuan Huang
Publisher	NSYSU
Source Sets	NSYSU Electronic Thesis and Dissertation Archive
Language	Cholon
Detected Language	English
Type	text
Format	application/pdf
Source	http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-1023112-100138
Rights	user_define, Copyright information available at source archive
