Global ETD Search


Near-Duplicate Detection Using Instance Level Constraints

For near-duplicate document detection, the bag-of-words comparison approaches used in the information retrieval community are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and the matching documents must be retrieved for a new query document. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and finally documents are ranked for a new, unknown document using the learned distance metrics. Experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
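As a rough illustration of the retrieval stage described above, the following minimal sketch routes a query document to a cluster and ranks that cluster's documents with a cluster-specific distance. It is not the thesis implementation. The clusters are hand-assigned rather than produced by gLDA, the per-cluster weights are made-up diagonal weights rather than a metric learned by large margin nearest neighbor, and all function names, documents, and values are hypothetical.

```python
from collections import Counter
import math

def bow(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def centroid(docs, vocab):
    """Mean bag-of-words vector of a group of documents."""
    vecs = [bow(d, vocab) for d in docs]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def weighted_dist(x, y, weights):
    """Diagonal weighted distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def rank_near_duplicates(query, clusters, vocab):
    """Pick the cluster whose centroid is closest to the query (unweighted),
    then rank that cluster's documents using the cluster's own weights."""
    q = bow(query, vocab)
    uniform = [1.0] * len(vocab)
    best = min(clusters, key=lambda c: weighted_dist(q, c["centroid"], uniform))
    scored = [(weighted_dist(q, bow(d, vocab), best["weights"]), d)
              for d in best["docs"]]
    return [d for _, d in sorted(scored)]

# Hypothetical toy data: two hand-made clusters of short bug-report titles.
vocab = ["crash", "login", "button", "slow", "page"]
groups = [
    # Assumed weights: stand-ins for a learned per-cluster metric.
    (["crash on login button", "login page crash"], [2.0, 1.0, 1.0, 1.0, 0.25]),
    (["page slow", "slow slow page"], [1.0, 1.0, 1.0, 2.0, 0.5]),
]
clusters = [
    {"docs": docs, "centroid": centroid(docs, vocab), "weights": weights}
    for docs, weights in groups
]

ranked = rank_near_duplicates("login crash", clusters, vocab)
print(ranked)
```

In this toy setup the first cluster's weights down-weight the word "page", so a document that differs from the query only in that word ranks ahead of one that differs in "button". A learned metric would play the same role, with weights fitted to the instance-level constraints instead of chosen by hand.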

Latent Dirichlet Allocation

Information Retrieval

Near-Duplicate Detection

Constrained Clustering

Group LDA

Duplicate Bug Report Detection

Near-Duplicate Document Detection

Computer Science

Identifier	oai:union.ndltd.org:IISc/oai:etd.ncsi.iisc.ernet.in:2005/1346
Date	08 1900
Creators	Patel, Vishal
Contributors	Bhattacharyya, Chiranjib
Source Sets	India Institute of Science
Language	en_US
Detected Language	English
Type	Thesis
Relation	G23536


