Global ETD Search

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
The implementation of near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, entailing careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
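
As a rough illustration of the shingling approach the abstract refers to, the following Python sketch builds word-level shingles for two documents and compares them with the Jaccard coefficient. It is not the thesis implementation: the shingle length k, the use of SHA-256 as the hash, and the "keep the s smallest hashes" sampling rule are all assumptions made for this example, and the abstract does not say which two sampling techniques the thesis actually evaluates. The sketch also runs on a single machine rather than under MapReduce.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word-level k-shingles (contiguous k-word windows)."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def shingle_hash(shingle):
    """Map a shingle to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.sha256(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_smallest(shingle_set, s=4):
    """Hypothetical sampling rule: keep the s smallest shingle hashes."""
    return set(sorted(shingle_hash(sh) for sh in shingle_set)[:s])


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

# Exact resemblance over the full shingle sets: 6 shared of 8 total.
full = jaccard(shingles(doc_a), shingles(doc_b))
print(full)  # 0.75

# Estimated resemblance from the sampled hashes only.
sampled = jaccard(sample_smallest(shingles(doc_a)),
                  sample_smallest(shingles(doc_b)))
print(sampled)
```

Sampling reduces the amount of data compared per document pair, at the price of a noisier similarity estimate, which is one form of the efficiency and effectiveness trade-off the abstract describes.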

http://hdl.handle.net/10012/5750

near-duplicate detection

MapReduce

shingles

Computer Science

Identifier	oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/5750
Date	18 January 2011
Creators	Khoshdel Nikkhoo, Hani
Source Sets	University of Waterloo Electronic Theses Repository
Language	English
Detected Language	English
Type	Thesis or Dissertation
