The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Because near-duplicate detection requires pairwise comparisons between
documents, it is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
Implementing a near-duplicate detection algorithm forces trade-offs
between efficiency and effectiveness, so careful testing and
measurement are needed to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
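
To make the shingling idea mentioned above concrete, the following is a minimal sketch, not the thesis implementation. It assumes word-level shingles of width w = 4 and a "keep the s smallest hashes" sampling rule; the abstract does not name the two sampling techniques studied, so this rule and every function name and parameter below are illustrative assumptions. It runs on a single machine and does not attempt to reproduce the MapReduce implementation.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Documents shorter than one window yield a single short shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_hash(shingle):
    """Map a shingle to a stable 64-bit integer via SHA-1."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(shingle_set, s=8):
    """Assumed sampling rule: keep only the s smallest shingle hashes."""
    return set(sorted(shingle_hash(x) for x in shingle_set)[:s])


def resemblance(a, b):
    """Exact Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def estimate_resemblance(sketch_a, sketch_b, s=8):
    """Estimate the Jaccard coefficient from two smallest-s sketches.

    Take the s smallest hashes of the combined sketches and count the
    fraction that appears in both sketches.
    """
    smallest = set(sorted(sketch_a | sketch_b)[:s])
    if not smallest:
        return 1.0
    return len(smallest & sketch_a & sketch_b) / len(smallest)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old bridge"
exact = resemblance(shingles(doc_a), shingles(doc_b))
approx = estimate_resemblance(sketch(shingles(doc_a)), sketch(shingles(doc_b)))
print(f"exact={exact:.3f} sampled estimate={approx:.3f}")
```

Sampling trades accuracy for cost: each document is reduced to at most s hashes, so a pairwise comparison touches a small fixed-size sketch instead of the full shingle set.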

Identifier: oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/5750
Date: 18 January 2011
Creators: Khoshdel Nikkhoo, Hani
Source Sets: University of Waterloo Electronic Theses Repository
Language: English
Detected Language: English
Type: Thesis or Dissertation
