Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Because near-duplicate detection requires pairwise comparisons between
documents, the number of comparisons grows quadratically with collection
size, making the process extremely costly in time and processing power.
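To make the scale concrete (the collection sizes here are hypothetical,
not figures from the thesis), a naive all-pairs comparison over n documents
requires n(n-1)/2 comparisons:

    # Illustrative only: growth of naive all-pairs comparison cost.
    # Collection sizes are hypothetical, not drawn from the thesis.
    for n in (10_000, 1_000_000, 100_000_000):
        pairs = n * (n - 1) // 2
        print(f"{n:>11,} documents -> {pairs:,} comparisons")

At web scale this quickly becomes intractable, which is what motivates
sampling and distributed processing.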
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
Implementing near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, requiring careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
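As a minimal sketch of standard shingling, assuming w-word shingles
compared by Jaccard resemblance (the shingle width, hash function, and
example documents below are illustrative choices, not parameters reported
in the thesis):

    # Minimal sketch: w-word shingles hashed to integers, compared by
    # Jaccard resemblance. Parameter choices here are illustrative.
    import hashlib

    def shingles(text, w=4):
        """Return the set of hashed w-word shingles of a document."""
        words = text.split()
        return {
            int.from_bytes(
                hashlib.sha1(" ".join(words[i:i + w]).encode()).digest()[:8],
                "big",
            )
            for i in range(max(len(words) - w + 1, 1))
        }

    def jaccard(a, b):
        """Jaccard resemblance of two shingle sets."""
        return len(a & b) / len(a | b) if a | b else 1.0

    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox leaps over the lazy dog"
    print(f"resemblance = {jaccard(shingles(d1), shingles(d2)):.2f}")

Under MapReduce, one common pattern (the abstract does not say whether the
thesis uses exactly this one) is for the map phase to emit
(shingle, document-id) pairs so the reduce phase only compares documents
that share at least one shingle, sidestepping the full all-pairs scan.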
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
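The abstract does not name the two sampling techniques it compares; as a
hedged stand-in, two widely used schemes from the shingling literature are
a fixed-size "min-s" sketch and "0 mod p" modulo sampling (parameter values
below are illustrative):

    # Hedged sketch of two common shingle-sampling schemes; the thesis
    # abstract does not identify its two techniques, so these are
    # plausible examples, not the author's method.

    def sample_min_s(shingle_hashes, s=84):
        """Fixed-size sketch: keep the s numerically smallest hashes."""
        return set(sorted(shingle_hashes)[:s])

    def sample_mod_p(shingle_hashes, p=25):
        """'0 mod p' sampling: keep hashes divisible by p (size varies)."""
        return {h for h in shingle_hashes if h % p == 0}

The usual trade-off: a fixed-size sketch bounds per-document storage and
comparison cost, while modulo sampling keeps the sample proportional to
document length, retaining more evidence for long documents.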
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/5750
Date | 18 January 2011
Creators | Khoshdel Nikkhoo, Hani
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language | English
Type | Thesis or Dissertation