Global ETD Search

Probabilistic Simhash Matching

Finding near-duplicate documents is an interesting problem, but existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that find near-duplicates while improving query performance and using less memory. We evaluated our method on a dataset of 70M web documents. With the dataset in memory, our method achieved a reduction in space by a factor of 5 while improving query time by a factor of 4, at a recall of 0.95, when finding all near-duplicates. With the same recall and the same reduction in space, it improved query time by a factor of 4.5 when finding the first near-duplicate in an in-memory dataset. When the dataset was stored on disk, it improved performance by a factor of 7 when finding all near-duplicates and by a factor of 14 when finding the first near-duplicate.

http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813

Identifier	oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/ETD-TAMU-2011-08-9813
Date	2011 August
Creators	Sood, Sadhan
Contributors	Loguinov, Dmitri
Source Sets	Texas A and M University
Language	en_US
Detected Language	English
Type	thesis, text
Format	application/pdf
