Detekce duplicit v rozsáhlých webových bázích dat / Detection of Duplicates in Huge Web Databases

This master's thesis analyses methods for detecting duplicate documents and the possibilities of integrating them with a web search engine. It gives an overview of commonly used methods and selects one of them: approximation of the Jaccard similarity measure combined with shingling. The chosen method is adapted for implementation in the Egothor web search engine environment. The aim of the thesis is to present this implementation, describe its features, and find the parameters best suited to running the detection in real time. An important feature of the described method is that it also allows dynamic changes to the collection of indexed documents.
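
As an illustration of the kind of method the abstract names, the following is a minimal sketch of word shingling combined with a MinHash estimate of Jaccard similarity. MinHash is one common way to approximate the Jaccard measure; this record does not say which approximation the thesis uses, so the technique, the function names, and the parameters `k` and `num_hashes` are all assumptions made for illustration and are not taken from the thesis or from Egothor.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed, shingle):
    """64-bit hash of a shingle, parameterised by an integer seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """One minimum per seeded hash function; hypothetical parameter num_hashes."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [
        min(_seeded_hash(seed, s) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    print("exact:   ", jaccard(set_a, set_b))
    print("estimate:", estimate_jaccard(minhash_signature(set_a),
                                        minhash_signature(set_b)))
```

In a sketch like this, each document is reduced to a fixed-length signature that can be computed independently of the other documents in the collection. The estimate becomes more accurate as `num_hashes` grows, at the cost of more hashing per document.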

Identifier oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:313849
Date January 2012
Creators Sadloň, Vladimír
Contributors Galamboš, Leo, Kopecký, Michal
Source Sets Czech ETDs
Language Slovak
Detected Language English
Type info:eu-repo/semantics/masterThesis
Rights info:eu-repo/semantics/restrictedAccess