Global ETD Search

Return to search

Redistribution of Documents across Search Engine Clusters

The goal of this master thesis has been to evaluate methods for redistribution of data on search engine clusters. For all of the methods the redistribution is done when the cluster changes size. Redistribution methods that are specifically designed for search engines are not common, so the methods compared in this thesis are based on other distributed settings. This is from among other things distributed database systems, distributed files and continuous media systems. The evaluation of the methods consists of two parts, a theoretical analysis and an implementation and testing of the methods. In the theoretical analysis the methods are compared by deduction of expressions of performance. In the practical approach the algorithms are implemented on a simplified search engine cluster of 6 computers. The methods have been evaluated using three criteria. The first criteria of evaluation are how well the methods distribute documents across the cluster. In the theoretical analysis this also includes worst case scenarios. The practical evaluation compares the distribution at the end of the tests. The second criterion of evaluation is efficiency of document access. The theoretical approach focuses on the number of operations required while the practical approach calculates indexing throughput. The last area of focus examined is the document volume transported during redistribution. For the final part of the comparison of the methods, some relevant scenarios are introduced. These scenarios focus on dynamic data sets with high frequency of updates, often new documents and much searching. Using the scenarios and results from the method testing, we found some methods that performed be better than others. It is worth noting that the conclusions are for a given the type of workload from the scenarios and the setting for the test. Given other situations, other methods might be more suitable. When concluding our results we found, for the give scenarios, the best distribution method was the distributed version of linear hashing (LH*). The results from the method using hashing/range-partitioning also showed to be the least suitable as a consequence of high transport volume.

ntnudaim

MIT informatikk

Informasjonsforvaltning

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:ntnu-8967
Date	January 2009
Creators	Høyum, Øystein
Publisher	Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, Institutt for datateknikk og informasjonsvitenskap
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess

Page generated in 0.0015 seconds

Redistribution of Documents across Search Engine Clusters

Description

Links & Downloads

Tags

Additional Fields