Return to search

Dynamic indexes vs. static hierarchies for substring search

<p>This report explores the problem of substring search in a dynamic document set. The operations supported are document inclusion, document removal and queries. This is a well explored field for word indexes, but not for substring indexes. The contributions of this report is the exploration of a multi-document dynamic suffix tree (MDST), which is compared with using a hierarchy of static indexes using suffix arrays. Only memory resident data structures are explored. The concept of a ``generalised suffix tree'', indexing a static set of strings, is used in bioinformatics. The implemented data structure adds online document inclusion, update and removal, linear on the single document size. Various models for the hierarchy of static indexes is explored, some which of give faster update, and some faster search. For the static suffix arrays, the BPR cite{SS05} construction algorithm is used, which is the fastest known. This algorithm is about 3-4 times faster than the implemented suffix tree construction. Two tricks for speeding up search and hit reporting in the suffix array are also explored: Using a start index for the binary search, and a direct map of global addresses to document IDs and local addresses. The tests show that the MDST is much faster than the hierarchic indexes when the index freshness requirement is absolute, and the documents are small. The tree uses about three times as much memory as the suffix arrays. When there is a large number of hits, the suffix arrays are slightly faster on reporting hits, as there they have better memory locality. If you have enough primary memory, the MDST seems to be the best choice in general.</p>

Identiferoai:union.ndltd.org:UPSALLA/oai:DiVA.org:ntnu-9225
Date January 2005
CreatorsGrimsmo, Nils
PublisherNorwegian University of Science and Technology, Department of Computer and Information Science, Institutt for datateknikk og informasjonsvitenskap
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, text

Page generated in 0.0018 seconds