Phrase searching in text indexes

Compare different approaches to phrase searching, and consider a new approach in which bigrams are used as index terms. This master's thesis focuses on the challenges of phrase searching in large text indexes and assesses alternative approaches to coping with such indexes. This goal was achieved by performing an experiment based on the theory of using bigrams consisting of stopwords as additional index terms. Recognizing the characteristics of inverted index structures, we used stopwords as indicators of very long posting lists. The characteristics of stopwords proved valuable, and the stopwords were collected from an already established index for a subset of the TREC GOV2 collection. As alternative approaches, we outlined two state-of-the-art index structures specifically designed to cope with the challenges of phrase searching. The first structure, the nextword index, is a modification of the inverted index structure. The second structure, the phrase index, uses the inverted structure with complete phrases as index terms. Our bigram index relies on the same manipulation of the inverted index structure as the phrase index, using bigrams of words to drastically cut posting list lengths. This was one of our main goals, as we identified the length of stopword posting lists as one of the primary challenges of phrase searching in inverted index structures. Using stopwords to create and select bigrams proved successful in enhancing phrase searching, as response times improved substantially. We conclude that our bigram index provides a significant performance increase in terms of query evaluation time and outperforms the standard inverted index for phrase searching.
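The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes a hypothetical stopword list, whitespace tokenization, and the rule that a bigram is indexed whenever at least one of its two words is a stopword; the names `STOPWORDS`, `build_index`, `plan` and `phrase_search` are invented for this illustration.

```python
from collections import defaultdict

# Assumed stopword list; the thesis derived its list from an existing index.
STOPWORDS = {"the", "of", "to", "in", "a", "and"}


def is_indexed_bigram(first, second):
    """Assumed selection rule: index a bigram if either word is a stopword."""
    return first in STOPWORDS or second in STOPWORDS


def build_index(docs):
    """Build positional postings for single words and for selected bigrams."""
    unigrams = defaultdict(lambda: defaultdict(list))
    bigrams = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, tok in enumerate(tokens):
            unigrams[tok][doc_id].append(pos)
            if pos + 1 < len(tokens) and is_indexed_bigram(tok, tokens[pos + 1]):
                bigrams[(tok, tokens[pos + 1])][doc_id].append(pos)
    return unigrams, bigrams


def plan(tokens):
    """Cover the query left to right, preferring bigram terms where indexed."""
    units, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and is_indexed_bigram(tokens[i], tokens[i + 1]):
            units.append(((tokens[i], tokens[i + 1]), i, True))
            i += 2
        else:
            units.append((tokens[i], i, False))
            i += 1
    return units


def phrase_search(query, unigrams, bigrams):
    """Return sorted ids of documents that contain the exact phrase."""
    candidates = None  # doc_id -> set of possible phrase start positions
    for term, offset, is_bigram in plan(query.lower().split()):
        postings = (bigrams if is_bigram else unigrams).get(term, {})
        # Shift each position back by the unit's offset within the phrase.
        starts = {d: {p - offset for p in ps} for d, ps in postings.items()}
        if candidates is None:
            candidates = starts
        else:
            shared = candidates.keys() & starts.keys()
            candidates = {d: candidates[d] & starts[d] for d in shared}
            candidates = {d: s for d, s in candidates.items() if s}
    return sorted(candidates) if candidates else []
```

In this sketch a query such as "the president" is answered from the single, short posting list of the bigram ("the", "president") instead of intersecting the very long list for "the" with the list for "president", which is the effect on posting list length that the abstract describes.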

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:ntnu-8880
Date January 2008
Creators Fellinghaug, Asbjørn Alexander
Publisher Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap
Source Sets DiVA Archive at Upsalla University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess