Searching and ranking structured documents

It is common to see documents with explicit structure marked up in languages such as XML. Queries, on the other hand, typically have no structure. There is a clear mismatch: although documents contain structure, it is typically not used in information retrieval.
An efficient index structure for document-centric searching is proposed and its efficiency is discussed. It is shown to be at worst linear with respect to the number of occurrences of a given search term. The algorithm is then extended to accommodate element-centric information retrieval.
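As a rough illustration of why lookup cost can be bounded by the number of occurrences of a term, the following is a minimal sketch of a generic positional inverted index. It is not the index structure proposed in the thesis; the names `build_index` and `search` and the sample documents are assumptions made for this example only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index


def search(index, term):
    """Return ids of documents containing term.

    Each posting for the term is visited exactly once, so the work is
    linear in the number of occurrences of the term.
    """
    hits = []
    for doc_id, _position in index.get(term.lower(), []):
        # Postings for one document are adjacent, so comparing with the
        # last hit is enough to avoid duplicates.
        if not hits or hits[-1] != doc_id:
            hits.append(doc_id)
    return hits


# Hypothetical sample collection, for illustration only.
docs = {
    1: "structured documents in XML",
    2: "ranking XML documents and XML elements",
}
index = build_index(docs)
print(search(index, "xml"))
```

An element-centric variant could store the enclosing element with each posting instead of only the document id, at the cost of larger postings.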
Ranking algorithms for structured documents are examined. Genetic Algorithms are used to learn a weight for each structure present in a document. Applying these weights within a ranking function is shown to yield significant precision improvements for some functions. Genetic Programming is then used to learn an entire ranking function. This function is shown to be portable between document collections.
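To make the idea of per-structure weights concrete, here is a small sketch of a term-frequency score in which each occurrence is scaled by the weight of the structure it appears in. The function name `weighted_tf`, the element names, and the weight values are all hypothetical and hand-picked; in the thesis the weights are learned by a Genetic Algorithm, which this sketch does not attempt to reproduce.

```python
def weighted_tf(term, doc, weights):
    """Sum term occurrences per structure, scaled by that structure's weight.

    doc maps a structure name (e.g. an XML element) to its text.
    Structures without an explicit weight default to 1.0.
    """
    total = 0.0
    for tag, text in doc.items():
        count = text.lower().split().count(term.lower())
        total += weights.get(tag, 1.0) * count
    return total


# Hypothetical document and hand-picked weights, for illustration only.
doc = {
    "title": "XML retrieval",
    "body": "retrieval of XML elements from XML documents",
}
weights = {"title": 3.0, "body": 1.0}
print(weighted_tf("xml", doc, weights))  # 3.0 * 1 + 1.0 * 2 = 5.0
```

With all weights equal to 1.0 the score reduces to plain term frequency, which is the unstructured baseline such weights are meant to improve on.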
A query language for structured information retrieval is proposed. Use of this language in the 2004 INEX workshop resulted in a large decrease in query errors.
Structured information retrieval is now a viable alternative to its unstructured counterpart. A successful query language, efficient indexing structures, and improved ranking functions are all presented.

Identifier: oai:union.ndltd.org:ADTP/217501
Date: January 2007
Creators: Trotman, Andrew, n/a
Publisher: University of Otago. Department of Computer Science
Source Sets: Australasian Digital Theses Program
Language: English
Detected Language: English
Rights: http://policy01.otago.ac.nz/policies/FMPro?-db=policies.fm&-format=viewpolicy.html&-lay=viewpolicy&-sortfield=Title&Type=Academic&-recid=33025&-find), Copyright Andrew Trotman
