Traditional information retrieval applications, such as Web search, return atomic units of retrieval, which are generically called "documents". Depending on the application, a document may be a Web page, an email message, a journal article, or any similar object. In contrast to this traditional approach, focused retrieval helps users better pinpoint their exact information needs by returning results at the sub-document level. These results may consist of predefined document components (such as pages, sections, and paragraphs), or they may consist of arbitrary passages, comprising any sub-string of a document. If a document is marked up with XML, a focused retrieval system might return individual XML elements or ranges of elements. This thesis proposes and evaluates a number of approaches to focused retrieval, including methods based on XML markup and methods based on arbitrary passages. It considers the best unit of retrieval, explores methods for efficient sub-document retrieval, and evaluates formulae for sub-document scoring. Focused retrieval is also considered in the specific context of Wikipedia, where methods for automatic vandalism detection and automatic link generation are developed and evaluated.
Identifier | oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/5645 |
Date | January 2010 |
Creators | Itakura, Kalista Yuki |
Source Sets | University of Waterloo Electronic Theses Repository |
Language | English |
Detected Language | English |
Type | Thesis or Dissertation |