
The theory of extended topic and its application in information retrieval

This thesis analyses the structure of natural language queries to document repositories, with the aim of finding better methods for information retrieval. The exponential increase of information on the Web and in other large document repositories during recent decades motivates research on facilitating the process of finding relevant information to meet end users' information needs. A shared problem among several related research areas, such as information retrieval, text summarisation and question answering, is to derive concise textual expressions that describe what a document is about and that function as the bridge between queries and the document content. In current approaches, such textual expressions are typically generated from shallow features, for example, by simply selecting a few of the most frequently occurring key words. However, such approaches are inadequate for generating expressions that truly resemble user queries. The study of what a document is about is closely related to the widely discussed notion of topic, which is defined in many different ways in theoretical linguistics as well as in practical natural language processing research. We compare these different definitions and analyse how they differ from user queries. The main function of a query is to define which facts are relevant in some underlying knowledge base. We show that, to serve this purpose, queries are typically formulated by (a) specifying a focused entity and then (b) defining a perspective from which the entity is approached. For example, in the query 'history of Britain', 'Britain' is the focused entity and 'history' is the perspective. Existing theories of topic often focus on (a) and leave out (b). We develop a theory of extended topic to formalise this distinction. We demonstrate the distinction in experiments with real-life topic expressions, such as WH-questions and phrases describing plans of academic papers.
The theory of extended topic could benefit various application areas, such as knowledge organisation and title generation. We focus on applying the theory to the problem of information retrieval from a document repository. Typical current information retrieval systems retrieve documents relevant to a query by counting the number of key word matches between a document and the query. This approach is better suited to retrieving the focused entities than the perspectives. We aim to improve the performance of information retrieval by providing better support for perspectives. To do so, we further subdivide the perspectives into different types and present different approaches to addressing each type. We illustrate our approaches with three example perspectives: 'cause', 'procedure' and 'biography'. Experiments on retrieval for causal, procedural and biographical questions achieve better results than the traditional key-word-matching approach.
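
The key-word-matching baseline described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the tokenisation rule and the two toy documents are assumptions made purely for illustration. It shows how a plain match count can rank a document that mentions the focused entity ('Britain') many times above one that actually addresses the perspective ('history').

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_match_score(query, document):
    """Count how many document tokens match any key word of the query."""
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in query_terms)

def rank(query, documents):
    """Return document indices ordered by descending match count."""
    scores = [keyword_match_score(query, d) for d in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy documents (not taken from the thesis).
docs = [
    # Mentions the focused entity often, but not from the 'history' perspective.
    "Britain exports tea. Britain imports tea. Britain is an island. "
    "Britain has many ports.",
    # Addresses both the entity and the perspective, but with fewer matches.
    "The history of Britain spans Roman, medieval and modern periods.",
]

ranking = rank("history of Britain", docs)
```

With these toy documents the first document scores 4 and the second scores 3, so the entity-heavy document is ranked first even though only the second one is about the history of Britain.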

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:589994
Date: January 2012
Creators: Yin, Ling
Publisher: University of Brighton
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: https://research.brighton.ac.uk/en/studentTheses/957adc51-7be2-45a3-8207-f07831f7310e
