The detection of new information in a document stream is an important component of many potential applications. In this thesis, a new novelty detection approach based on the identification of sentence-level information patterns is proposed. Given a user's information need, some information patterns in sentences, such as combinations of query words, sentence lengths, named entities and phrases, and other sentence patterns, may contain more important and relevant information than single words. The work of the thesis includes three parts. First, we redefine what novelty detection is in light of the proposed information patterns. Examples of several different types of information patterns are given, corresponding to different types of users' information needs. Second, we analyze why the proposed information pattern concept has a significant impact on novelty detection. A thorough analysis of sentence-level information patterns is carried out on data from the TREC novelty tracks, including sentence lengths, named entities (NEs), and sentence-level opinion patterns. Finally, we present how we perform novelty detection based on information patterns, focusing on the identification of previously unseen query-related patterns in sentences. A unified pattern-based approach to novelty detection is presented for both specific NE topics and more general topics. Experiments on novelty detection were carried out on data from the TREC 2002, 2003 and 2004 novelty tracks. Experimental results show that the proposed approach significantly improves the performance of novelty detection for both specific and general topics, and therefore the overall performance for all topics, in terms of precision at top ranks. Future research directions are suggested.
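The core idea of flagging sentences that carry previously unseen query-related patterns can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `extract_patterns` and `detect_novel` are hypothetical, and a capitalized-word heuristic stands in for real named-entity recognition, which the abstract does not specify.

```python
import re


def extract_patterns(sentence, query_terms):
    """Return a set of crude 'information patterns' found in a sentence.

    Assumption: a non-initial capitalized token approximates a named
    entity; a real system would use a proper NE tagger instead.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    patterns = set()
    for i, tok in enumerate(tokens):
        # Skip the first token, which is capitalized in any sentence.
        if i > 0 and tok[0].isupper():
            patterns.add(("entity", tok.lower()))
        if tok.lower() in query_terms:
            patterns.add(("query", tok.lower()))
    return patterns


def detect_novel(sentences, query_terms):
    """Keep sentences that contain at least one pattern not seen before."""
    query_terms = {t.lower() for t in query_terms}
    seen = set()
    novel = []
    for sentence in sentences:
        patterns = extract_patterns(sentence, query_terms)
        if patterns - seen:  # something new relative to earlier sentences
            novel.append(sentence)
        seen |= patterns
    return novel


stream = [
    "The storm hit Florida on Monday.",
    "The storm hit Florida on Monday again.",
    "Later the storm reached Georgia.",
]
result = detect_novel(stream, ["storm"])
print(result)
```

In this toy stream the second sentence repeats only patterns already seen, so it is dropped, while the third is kept because it introduces a new entity.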
Identifier | oai:union.ndltd.org:UMASS/oai:scholarworks.umass.edu:dissertations-4433 |
Date | 01 January 2006 |
Creators | Li, Xiaoyan |
Publisher | ScholarWorks@UMass Amherst |
Source Sets | University of Massachusetts, Amherst |
Language | English |
Detected Language | English |
Type | text |
Source | Doctoral Dissertations Available from Proquest |