This dissertation presents a new approach to solving the coreference resolution problem for a natural language processing (NLP) task known as information extraction. It describes a new system, named RESOLVE, that uses machine learning techniques to determine when two phrases in a text co-refer, i.e., refer to the same thing. RESOLVE can be used as a component within an information extraction system--a system that extracts information automatically from a corpus of texts that all focus on the same topic area--or it can be used as a stand-alone system to evaluate the relative contribution of different types of knowledge to the coreference resolution process. RESOLVE represents an improvement over previous approaches to the coreference resolution problem, in that it uses a machine learning algorithm to handle some of the work that had previously been performed manually by a knowledge engineer. RESOLVE can achieve performance that is as good as that of a system that was manually constructed for the same task, when both systems are given access to the same knowledge and tested on the same data. The machine learning algorithm used by RESOLVE can be given access to different types of knowledge, some portions of which are very specific to a particular topic area or domain, and other portions of which are more general or domain-independent. An ablation experiment shows that domain-specific knowledge is very important to coreference resolution--the performance degradation when the domain-specific features are disabled is significantly greater than when a similarly-sized set of domain-independent features is disabled. However, even though domain-specific knowledge is important for coreference resolution, domain-independent features alone enable RESOLVE to achieve 80% of the performance it achieves when domain-specific features are available.
One explanation for why domain-independent knowledge can be used so effectively is illustrated in another domain, where the machine learning algorithm discovers domain-specific knowledge by assembling domain-independent features into domain-specific patterns. This ability of RESOLVE to compensate for missing or insufficient domain-specific knowledge is a significant advantage when redeploying the system in new domains.
Identifier | oai:union.ndltd.org:UMASS/oai:scholarworks.umass.edu:dissertations-2791 |
Date | 01 January 1996 |
Creators | McCarthy, Joseph Francis |
Publisher | ScholarWorks@UMass Amherst |
Source Sets | University of Massachusetts, Amherst |
Language | English |
Detected Language | English |
Type | text |
Source | Doctoral Dissertations Available from Proquest |