Named entity recognition : challenges in document annotation, gazetteer construction and disambiguation

The 'information explosion' has generated an unprecedented amount of published information that is still growing at an astonishing rate. As the amount of information grows, managing it becomes increasingly challenging. A key to this challenge lies in the technology of Information Extraction, which automatically transforms unstructured textual data into structured representations that can be interpreted and manipulated by machines. A fundamental task in Information Extraction is Named Entity Recognition (NER), whose goals are to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. Further, due to the polysemous nature of natural language, name references are often ambiguous. Resolving ambiguity concerns recognising the true referent entity of a name reference, which is essentially a further named entity 'recognition' step and often a compulsory process required by tasks built on top of NER. This research presents a body of work aimed at addressing three research questions for NER. The first question concerns effective and efficient methods for training data annotation, which is the task of creating essential training examples for machine learning based NER methods. The second question studies automatically generating background knowledge for NER in the form of gazetteers, which are often critical resources for improving the performance of NER methods. The third question addresses resolving ambiguous name references, a further 'recognition' step that ensures the output of NER is usable by many complex tasks and applications. For each research question, the related literature has been carefully studied, and its limitations have been identified and discussed.
New hypotheses and methods have been proposed, leading to a number of contributions:
- an approach to training data annotation for supervised NER methods, based on the study of annotator suitability and suitability-based task allocation;
- a method of automatically expanding existing gazetteers of pre-defined semantic categories by exploiting the structure and knowledge of Wikipedia;
- a method of automatically generating untyped gazetteers for NER based on the 'topic-representativeness' of words in documents;
- a method of named entity disambiguation based on maximising the semantic relatedness between candidate entities in a text discourse;
- a review of lexical semantic relatedness measures, and a new lexical semantic relatedness measure that harnesses knowledge from different resources.

The proposed methods have been evaluated by carefully designed experiments, following the standard practice in each related research area. The results have confirmed the validity of the corresponding hypotheses, as well as the empirical effectiveness of these methods. Overall, it is believed that this research has made a solid contribution to research on NER and related areas.

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:570183
Date: January 2013
Creators: Zhang, Ziqi
Contributors: Ciravegna, Fabio
Publisher: University of Sheffield
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://etheses.whiterose.ac.uk/19276/
