
Generation and application of semantic networks from plain text and Wikipedia

Natural Language Processing systems crucially depend on the availability of lexical and conceptual knowledge representations. They need to be able to disambiguate word senses and detect synonyms. In order to draw inferences, they require access to hierarchical relations between concepts (dog isAn animal) as well as non-hierarchical ones (gasoline fuels car). Knowledge resources such as lexical databases, semantic networks and ontologies explicitly encode such conceptual knowledge. However, these have traditionally been created manually, which is expensive and time-consuming for large resources and cannot provide adequate coverage in specialised domains. In order to alleviate this acquisition bottleneck, statistical methods have been developed to acquire lexical and conceptual knowledge automatically from text. In particular, unsupervised techniques have the advantage that they can be easily adapted to any domain, given some corpus on the topic. However, due to sparseness issues, they often require very large corpora to achieve high-quality results.

The spectrum of resources and statistical methods has a crucial gap in situations where manually created resources do not provide the necessary coverage and only limited corpora are available. This is the case for real-world domain applications such as an NLP system for processing technical information based on a limited amount of company documentation. We provide a large-scale demonstration that this gap can be filled through the use of automatically generated networks. The corpus is automatically transformed into a network representing the terms or concepts which occur in the text and their relations, based entirely on linguistic tools. The networks lie structurally between the unstructured corpus and the highly structured, manually created resources. We show that they can be useful in situations for which neither existing approach is applicable. In contrast to manually created resources, our networks can be generated quickly and on demand. Conversely, they make it possible to achieve higher-quality representations from less text than corpus-based methods, relaxing the requirement for very large corpora.

We devise scalable frameworks for building networks from plain text and Wikipedia with varying levels of expressiveness. This work creates concrete networks from the entire British National Corpus covering 1.2m terms and 21m relations, and a Wikipedia network covering 2.7m concepts. We develop a network-based semantic space model and evaluate it on the task of measuring semantic relatedness. In addition, noun compound paraphrasing is tackled to demonstrate the quality of the indirect paths in the network for describing relations between concepts. On both evaluations we achieve results competitive with the state of the art. In particular, our network-based methods outperform corpus-based methods, demonstrating the gain created by leveraging the network structure.
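The abstract describes networks whose nodes are terms or concepts and whose edges are relations extracted from text, with semantic relatedness measured over the resulting graph. The minimal sketch below illustrates that general idea only: the triples, the networkx library, the undirected graph, and the 1/(1 + distance) score are illustrative assumptions, not the method developed in the thesis.

```python
# Illustrative sketch: a toy semantic network from hypothetical
# (term, relation, term) triples, plus a simple path-based relatedness score.
# All specifics here are assumptions for demonstration purposes.
import networkx as nx

# Hypothetical triples as they might be extracted from parsed text.
triples = [
    ("dog", "isA", "animal"),
    ("cat", "isA", "animal"),
    ("gasoline", "fuels", "car"),
    ("car", "isA", "vehicle"),
    ("dog", "chases", "cat"),
]

# Build an undirected graph whose nodes are terms and whose edges
# carry the extracted relation as a label.
G = nx.Graph()
for head, relation, tail in triples:
    G.add_edge(head, tail, relation=relation)

def relatedness(term_a: str, term_b: str) -> float:
    """Toy path-based relatedness: terms closer in the network score higher."""
    try:
        distance = nx.shortest_path_length(G, term_a, term_b)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return 0.0
    return 1.0 / (1.0 + distance)

print(relatedness("dog", "cat"))      # direct edge -> 0.5
print(relatedness("dog", "vehicle"))  # no path in this toy graph -> 0.0
```

In this sketch, indirect paths (e.g. dog - animal - cat) are what connect terms that never co-occur directly, which is the intuition behind using the network structure rather than raw corpus counts.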

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:588383
Date January 2012
Creators Wojtinnek, Pia-Ramona
Contributors Pulman, Stephen
Publisher University of Oxford
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
Source http://ora.ox.ac.uk/objects/uuid:8b9e1aab-ff11-45a4-b321-e95cd2cb4a30
