Measuring Semantic Distance using Distributional Profiles of Concepts

Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each.

This thesis argues that semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Instead of relying on the co-occurrence (distributional) profiles of words, as approaches based on the distributional hypothesis do, I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus). The algorithm estimates the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though using only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state of the art in a number of natural language tasks.
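
As a rough illustration of the idea, the sketch below builds a word-concept co-occurrence matrix from unannotated text and compares two concepts by the cosine of their profiles. It is a simplified sketch under stated assumptions, not the thesis's algorithm: the toy thesaurus, the names `build_dpcs` and `concept_distance`, the window size, raw counts as the strength of association, and cosine as the distance measure are all illustrative choices that the abstract does not specify. Each co-occurrence is credited to every thesaurus category that lists the target word, which is one simple way to avoid needing sense-annotated data.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy thesaurus: category (coarse concept) -> words listed under it.
# An ambiguous word such as "bank" appears under more than one category.
THESAURUS = {
    "FINANCE": {"bank", "money", "loan"},
    "RIVER": {"bank", "water", "shore"},
    "CELESTIAL": {"star", "planet"},
}


def build_dpcs(sentences, thesaurus, window=2):
    """Estimate one co-occurrence profile per thesaurus category.

    Assumption: a co-occurrence with a target word is credited to every
    category that lists that word, so no sense annotation is needed.
    """
    word_to_cats = defaultdict(set)
    for cat, words in thesaurus.items():
        for word in words:
            word_to_cats[word].add(cat)

    dpcs = {cat: Counter() for cat in thesaurus}
    for tokens in sentences:
        for i, target in enumerate(tokens):
            cats = word_to_cats.get(target)
            if not cats:
                continue  # target word is not in the thesaurus
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                for cat in cats:
                    dpcs[cat][tokens[j]] += 1
    return dpcs


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * q[word] for word, count in p.items() if word in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def concept_distance(dpcs, c1, c2):
    """Distance between two concepts: 1 minus the cosine of their profiles."""
    return 1.0 - cosine(dpcs[c1], dpcs[c2])
```

Given a list of tokenized sentences, `build_dpcs(sentences, THESAURUS)` returns one profile per category, and `concept_distance(dpcs, "FINANCE", "RIVER")` compares two of them. A realistic version would need a large corpus, a weighting scheme that discounts frequent function words, and some way of reducing the noise introduced by ambiguous words.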

I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization.

The proposed approach is computationally inexpensive, can estimate both semantic relatedness and semantic similarity, and can be applied to all parts of speech. Extensive experiments on ranking word pairs by semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.

http://hdl.handle.net/1807/11238

Computational Linguistics

Natural Language Processing

Lexical semantics

semantic distance

distributional similarity

semantic similarity

semantic relatedness

word concept co-occurrence matrix

distributional profiles of concepts

thesaurus

corpus-based techniques

word senses

cross-lingual techniques

word sense dominance

word sense disambiguation

wordnet

0984

0800

Identifier	oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/11238
Date	01 August 2008
Creators	Mohammad, Saif
Contributors	Hirst, Graeme
Source Sets	University of Toronto
Language	en_US
Detected Language	English
Type	Thesis
Format	1257436 bytes, application/pdf
