Designing a general framework for text alignment : case studies with two South Asian languages

Building machine translation systems for many South Asian languages (such as Hindi, Gujarati, etc.) using statistical methods is problematic. The primary reason is insufficient parallel data to learn accurate word alignment. Additionally, these languages are morphologically rich and have free word order. When it is difficult to rely purely on statistical methods due to insufficient data, research shows that better performance can be obtained by building hybrid systems that rely on language specific resources, such as morphological analysers or dictionaries, as well as statistical methods. However, it is difficult to find such language specific resources for many South Asian languages. Since languages such as Hindi, Gujarati, Urdu, Bengali, Punjabi and Marathi are all very similar in structure and the main differences lie in the script and vocabulary used for these languages, we hypothesise that it is possible to develop resources for one of these languages and generalize the approach to allow rapid bootstrapping of similar resources for the other closely related languages -- with minimal effort and similar accuracies. To verify this, we develop a few resources for the Hindi language, including a sentence alignment algorithm, a morphological analyser and a transliteration similarity component and generalize the approach to allow rapid bootstrapping of similar resources for the Gujarati language. We show that the approach works on both the Hindi and Gujarati languages and achieves results that are comparable to similar state-of-the-art (SOA) resources available for these languages. We also hypothesise that it is possible to develop a high performance hybrid word alignment algorithm that relies on such language specific resources. To verify this, we design, implement and evaluate a novel English-Hindi hybrid word alignment system that uses the Hindi specific resources developed by us. 
We show not only that our word alignment system outperforms other SOA English-Hindi word alignment systems, but also that it is simple to adapt to the English-Gujarati language pair.

http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.557559

491.4

Identifier	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:557559
Date	January 2012
Creators	Aswani, Niraj
Contributors	Gaizauskas, Robert
Publisher	University of Sheffield
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	http://etheses.whiterose.ac.uk/2618/
