Data Selection using Topic Adaptation for Statistical Machine Translation

Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good-quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain overlap, and treat domains as synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence out-of-domain, per that definition). However, many training datasets consist of topically diverse data, and not all data contained in a single dataset are useful for translating a specific target task. In this study, we propose a new perspective on data quality and topical similarity to enhance SMT performance. Using our data adaptation approach, called topic adaptation, we select training data that is topically suited to the test data in order to produce better translations. We propose three topic adaptation approaches for the SMT process and investigate their effectiveness in both idealized and realistic settings using large parallel corpora. We measure the performance of SMT systems trained on topically similar data using BLEU, the widely used objective SMT performance metric. We show that topic adaptation approaches outperform baseline systems by 0.3 to 3 BLEU points when data selection parameters are carefully determined.

Identifier: oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-6780
Date: 01 November 2015
Creators: Matsushita, Hitokazu
Publisher: BYU ScholarsArchive
Source Sets: Brigham Young University
Detected Language: English
Type: text
Format: application/pdf
Source: All Theses and Dissertations
Rights: http://lib.byu.edu/about/copyright/
