Global ETD Search

Return to search

Data Selection using Topic Adaptation for Statistical Machine Translation

Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain-overlap, and domains are defined to be synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence, out-of-domain, per the definition). However, many training datasets consist of topically diverse data, and not all data contained in a single dataset are useful for translations of a specific target task. In this study, we propose a new perspective on data quality and topical similarity to enhance SMT performance. Using our data adaptation approach called topic adaptation, we select topically suitable training data corresponding to test data in order to produce better translations. We propose three topic adaptation approaches for the SMT process and investigate the effectiveness in both idealized and realistic settings using large parallel corpora. We measure performance of SMT systems trained on topically similar data and their effectiveness based on BLEU, the widely-used objective SMT performance metric. We show that topic adaptation approaches outperform baseline systems (0.3 – 3 BLEU points) when data selection parameters are carefully determined.

topic adaptation

data selection

statistical machine translation

Computer Sciences

Identifer	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-6780
Date	01 November 2015
Creators	Matsushita, Hitokazu
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	All Theses and Dissertations
Rights	http://lib.byu.edu/about/copyright/

Page generated in 0.0021 seconds

Data Selection using Topic Adaptation for Statistical Machine Translation

Description

Links & Downloads

Tags

Additional Fields