Global ETD Search

Return to search

Modeling Relevance in Statistical Machine Translation: Scoring Alignment, Context, and Annotations of Translation Instances

Machine translation has advanced considerably in recent years, primarily due to the availability of larger datasets. However, one cannot rely on the availability of copious, high-quality bilingual training data. In this work, we improve upon the state-of-the-art in machine translation with an instance-based model that scores each instance of translation in the corpus. A translation instance reflects a source and target correspondence at one specific location in the corpus. The significance of this approach is that our model is able to capture that some instances of translation are more relevant than others.
We have implemented this approach in Cunei, a new platform for machine translation that permits the scoring of instance-specific features. Leveraging per-instance alignment features, we demonstrate that Cunei can outperform Moses, a widely-used machine translation system. We then expand on this baseline system in three principal directions, each of which shows further gains. First, we score the source context of a translation instance in order to favor those that are most similar to the input sentence. Second, we apply similar techniques to score the target context of a translation instance and favor those that are most similar to the target hypothesis. Third, we provide a mechanism to mark-up the corpus with annotations (e.g. statistical word clustering, part-of-speech labels, and parse trees) and then exploit this information to create additional perinstance similarity features. Each of these techniques explicitly takes advantage of the fact that our approach scores each instance of translation on demand after the input sentence is provided and while the target hypothesis is being generated; similar extensions would be impossible or quite difficult in existing machine translation systems.
Ultimately, this approach provides a more exible framework for integration of novel features that adapts better to new data. In our experiments with German-English and Czech-English translation, the addition of instance-specific features consistently shows improvement.

Computational Linguistics

Identifer	oai:union.ndltd.org:cmu.edu/oai:repository.cmu.edu:dissertations-1142
Date	01 January 2012
Creators	Phillips, Aaron B.
Publisher	Research Showcase @ CMU
Source Sets	Carnegie Mellon University
Detected Language	English
Type	text
Format	application/pdf
Source	Dissertations

Page generated in 0.0023 seconds

Modeling Relevance in Statistical Machine Translation: Scoring Alignment, Context, and Annotations of Translation Instances

Description

Links & Downloads

Tags

Additional Fields