Bitext alignment is the task of aligning a text in a source language and its translation in the target language. Aligning amounts to finding the translational correspondences between textual units at different levels of granularity. Many practical natural language processing applications rely on bitext alignments to access the rich linguistic knowledge present in a bitext. While the most predominant application for bitexts is statistical machine translation, they are also used in multilingual (and monolingual) lexicography, word sense disambiguation, terminology extraction, computer-aided language learning andtranslation studies, to name a few.Bitext alignment is an arduous task because meaning is not expressed seemingly across languages. It varies along linguistic properties and cultural backgrounds of different languages, and also depends on the translation strategy that have been used to produce the bitext.Current practices in bitext alignment model the alignment as a hidden variable in the translation process. In order to reduce the complexity of the task, such approaches suppose that a word in the source sentence is aligned to one word at most in the target sentence.However, this over-simplistic assumption results in asymmetric, one-to-many alignments, whereas alignments are typically symmetric and many-to-many.To achieve symmetry, two one-to-many alignments in opposite translation directions are built and combined using a heuristic.In order to use these word alignments in phrase-based translation systems which use phrases instead of words, a heuristic is used to extract phrase pairs that are consistent with the word alignment.In this dissertation we address both the problems of word alignment and phrase pairs extraction.We improve the state of the art in several ways using discriminative learning techniques.We present a maximum entropy (MaxEnt) framework for word alignment.In this framework, links are predicted independently from one another using a MaxEnt classifier.The interaction between alignment decisions is approximated using stackingtechniques, which allows us to account for a part of the structural dependencies without increasing the complexity. This formulation can be seen as an alignment combination method,in which the union of several input alignments is used to guide the output alignment. Additionally, input alignments are used to compute a rich set of feature functions.Our MaxEnt aligner obtains state of the art results in terms of alignment quality as measured by thealignment error rate, and translation quality as measured by BLEU on large-scale Arabic-English NIST'09 systems.We also present a translation quality informed procedure for both extraction and evaluation of phrase pairs. We reformulate the problem in the supervised framework in which we decide for each phrase pair whether we keep it or not in the translation model. This offers a principled way to combine several features to make the procedure more robust to alignment difficulties. We use a simple and effective method, based on oracle decoding,to annotate phrase pairs that are useful for translation. Using machine learning techniques based on positive examples only,these annotations can be used to learn phrase alignment decisions. Using this approach we obtain improvements in BLEU scores for recall-oriented translation models, which are suitable for small training corpora.
Identifer | oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00720250 |
Date | 27 June 2012 |
Creators | Tomeh, Nadi |
Publisher | Université Paris Sud - Paris XI |
Source Sets | CCSD theses-EN-ligne, France |
Language | English |
Detected Language | English |
Type | PhD thesis |
Page generated in 0.0024 seconds