Nowadays several indicators suggest that the statistical approach to machinetranslation is the most promising. It allows fast development of systems for anylanguage pair provided that sufficient training data is available.Statistical Machine Translation (SMT) systems use parallel texts ‐ also called bitexts ‐ astraining material for creation of the translation model and monolingual corpora fortarget language modeling.The performance of an SMT system heavily depends upon the quality and quantity ofavailable data. In order to train the translation model, the parallel texts is collected fromvarious sources and domains. These corpora are usually concatenated, word alignmentsare calculated and phrases are extracted.However, parallel data is quite inhomogeneous in many practical applications withrespect to several factors like data source, alignment quality, appropriateness to thetask, etc. This means that the corpora are not weighted according to their importance tothe domain of the translation task. Therefore, it is the domain of the training resourcesthat influences the translations that are selected among several choices. This is incontrast to the training of the language model for which well‐known techniques areused to weight the various sources of texts.We have proposed novel methods to automatically weight the heterogeneous data toadapt the translation model.In a first approach, this is achieved with a resampling technique. A weight to eachbitexts is assigned to select the proportion of data from that corpus. The alignmentscoming from each bitexts are resampled based on these weights. The weights of thecorpora are directly optimized on the development data using a numerical method.Moreover, an alignment score of each aligned sentence pair is used as confidencemeasurement.In an extended work, we obtain such a weighting by resampling alignments usingweights that decrease with the temporal distance of bitexts to the test set. By thesemeans, we can use all the available bitexts and still put an emphasis on the most recentone. The main idea of our approach is to use a parametric form or meta‐weights for theweighting of the different parts of the bitexts. This ensures that our approach has onlyfew parameters to optimize.In another work, we have proposed a generic framework which takes into account thecorpus and sentence level "goodness scores" during the calculation of the phrase‐tablewhich results into better distribution of probability mass of the individual phrase pairs.
Identifer | oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00718226 |
Date | 29 June 2012 |
Creators | Shah, Kashif |
Publisher | Université du Maine |
Source Sets | CCSD theses-EN-ligne, France |
Language | English |
Detected Language | English |
Type | PhD thesis |
Page generated in 0.0022 seconds