Spelling suggestions: "subject:"phrase based translation"" "subject:"thrase based translation""
1 |
Refinements in hierarchical phrase-based translation systemsPino, Juan Miguel January 2015 (has links)
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model as well as improvements and analyses in various modules for hierarchical phrase-based systems. We took the opportunity of increasing amounts of available training data for machine translation as well as existing frameworks for distributed computing in order to build better infrastructure for extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation. The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models in extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation and demonstrate translation improvements in Chinese to English translation. This thesis also proposes refinements in grammar and language modelling both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for grammar and language model cross-domain adaptation. We also study interactions between first-pass and second-pass language model in terms of size and n-gram order. Finally, we analyse two smoothing methods for large 5-gram language model rescoring. The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
|
2 |
Analýza chyb a možností zlepšení frázového strojového překladu z angličtiny do urdštiny / Analýza chyb a možností zlepšení frázového strojového překladu z angličtiny do urdštinyAta, Naila January 2010 (has links)
No description available.
|
3 |
Analýza chyb a možností zlepšení frázového strojového překladu z angličtiny do urdštinyAta, Naila January 2011 (has links)
This thesis evaluates the translation quality of phrase-based machine translation system. It explains the translation error annotation scheme to manually annotate errors related to English to Urdu translation system. The primary goal of the thesis is to experiment with different heuristic in order to improve the translation quality based on through manual analysis of 200 test sentences. Different hueristics such as (1) pre-processing of input English, such as word reordering, (2) preprocessing the training corpus in order to improve word alignment, (3) using additional factors (in Moses factored translation) to better model target-side morphological coherence are applied and their impact on the translation quality is evaluated.
|
4 |
Discriminative Alignment Models For Statistical Machine TranslationTomeh, Nadi 27 June 2012 (has links) (PDF)
Bitext alignment is the task of aligning a text in a source language and its translation in the target language. Aligning amounts to finding the translational correspondences between textual units at different levels of granularity. Many practical natural language processing applications rely on bitext alignments to access the rich linguistic knowledge present in a bitext. While the most predominant application for bitexts is statistical machine translation, they are also used in multilingual (and monolingual) lexicography, word sense disambiguation, terminology extraction, computer-aided language learning andtranslation studies, to name a few.Bitext alignment is an arduous task because meaning is not expressed seemingly across languages. It varies along linguistic properties and cultural backgrounds of different languages, and also depends on the translation strategy that have been used to produce the bitext.Current practices in bitext alignment model the alignment as a hidden variable in the translation process. In order to reduce the complexity of the task, such approaches suppose that a word in the source sentence is aligned to one word at most in the target sentence.However, this over-simplistic assumption results in asymmetric, one-to-many alignments, whereas alignments are typically symmetric and many-to-many.To achieve symmetry, two one-to-many alignments in opposite translation directions are built and combined using a heuristic.In order to use these word alignments in phrase-based translation systems which use phrases instead of words, a heuristic is used to extract phrase pairs that are consistent with the word alignment.In this dissertation we address both the problems of word alignment and phrase pairs extraction.We improve the state of the art in several ways using discriminative learning techniques.We present a maximum entropy (MaxEnt) framework for word alignment.In this framework, links are predicted independently from one another using a MaxEnt classifier.The interaction between alignment decisions is approximated using stackingtechniques, which allows us to account for a part of the structural dependencies without increasing the complexity. This formulation can be seen as an alignment combination method,in which the union of several input alignments is used to guide the output alignment. Additionally, input alignments are used to compute a rich set of feature functions.Our MaxEnt aligner obtains state of the art results in terms of alignment quality as measured by thealignment error rate, and translation quality as measured by BLEU on large-scale Arabic-English NIST'09 systems.We also present a translation quality informed procedure for both extraction and evaluation of phrase pairs. We reformulate the problem in the supervised framework in which we decide for each phrase pair whether we keep it or not in the translation model. This offers a principled way to combine several features to make the procedure more robust to alignment difficulties. We use a simple and effective method, based on oracle decoding,to annotate phrase pairs that are useful for translation. Using machine learning techniques based on positive examples only,these annotations can be used to learn phrase alignment decisions. Using this approach we obtain improvements in BLEU scores for recall-oriented translation models, which are suitable for small training corpora.
|
5 |
On improving natural language processing through phrase-based and one-to-one syntactic algorithmsMeyer, Christopher Henry January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / William H. Hsu / Machine Translation (MT) is the practice of using computational methods to convert words from one natural language to another. Several approaches have been created since MT’s inception in the 1950s and, with the vast increase in computational resources since then, have continued to evolve and improve. In this thesis I summarize several branches of MT theory and introduce several newly developed software applications, several parsing techniques to improve Japanese-to-English text translation, and a new key algorithm to correct translation errors when converting from Japanese kanji to English. The overall translation improvement is measured using the BLEU metric (an objective, numerical standard in Machine Translation quality analysis). The baseline translation system was built by combining Giza++, the Thot Phrase-Based SMT toolkit, the SRILM toolkit, and the Pharaoh decoder. The input and output parsing applications were created as intermediary to improve the baseline MT system as to eliminate artificially high improvement metrics. This baseline was measured with and without the additional parsing provided by the thesis software applications, and also with and without the thesis kanji correction utility.
The new algorithm corrected for many contextual definition mistakes that are common when converting from Japanese to English text. By training the new kanji correction utility on an existing dictionary, identifying source text in Japanese with a high number of possible translations, and checking the baseline translation against other translation possibilities; I was able to increase the translation performance of the baseline system from minimum normalized BKEU scores of .0273 to maximum normalized scores of .081.
The preliminary phase of making improvements to Japanese-to-English translation focused on correcting segmentation mistakes that occur when attempting to parse Japanese text into meaningful tokens. The initial increase is not indicative of future potential and is artificially high as the baseline score was so low to begin with, but was needed to create a reasonable baseline score.
The final results of the tests confirmed that a significant, measurable improvement had been achieved through improving the initial segmentation of the Japanese text through parsing the input corpora and through correcting kanji translations after the Pharaoh decoding process had completed.
|
6 |
Discriminative Alignment Models For Statistical Machine Translation / Modèles Discriminants d'Alignement Pour La Traduction Automatique StatistiqueTomeh, Nadi 27 June 2012 (has links)
La tâche d'alignement d'un texte dans une langue source avec sa traduction en langue cible est souvent nommée alignement de bi-textes. Elle a pour but de faire émerger les relations de traduction qui peuvent s'exprimer à différents niveaux de granularité entre les deux faces du bi-texte. De nombreuses applications de traitement automatique des langues naturelles s'appuient sur cette étape afin d'accéder à des connaissances linguistiques de plus haut niveau.Parmi ces applications, nous pouvons citer bien sûr la traduction automatique, mais également l'extraction de lexiques et de terminologies bilingues, la désambigüisation sémantique ou l'apprentissage des langues assisté par ordinateur.La complexité de la tâche d'alignement de bi-textes s'explique par les différences linguistiques entre les langues. Ces différences peuvent être d'ordre sémantique, syntaxique, ou morphologique.Dans le cadre des approches probabilistes, l'alignement de bi-textes est modélisé par un ensemble de variables aléatoires cachés. Afin de réduire la complexité du problème, le processus aléatoire sous-jacent fait l'hypothèse simplificatrice qu'un mot en langue source est lié à au plus un mot en langue cible, ce qui induit une relation de traduction asymétrique. Néanmoins, cette hypothèse est simpliste, puisque les alignements peuvent de manière générale impliquer des groupes de mots dans chacune des langues. Afin de rétablir cette symétrie, chaque langue est considérée tour à tour comme la langue source et les deux alignements asymétriques résultants sont combinés à l'aide d'une heuristique. Cette étape de symétrisation revêt une importance particulière dans l’approche standard en traduction automatique, puisqu'elle précède l'extraction des unités de traduction, à savoir les paires de segments.L'objectif de cette thèse est de proposer de nouvelles approches pour d'une part l'alignement debi-texte, et d'autre part l'extraction des unités de traduction. La spécificité de notre approche consiste à remplacer les heuristiques utilisées par des modèles d'apprentissage discriminant.Nous présentons un modèle "Maximum d'entropie'' (ou MaxEnt) pour l'alignement de mot, pour lequel chaque lien d'alignement est prédit de manière indépendante. L'interaction entre les liens d'alignement est alors prise en compte par l'empilement ("stacking'') d'un second modèle prenant en compte la structure à prédire sans pour autant augmenter la complexité globale. Cette formulation peut être vue comme une manière d'apprendre la combinaison de différentes méthodes d'alignement: le modèle considère ainsi l'union des alignements d'entrées pour en sélectionner les liens jugés fiables. Le modèle MaxEnt proposé permet d'améliorer les performances d'un système état de l'art de traduction automatique en considérant le jeu de données de la tâche NIST'09, Arabe vers Anglais. Ces améliorations sont mesurées en terme de taux d'erreur sur les alignements et aussi en terme de qualité de traduction via la métrique automatique BLEU.Nous proposons également un modèle permettant à la fois de sélectionner et d'évaluer les unités de traduction extraites d'un bi texte aligné. Ces deux étapes sont reformulées dans le cadre de l'apprentissage supervisé, afin de modéliser la décision de garder ou pas une paire de segments comme une unité fiable de traduction. Ce cadre permet l'utilisation de caractéristiques riches et nombreuses favorisant ainsi une décision robuste. Nous proposons une méthode simple et efficace pour annoter les paires de segments utiles pour la traduction. Le problème d'apprentissage automatique qui se pose alors est particulier, puisque nous disposons que d'exemples positifs. Nous proposons donc d'utiliser l'approche SVM à une classe afin de modéliser la sélection des unités de traduction.Grâce à cette approche, nous obtenons des améliorations significatives en terme de score BLEU pour un système entrainé avec un petit ensemble de données. / Bitext alignment is the task of aligning a text in a source language and its translation in the target language. Aligning amounts to finding the translational correspondences between textual units at different levels of granularity. Many practical natural language processing applications rely on bitext alignments to access the rich linguistic knowledge present in a bitext. While the most predominant application for bitexts is statistical machine translation, they are also used in multilingual (and monolingual) lexicography, word sense disambiguation, terminology extraction, computer-aided language learning andtranslation studies, to name a few.Bitext alignment is an arduous task because meaning is not expressed seemingly across languages. It varies along linguistic properties and cultural backgrounds of different languages, and also depends on the translation strategy that have been used to produce the bitext.Current practices in bitext alignment model the alignment as a hidden variable in the translation process. In order to reduce the complexity of the task, such approaches suppose that a word in the source sentence is aligned to one word at most in the target sentence.However, this over-simplistic assumption results in asymmetric, one-to-many alignments, whereas alignments are typically symmetric and many-to-many.To achieve symmetry, two one-to-many alignments in opposite translation directions are built and combined using a heuristic.In order to use these word alignments in phrase-based translation systems which use phrases instead of words, a heuristic is used to extract phrase pairs that are consistent with the word alignment.In this dissertation we address both the problems of word alignment and phrase pairs extraction.We improve the state of the art in several ways using discriminative learning techniques.We present a maximum entropy (MaxEnt) framework for word alignment.In this framework, links are predicted independently from one another using a MaxEnt classifier.The interaction between alignment decisions is approximated using stackingtechniques, which allows us to account for a part of the structural dependencies without increasing the complexity. This formulation can be seen as an alignment combination method,in which the union of several input alignments is used to guide the output alignment. Additionally, input alignments are used to compute a rich set of feature functions.Our MaxEnt aligner obtains state of the art results in terms of alignment quality as measured by thealignment error rate, and translation quality as measured by BLEU on large-scale Arabic-English NIST'09 systems.We also present a translation quality informed procedure for both extraction and evaluation of phrase pairs. We reformulate the problem in the supervised framework in which we decide for each phrase pair whether we keep it or not in the translation model. This offers a principled way to combine several features to make the procedure more robust to alignment difficulties. We use a simple and effective method, based on oracle decoding,to annotate phrase pairs that are useful for translation. Using machine learning techniques based on positive examples only,these annotations can be used to learn phrase alignment decisions. Using this approach we obtain improvements in BLEU scores for recall-oriented translation models, which are suitable for small training corpora.
|
7 |
Concept of Operations (CONOPS) for foreign language and speech translation technologies in a coalition military environmentMarshall, Susan LaVonne 03 1900 (has links)
Approved for public release, distribution is unlimited / This thesis presents Concept of Operations (CONOPS) for two specific automated language translation (ALT) devices, the P2 Phraselator and the Voice Response Translator (VRT). The CONOPS for each device are written as Appendix A and Appendix B respectively. The body of the thesis presents a broad introduction to the present state of ALT technology for the reader who is new to the general subject. It pursues this goal by introducing the human language translation problem followed by nine characteristic descriptors of ALT technology devices to provide a basic comparison framework of existing technologies. The premise is that ALT technology is presently in a state where it is tackled incrementally with various approaches. Two tables are provided that illustrate six commercially available devices using the descriptors. A scenario is then described in which the author observed the two subject ALT devices (depicted in the CONOPS in the Appendices) being employed within an international military exercise. Some unique human observations associated with the use of these devices in the exercise are discussed. A summary is provided of the Department of Defense (DOD) process that is exploring ALT technology devices, specifically the Language and Speech Exploitation Resources (LASER) Advanced Concept Technology Demonstration ACTD. / Lieutenant Commander, United States Navy
|
Page generated in 0.1055 seconds