11 |
Entity-based coherence in statistical machine translation : a modelling and evaluation perspective
Wetzel, Dominikus Emanuel, January 2018
Natural language documents exhibit coherence and cohesion by means of interrelated structures both within and across sentences. Sentences do not stand in isolation from each other and only a coherent structure makes them understandable and sound natural to humans. In Statistical Machine Translation (SMT) little research exists on translating a document from a source language into a coherent document in the target language. The dominant paradigm is still one that considers sentences independently from each other. There is a need both for a deeper understanding of how to handle specific discourse phenomena and for automatic evaluation of how well these phenomena are handled in SMT. In this thesis we explore how to treat sentences as dependent on each other by focussing on the problem of pronoun translation as an instance of a discourse-related non-local phenomenon. We direct our attention to pronoun translation in the form of cross-lingual pronoun prediction (CLPP) and develop a model to tackle this problem. We obtain state-of-the-art results exhibiting the benefit of having access to the antecedent of a pronoun for predicting the right translation of that pronoun. Experiments also showed that features from the target side are more informative than features from the source side, confirming linguistic knowledge that referential pronouns need to agree in gender and number with their target-side antecedent. We show our approach to be applicable across the two language pairs English-French and English-German. The experimental setting for CLPP is artificially restricted, both to enable automatic evaluation and to provide a controlled environment. This is a limitation which does not yet allow us to test the full potential of CLPP systems within a more realistic setting that is closer to a full SMT scenario. We provide an annotation scheme, a tool and a corpus that enable evaluation of pronoun prediction in a more realistic setting.
The annotated corpus consists of parallel documents translated by a state-of-the-art neural machine translation (NMT) system, where the appropriate target-side pronouns have been chosen by annotators. With this corpus, we exhibit a weakness of our current CLPP systems in that they are outperformed by a state-of-the-art NMT system in this more realistic context. This corpus provides a basis for future CLPP shared tasks and allows the research community to further understand and test their methods. The lack of appropriate evaluation metrics that explicitly capture non-local phenomena is one of the main reasons why handling non-local phenomena has not yet been widely adopted in SMT. To overcome this obstacle and evaluate the coherence of translated documents, we define a bilingual model of entity-based coherence, inspired by work on monolingual coherence modelling, and frame it as a learning-to-rank problem. We first evaluate this model on a corpus where we artificially introduce coherence errors based on typical errors CLPP systems make. This allows us to assess the quality of the model in a controlled environment with automatically provided gold coherence rankings. Results show that this model can distinguish with high accuracy between a human-authored translation and one with coherence errors, that it can also distinguish between document pairs from two corpora with different degrees of coherence errors, and that the learnt model can be successfully applied when the error distribution of the test set differs from that of the training data, showing its potential to generalize. To test our bilingual model of coherence as a discourse-aware SMT evaluation metric, we apply it to more realistic data. We use it to evaluate a state-of-the-art NMT system against post-editing systems with pronouns corrected by our CLPP systems.
To verify our metric, we reuse our annotated parallel corpus and consider the pronoun annotations as a proxy for human document-level coherence judgements. Experiments show far lower accuracy in ranking translations according to their entity-based coherence than on the artificial corpus, suggesting that the metric has difficulties generalizing to a more realistic setting. Analysis reveals that the system translations in our test corpus do not differ in their pronoun translations in almost half of the document pairs. To circumvent this data sparsity issue, and to remove the need for parameter learning, we define a score-based SMT evaluation metric which directly uses features from our bilingual coherence model.
|
12 |
Évaluation de la production de quatre systèmes traduction automatique
Yen, Christine, 03 December 2013
This thesis aims to contribute to the improvement of online machine translation software. We identify errors in the process of translation between English and French and make recommendations. The systems evaluated are Promt, Babylon, Google Translate and Bing and the reference corpus is taken from BankGloss. Promt made the most errors, followed by Babylon, Bing and Google. The systems together produced a total of 147 grammatical errors, 74 semantic errors, 17 lexical errors, and 6 stylistic errors. To improve Promt, we suggest expanding its dictionary. For Babylon, we advise adding more grammar rules. In order to reduce the number of semantic errors in Bing and Google, the software should learn to identify words according to context. Machine translation is not an end in itself, but a good aid in accomplishing translation tasks.
|
13 |
LFG-DOT : a hybrid architecture for robust MT
Way, Andrew, January 2001
No description available.
|
14 |
Využití větné struktury v neuronovém strojovém překladu
Pham, Thuong-Hai, January 2018
Neural machine translation has been lately established as the new state of the art in machine translation, especially with the Transformer model. This model emphasized the importance of self-attention mechanism and suggested that it could capture some linguistic phenomena. However, this claim has not been examined thoroughly, so we propose two main groups of methods to examine the relation between these two. Our methods aim to improve the translation performance by directly manipulating the self-attention layer. The first group focuses on enriching the encoder with source-side syntax with tree-related position embeddings or our novel specialized attention heads. The second group is a joint translation and parsing model leveraging self-attention weight for the parsing task. It is clear from the results that enriching the Transformer with sentence structure can help. More importantly, the Transformer model is in fact able to capture this type of linguistic information with guidance in the context of multi-task learning at nearly no increase in training costs.
|
15 |
Incorporating pronoun function into statistical machine translation
Guillou, Liane Kirsten, January 2016
Pronouns are used frequently in language, and perform a range of functions. Some pronouns are used to express coreference, and others are not. Languages and genres differ in how and when they use pronouns and this poses a problem for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014; Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric) pronouns with NP antecedents, which when translated from English into a language with grammatical gender, must agree with the translation of the head of the antecedent. Despite growing attention to this problem, little progress has been made, and little attention has been given to other pronouns. The central claim of this thesis is that pronouns performing different functions in text should be handled differently by SMT systems and when evaluating pronoun translation. This motivates the introduction of a new framework to categorise pronouns according to their function: Anaphoric/cataphoric reference, event reference, extra-textual reference, pleonastic, addressee reference, speaker reference, generic reference, or other function. Labelling pronouns according to their function also helps to resolve instances of functional ambiguity arising from the same pronoun in the source language having multiple functions, each with different translation requirements in the target language. The categorisation framework is used in corpus annotation, corpus analysis, SMT system development and evaluation. I have directed the annotation and conducted analyses of a parallel corpus of English-German texts called ParCor (Guillou et al., 2014), in which pronouns are manually annotated according to their function. This provides a first step toward understanding the problems that SMT systems face when translating pronouns. 
In the thesis, I show how analysis of manual translation can prove useful in identifying and understanding systematic differences in pronoun use between two languages and can help inform the design of SMT systems. In particular, the analysis revealed that the German translations in ParCor contain more anaphoric and pleonastic pronouns than their English originals, reflecting differences in pronoun use. This raises a particular problem for the evaluation of pronoun translation. Automatic evaluation methods that rely on reference translations to assess pronoun translation will not be able to provide an adequate evaluation when the reference translation departs from the original source-language text. I also show how analysis of the output of state-of-the-art SMT systems can reveal how well current systems perform in translating different types of pronouns and indicate where future efforts would be best directed. The analysis revealed that biases in the training data, for example arising from the use of “it” and “es” as both anaphoric and pleonastic pronouns in both English and German, are a problem that SMT systems must overcome. SMT systems also need to disambiguate the function of those pronouns with ambiguous surface forms so that each pronoun may be translated in an appropriate way. To demonstrate the value of this work, I have developed an automated post-editing system in which automated tools are used to construct ParCor-style annotations over the source-language pronouns. The annotations are then used to resolve functional ambiguity for the pronoun “it” with separate rules applied to the output of a baseline SMT system for anaphoric vs. non-anaphoric instances. The system was submitted to the DiscoMT 2015 shared task on pronoun translation for English-French. As with all other participating systems, the automatic post-editing system failed to beat a simple phrase-based baseline.
A detailed analysis, including an oracle experiment in which manual annotation replaces the automated tools, was conducted to discover the causes of poor system performance. The analysis revealed that the design of the rules and their strict application to the SMT output are the biggest factors in the failure of the system. The lack of automatic evaluation metrics for pronoun translation is a limiting factor in SMT system development. To alleviate this problem, Christian Hardmeier and I have developed a testing regimen called PROTEST comprising (1) a hand-selected set of pronoun tokens categorised according to the different problems that SMT systems face and (2) an automated evaluation script. Pronoun translations can then be automatically compared against a reference translation, with mismatches referred for manual evaluation. The automatic evaluation was applied to the output of systems submitted to the DiscoMT 2015 shared task on pronoun translation. This again highlighted the weakness of the post-editing system, which performs poorly due to its focus on producing gendered pronoun translations, and its inability to distinguish between pleonastic and event reference pronouns.
|
16 |
English to ASL Gloss Machine Translation
Bonham, Mary Elizabeth, 01 June 2015
Low-resource languages, including sign languages, are a challenge for machine translation research. Given the lack of parallel corpora, current researchers must be content with a small parallel corpus in a narrow domain for training a system. For this thesis, we obtained a small parallel corpus of English text and American Sign Language gloss from The Church of Jesus Christ of Latter-day Saints. We cleaned the corpus by loading it into an open-source translation memory tool, where we removed computer markup language and split the large chunks of text into sentences and phrases, creating a total of 14,247 sentence pairs. We randomly partitioned the corpus into three sections: 70% for a training set, 10% for a development set, and 20% for a test set. After downloading and installing the open-source Moses toolkit, we went through several iterations of training, translating, and evaluating the system. The final evaluation on unseen data yielded a state-of-the-art score for a low-resource language.
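The random 70/10/20 partition described in this abstract can be illustrated with a short sketch. This is not the author's procedure: the function name `partition_corpus`, the fixed seed, the rounding behaviour and the placeholder sentence pairs are all assumptions made for illustration; only the split fractions and the corpus size of 14,247 pairs come from the abstract.

```python
import random


def partition_corpus(pairs, train_frac=0.7, dev_frac=0.1, seed=0):
    """Randomly split sentence pairs into train/dev/test lists.

    Hypothetical helper: the test set receives whatever remains after
    the train and dev portions are taken, so no pair is lost to rounding.
    """
    shuffled = list(pairs)                      # copy, leave the input untouched
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Placeholder data standing in for the English / ASL gloss pairs.
pairs = [(f"english sentence {i}", f"GLOSS {i}") for i in range(14247)]
train, dev, test = partition_corpus(pairs)
print(len(train), len(dev), len(test))
```

Shuffling before slicing keeps the three sets disjoint, and fixing the seed means the same held-out test set is used across training iterations.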
|
17 |
Machine Translation For Machines
Tebbifakhr, Amirhossein, 25 October 2021
Traditionally, Machine Translation (MT) systems are developed by targeting fluency (i.e. output grammaticality) and adequacy (i.e. semantic equivalence with the source text) criteria that reflect the needs of human end-users. However, recent advancements in Natural Language Processing (NLP) and the introduction of NLP tools in commercial services have opened new opportunities for MT. A particularly relevant one is related to the application of NLP technologies in low-resource language settings, for which the paucity of training data limits the ability to train reliable services. In this specific condition, MT can come into play by enabling so-called “translation-based” workarounds. The idea is simple: first, input texts in the low-resource language are translated into a resource-rich target language; then, the machine-translated text is processed by well-trained NLP tools in the target language; finally, the output of these downstream components is projected back to the source language. This results in a new scenario, in which the end-user of MT technology is no longer a human but another machine. We hypothesize that current MT training approaches are not the optimal ones for this setting, in which the objective is to maximize the performance of a downstream tool fed with machine-translated text rather than human comprehension. Under this hypothesis, this thesis introduces a new research paradigm, which we named “MT for machines”, addressing a number of questions that arise from this novel view of the MT problem. Are there different quality criteria for humans and machines? What makes a good translation from the machine standpoint? What are the trade-offs between the two notions of quality? How to pursue machine-oriented objectives? How to serve different downstream components with a single MT system? How to exploit knowledge transfer to operate in different language settings with a single MT system?
Elaborating on these questions, this thesis: i) introduces a novel and challenging MT paradigm, ii) proposes an effective method based on Reinforcement Learning analysing its possible variants, iii) extends the proposed method to multitask and multilingual settings so as to serve different downstream applications and languages with a single MT system, iv) studies the trade-off between machine-oriented and human-oriented criteria, and v) discusses the successful application of the approach in two real-world scenarios.
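The three-step “translation-based” workaround described in this abstract (translate, process with a downstream tool, project back) can be sketched as a small pipeline. Everything below is a toy stand-in and not the thesis's system: the lexicon-lookup "translator", the keyword sentiment classifier, the identity projection, and all function names are hypothetical, chosen only to make the data flow visible.

```python
def translation_based_pipeline(text, translate, analyze, project_back):
    """Run a source-language text through the translate/analyze/project steps."""
    translated = translate(text)        # step 1: low-resource -> resource-rich language
    result = analyze(translated)        # step 2: downstream NLP tool on the MT output
    return project_back(result)         # step 3: map the result back to the source side


# Toy stand-in for an MT system: word-by-word lexicon lookup (assumption).
TOY_LEXICON = {"buono": "good", "cattivo": "bad", "film": "movie"}


def toy_translate(text):
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


# Toy stand-in for a well-trained downstream tool: keyword sentiment (assumption).
def toy_sentiment(text):
    words = text.split()
    if "good" in words:
        return "positive"
    if "bad" in words:
        return "negative"
    return "neutral"


# For a document-level label, projecting back is simply the identity.
def identity_projection(label):
    return label


label = translation_based_pipeline(
    "Film buono", toy_translate, toy_sentiment, identity_projection
)
print(label)
```

In this framing the quality of `translate` is judged by how well `analyze` performs on its output, which is the machine-oriented criterion the abstract contrasts with human-oriented fluency and adequacy.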
|
18 |
Gender Bias in Automatic Translation
Savoldi, Beatrice, 30 June 2023
Automatic translation tools have facilitated navigating multilingual contexts, by providing accessible shortcuts for gathering, processing, and spreading information. As language technologies become more widely used and deployed on a large scale, however, their societal impact has sparked concern both within and outside the research community.
This thesis addresses gender bias affecting Machine Translation (MT) and Speech Translation (ST) models. It contributes to this pressing area of research with an interdisciplinary perspective, to raise awareness of bias, improve the understanding of the phenomenon, and investigate best practices and methods to unveil and mitigate it in translation systems.
|
19 |
Automatic subtitling: A new paradigm
Karakanta, Alina, 11 November 2022
Audiovisual Translation (AVT) is a field where Machine Translation (MT) has long found limited success mainly due to the multimodal nature of the source and the formal requirements of the target text. Subtitling is the predominant AVT type, quickly and easily providing access to the vast amounts of audiovisual content becoming available daily. Automation in subtitling has so far focused on MT systems which translate source language subtitles, already transcribed and timed by humans. With recent developments in speech translation (ST), the time is ripe for extended automation in subtitling, with end-to-end solutions for obtaining target language subtitles directly from the source speech. In this thesis, we address the key steps for accomplishing the new paradigm of automatic subtitling: data, models and evaluation. First, we address the lack of representative data by compiling MuST-Cinema, a speech-to-subtitles corpus. Segmenter models trained on MuST-Cinema accurately split sentences into subtitles, and enable automatic data augmentation techniques. Having representative data at hand, we move to developing direct ST models for three scenarios: offline subtitling, dual subtitling, live subtitling. Lastly, we propose methods for evaluating subtitle-specific aspects, such as metrics for subtitle segmentation, a product- and process-based exploration of the effect of spotting changes in the subtitle post-editing process, and finally, a comprehensive survey on subtitlers' user experience and views on automatic subtitling. Our findings show the potential of speech technologies for extending automation in subtitling to provide multilingual access to information and communication.
|
20 |
From Bible to Babel Fish: The Evolution of Translation and Translation Theory
Settle, Lori Louise, 20 May 2004
Translation, the transfer of the written word from one language to another, has a long history, and many important scholars have helped shape its perceptions, accepted processes, and theories. Machine translation, translation by computer software requiring little or no human input, is the latest movement in the translation field, a possible way for the profession to keep abreast of the enormous demand for scientific, business, and technical translations. This study examines MT by placing it in a historical context — first exploring the history of translation and translation theory, then following that explanation with one of machine translation, its problems, and its potential. / Master of Arts
|