71 |
Désambiguïsation lexicale de l'arabe pour et par la traduction automatique / Arabic word sense disambiguation for and by machine translation
Hadj salah, Marwa 18 December 2018
This thesis studies word sense disambiguation (WSD), a central task in natural language processing that can improve applications such as machine translation and information extraction. Research on WSD predominantly concerns English, because most other languages lack a standard lexical reference for annotating corpora, and also lack sense-annotated corpora for evaluating and, more importantly, for building WSD systems. In English, the lexical database WordNet is a long-standing de facto standard used in most sense-annotated corpora and in most WSD evaluation campaigns. Our contribution covers several areas. First, we present a method for automatically creating sense-annotated corpora for any language, taking advantage of the large amount of English corpora annotated with WordNet senses and using a machine translation system. This method is applied to Arabic and evaluated on what is, to our knowledge, the only Arabic corpus manually annotated with WordNet senses: the Arabic OntoNotes 5.0, which we have semi-automatically enriched. The evaluation is carried out by implementing two supervised WSD systems (SVM, LSTM) trained on the corpora produced with our method. We thus propose a solid baseline for the evaluation of future Arabic WSD systems, in addition to the sense-annotated Arabic corpora that we provide as a freely available resource. Second, we propose an in vivo evaluation of our Arabic WSD system by measuring its contribution to the performance of the machine translation task.
|
72 |
New data-driven approaches to text simplification
Štajner, Sanja January 2015
Many texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This motivated the need for text simplification (TS), which transforms texts into simpler variants. Given that this is still a relatively new research area, many challenges remain. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them. We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform existing similar systems. Our experiments in adapting those methods to different text genres and target populations report promising results, thus offering one possible solution to the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They indicate more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS. It requires neither a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction.
The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content. Finally, this thesis addresses another important issue in TS: how to automatically evaluate the performance of TS systems when access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when enriched with human evaluation of grammaticality and preservation of meaning.
|
73 |
Feature Selection for Factored Phrase-Based Machine Translation
Tamchyna, Aleš January 2012
In the presented work we investigate factored models for machine translation. We provide a thorough theoretical description of this machine translation paradigm. We describe a method for evaluating the complexity of factored models and verify its usefulness in practice. We present a software tool for automatically creating machine translation experiments and searching the space of possible configurations. In the experimental part of the work we verify our analyses and give some insight into the potential of factored systems. We indicate some possible directions that lead to improved translation quality; however, we conclude that it is not possible to explore these options in a fully automatic way.
|
74 |
Indução de léxicos bilíngües e regras para a tradução automática / Induction of translation lexicons and transfer rules for machine translation
Helena de Medeiros Caseli 21 May 2007
Machine Translation (MT), the translation of a natural (source) language into another (target) by means of computer programs, is a hard task, mainly because building resources such as translation grammars and bilingual dictionaries requires deep linguistic knowledge of the two (or more) languages involved. The scarcity of linguistic resources, and even the difficulty of building them, often limits MT systems, for example by restricting them to certain application domains. In this context, several methods have been proposed that aim to generate linguistic knowledge automatically from multilingual resources, so that building translation tools becomes less laborious. The ReTraTos project presented in this document is one of these proposals. It aims at inducing translation lexicons and transfer rules automatically from PoS-tagged and lexically aligned parallel corpora for the Portuguese--Spanish and Portuguese--English language pairs. The rule induction system introduces a new approach in which translation examples are split into alignment blocks and induction is performed for each type of block separately. Another new feature of this system is a more elaborate strategy for filtering the induced rules. Besides the translation lexicon and transfer rule induction systems, we also implemented an MT module for validating the induced resources. The induced translation lexicons were evaluated intrinsically, and the results agree with those reported in the literature. The induced translation rules were evaluated directly and indirectly by means of the MT module, and they improved on word-by-word translation in both directions (source--target and target--source) for the languages under study.
The target sentences obtained with the induced resources were also compared to those generated by commercial systems, showing better results for Portuguese--Spanish than for Portuguese--English.
|
75 |
On the application of focused crawling for statistical machine translation domain adaptation
Laranjeira, Bruno Rezende January 2015
Statistical Machine Translation (SMT) is highly dependent on the availability of parallel corpora for training. However, this kind of resource can be hard to find, especially when dealing with under-resourced languages or very specific domains such as dermatology. One way to work around this is to use comparable corpora, which are far more abundant. One way of acquiring comparable corpora is to apply Focused Crawling (FC) algorithms. In this work we propose novel approaches to FC, some based on n-grams and others on the expressive power of multiword expressions.
We also assess the viability of using FC to perform domain adaptation for generic SMT systems, and whether there is a correlation between the quality of the FC algorithms and that of the SMT systems that can be built from the data they collect. Results indicate that FC algorithms can be a good way of acquiring comparable corpora for SMT domain adaptation and that there is a correlation between the quality of the two processes.
|
76 |
Feasible lexical selection for rule-based machine translation / Selecció lèxica factible per a la traducció automàtica basada en regles
Tyers, Francis M. 17 July 2013
No description available.
|
77 |
Utilité et utilisation de la traduction automatique dans l’environnement de traduction : une évaluation axée sur les traducteurs professionnels
Rémillard, Judith 19 June 2018
The arrival of machine translation (MT) is disrupting practices in the translation industry and thereby raises questions about the usefulness and use of this technology. Since many studies have already examined its use in contexts where it is imposed on translators, we chose to adopt the particular perspective of translators to examine its usefulness and its voluntary use in the translation environment (TE). Our research sought to answer three main questions: Do translators use MT in their practice? Do translators believe that the output is useful and usable? Do translators actually use this output in the translation process?
To answer these questions, we first distributed a large-scale survey to measure the use of MT as a tool, to collect data on the respondents' profiles, and to assess their perception of usefulness with respect to the output and to the various types of phenomena that we had identified beforehand with the help of two professional translators. We then conducted an experiment with other professional translators in which we asked them to translate short segments and examined whether or not they used the MT output to produce their translation. Our analysis was based on the principle that, in a context of voluntary use, the use of an output makes it possible to infer a perception of usefulness and thereby to examine the usefulness of MT. Overall, we found that MT is usually not used voluntarily, that translators' perceptions are not very favourable to such use, and that their perception of the usefulness of the output is also rather negative, but that the output appears to be much more useful and usable than translators might believe, since they generally used it in the translation process. We also examined the factors that determine the usefulness and use of MT and of its output.
|
78 |
Indonésko-anglický neuronový strojový překlad / Indonesian-English Neural Machine Translation
Dwiastuti, Meisyarah January 2019
Title: Indonesian-English Neural Machine Translation Author: Meisyarah Dwiastuti Department: Institute of Formal and Applied Linguistics Supervisor: Mgr. Martin Popel, Ph.D., Institute of Formal and Applied Linguistics Abstract: In this thesis, we conduct a study on neural machine translation (NMT) for an under-studied language, Indonesian, specifically for English-Indonesian (EN-ID) and Indonesian-English (ID-EN) in a low-resource domain, TED talks. Our goal is to implement domain adaptation methods to improve the low-resource EN-ID and ID-EN NMT systems. First, we implement a model fine-tuning method for EN-ID and ID-EN NMT systems by leveraging a large parallel corpus containing movie subtitles. Our analysis shows the benefit of this method for the improvement of both systems. Second, we improve our ID-EN NMT system by leveraging English monolingual corpora through back-translation. Our back-translation experiments focus on how to incorporate the back-translated monolingual corpora into the training set, in which we investigate various existing training regimes and introduce a novel 4-way-concat training regime. We also analyze the effect of fine-tuning our back-translation models with different scenarios. Experimental results show that our method of implementing back-translation followed by model...
|
79 |
Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing
Tiedemann, Jörg January 2003
The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing.
Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup.
Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques.
A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb).
Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.
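As a rough illustration of the "knowledge-poor" association-measure idea mentioned in the abstract above, the following sketch scores word pairs in a sentence-aligned bitext with the Dice coefficient and keeps the best-scoring target word for each source word. The abstract does not say which association measures or heuristics UWA uses, so the Dice coefficient, the function names `dice_scores` and `best_links`, and the input layout are assumptions made for illustration only, not the thesis's implementation.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Score every co-occurring (source, target) word pair with the Dice coefficient.

    bitext: list of (source_tokens, target_tokens) sentence pairs (hypothetical layout).
    Counts are per sentence pair, so a word repeated in one sentence counts once.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_types, tgt_types = set(src), set(tgt)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    # Dice(s, t) = 2 * cooc(s, t) / (freq(s) + freq(t))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_links(bitext):
    """For each source word, keep the target word with the highest Dice score."""
    best = {}
    for (s, t), score in dice_scores(bitext).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best
```

A greedy best-link heuristic like this is only the simplest possible use of the scores; an iterative aligner would typically remove confident links and re-estimate the remaining ones.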
|
80 |
Text Harmonization Strategies for Phrase-Based Statistical Machine Translation
Stymne, Sara January 2012
In this thesis I aim to improve phrase-based statistical machine translation (PBSMT) in a number of ways by the use of text harmonization strategies. PBSMT systems are built by training statistical models on large corpora of human translations. This architecture generally performs well for languages with similar structure. If the languages differ, for example with respect to word order or morphological complexity, the standard methods tend not to work well. I address this problem through text harmonization, by making texts more similar before training and applying a PBSMT system. I investigate how text harmonization can be used to improve PBSMT with a focus on four areas: compounding, definiteness, word order, and unknown words. For the first three areas, the focus is on linguistic differences between languages, which I address by applying transformation rules, using either rule-based or machine learning-based techniques, to the source or target data. For the last area, unknown words, I harmonize the translation input to the training data by replacing unknown words with known alternatives. I show that translation into languages with closed compounds can be improved by splitting and merging compounds. I develop new merging algorithms that outperform previously suggested algorithms and show how part-of-speech tags can be used to improve the order of compound parts. Scandinavian definite noun phrases are identified as a problem for PBSMT in translation into Scandinavian languages, and I propose a preprocessing approach that addresses this problem and gives large improvements over a baseline. Several previous proposals for how to handle differences in reordering exist; I propose two types of extensions, iterating reordering and word alignment and using automatically induced word classes, which allow these methods to be used for less-resourced languages.
Finally, I identify several ways of replacing unknown words in the translation input, most notably a spell checking-inspired algorithm, which can be trained using character-based PBSMT techniques. Overall I present several approaches for extending PBSMT by the use of pre- and postprocessing techniques for text harmonization, and show experimentally that these methods work. Text harmonization methods are an efficient way to improve statistical machine translation within the phrase-based approach, without resorting to more complex models.
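The idea of harmonizing translation input by replacing unknown words with known alternatives can be sketched as follows. This is a deliberately simplified stand-in: it uses string similarity from Python's standard `difflib` module rather than the character-based PBSMT model described in the abstract, and the function name `replace_unknown`, the `cutoff` value, and the toy vocabulary are assumptions for illustration.

```python
import difflib


def replace_unknown(tokens, vocabulary, cutoff=0.8):
    """Replace out-of-vocabulary tokens with their closest in-vocabulary word.

    tokens: list of input words to be translated.
    vocabulary: set of words seen in the training data (hypothetical).
    cutoff: minimum similarity ratio for accepting a replacement; tokens with
            no sufficiently similar known word are left unchanged.
    """
    candidates = sorted(vocabulary)  # fixed order keeps the result deterministic
    harmonized = []
    for token in tokens:
        if token in vocabulary:
            harmonized.append(token)
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        harmonized.append(matches[0] if matches else token)
    return harmonized
```

Leaving a token untouched when nothing is similar enough is a conservative choice: a wrong replacement can change the meaning of the input, whereas an unknown word is usually just passed through by the translation system.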
|