Global ETD Search

11	Entity-based coherence in statistical machine translation : a modelling and evaluation perspective Wetzel, Dominikus Emanuel January 2018 Natural language documents exhibit coherence and cohesion by means of interrelated structures both within and across sentences. Sentences do not stand in isolation from each other, and only a coherent structure makes them understandable and natural-sounding to humans. In Statistical Machine Translation (SMT), little research exists on translating a document from a source language into a coherent document in the target language. The dominant paradigm is still one that considers sentences independently of each other. There is a need both for a deeper understanding of how to handle specific discourse phenomena and for automatic evaluation of how well these phenomena are handled in SMT. In this thesis we explore how to treat sentences as dependent on each other by focussing on the problem of pronoun translation as an instance of a discourse-related non-local phenomenon. We direct our attention to pronoun translation in the form of cross-lingual pronoun prediction (CLPP) and develop a model to tackle this problem. We obtain state-of-the-art results that demonstrate the benefit of having access to the antecedent of a pronoun for predicting the right translation of that pronoun. Experiments also show that features from the target side are more informative than features from the source side, confirming linguistic knowledge that referential pronouns need to agree in gender and number with their target-side antecedent. We show our approach to be applicable across the two language pairs English-French and English-German. The experimental setting for CLPP is artificially restricted, both to enable automatic evaluation and to provide a controlled environment. 
This limitation does not yet allow us to test the full potential of CLPP systems in a more realistic setting that is closer to a full SMT scenario. We provide an annotation scheme, a tool and a corpus that enable evaluation of pronoun prediction in a more realistic setting. The annotated corpus consists of parallel documents translated by a state-of-the-art neural machine translation (NMT) system, where the appropriate target-side pronouns have been chosen by annotators. With this corpus, we expose a weakness of our current CLPP systems: they are outperformed by a state-of-the-art NMT system in this more realistic context. This corpus provides a basis for future CLPP shared tasks and allows the research community to further understand and test their methods. The lack of appropriate evaluation metrics that explicitly capture non-local phenomena is one of the main reasons why handling non-local phenomena has not yet been widely adopted in SMT. To overcome this obstacle and evaluate the coherence of translated documents, we define a bilingual model of entity-based coherence, inspired by work on monolingual coherence modelling, and frame it as a learning-to-rank problem. We first evaluate this model on a corpus where we artificially introduce coherence errors based on typical errors CLPP systems make. This allows us to assess the quality of the model in a controlled environment with automatically provided gold coherence rankings. Results show that this model can distinguish with high accuracy between a human-authored translation and one with coherence errors, that it can also distinguish between document pairs from two corpora with different degrees of coherence errors, and that the learnt model can be successfully applied when the distribution of errors in the test set differs from that in the training data, showing its potential to generalize. 
To test our bilingual model of coherence as a discourse-aware SMT evaluation metric, we apply it to more realistic data. We use it to evaluate a state-of-the-art NMT system against post-editing systems with pronouns corrected by our CLPP systems. To verify our metric, we reuse our annotated parallel corpus and consider the pronoun annotations as a proxy for human document-level coherence judgements. Experiments show far lower accuracy in ranking translations according to their entity-based coherence than on the artificial corpus, suggesting that the metric has difficulties generalizing to a more realistic setting. Analysis reveals that the system translations in our test corpus do not differ in their pronoun translations in almost half of the document pairs. To circumvent this data sparsity issue, and to remove the need for parameter learning, we define a score-based SMT evaluation metric which directly uses features from our bilingual coherence model.
12	Factored neural machine translation / Traduction automatique neuronale factorisée García Martínez, Mercedes 27 March 2018 Communication between humans across cultures is difficult due to the diversity of languages. Machine translation is a quick and cheap way to make translation accessible to everyone. Recently, Neural Machine Translation (NMT) has achieved impressive results. This thesis focuses on the Factored Neural Machine Translation (FNMT) approach, which is founded on the idea of using the morphological and grammatical decomposition of words (lemmas and linguistic factors) in the target language. This architecture addresses two well-known challenges in NMT. The first is the limit on the target vocabulary size, a consequence of the computationally expensive softmax function at the output layer of the network, which leads to a high rate of unknown words. The second is the data sparsity that arises when facing a specific domain or a morphologically rich language. With FNMT, all inflections of words are supported and a larger vocabulary is modelled at similar computational cost. Moreover, new words not included in the training dataset can be generated. In this work, I developed different FNMT architectures using various dependencies between lemmas and factors. In addition, I enhanced the source language side with factors as well. The FNMT model is evaluated on various languages, including morphologically rich ones. State-of-the-art models, some using Byte Pair Encoding (BPE), are compared to the FNMT model using small and large training datasets. We found that factored models are more robust in low-resource conditions. FNMT has been combined with BPE units, performing better than the pure FNMT model when trained on large data. We experimented with different domains and obtained improvements with the FNMT models. Furthermore, the morphology of the translations is measured using a special test suite, showing the importance of explicitly modelling the target morphology. Our work shows the benefits of applying linguistic factors in NMT. Neural machine translation Factored models Deep learning Neural networks Machine translation 006.35
13	Exploring Contextual Information in Neural Machine Translation Jon, Josef January 2019 This thesis deals with incorporating inter-sentence context into neural machine translation (NMT). Today's common NMT systems translate one source sentence into one target sentence, without any regard for the surrounding text. This approach is insufficient and does not correspond to the way human translators work. For many language pairs, NMT output is now indistinguishable from human translation, provided that certain (strict) conditions are met. One of these conditions is that evaluators score the translated sentences independently, without knowledge of the context. When whole documents are evaluated, NMT output is still rated worse than human translation, even in cases where it was preferred at the level of individual sentences. These findings motivate research into incorporating document-level context in NMT, since it is possible that little room for improvement remains at the sentence level, at least for language pairs and domains rich in training data. This thesis summarizes current approaches to incorporating context into translation; several of them are implemented and evaluated both in terms of general translation quality and on the translation of specific context-related phenomena. To assess the quality of the individual systems, a test set for translation from English into Czech was created manually.
14	Strojový překlad s využitím syntaktické analýzy / Machine Translation Using Syntactic Analysis Popel, Martin January 2018 This thesis describes our improvement of machine translation (MT), with a special focus on the English-Czech language pair, but using techniques applicable also to other languages. First, we present multiple improvements to the deep-syntactic system TectoMT. For instance, we implemented a novel context-sensitive translation model, comparing several machine learning approaches. We also adapted TectoMT to other domains and languages. Second, we present Transformer, a state-of-the-art end-to-end neural MT system. We analyzed in detail the effect of several training hyper-parameters. With our optimized training, the system outperformed the best result on the WMT2017 test set by +1.0 BLEU. We further extended this system by using monolingual training data and a new type of backtranslation (+2.8 BLEU compared to the baseline system). In addition, we leveraged domain adaptation and the effect of "translationese" (i.e., which language in parallel data is the original and which is the translation) to optimize MT systems for original-language and translated-language data (gaining a further +0.2 BLEU). Our improved neural MT system significantly (p<0.05) outperformed all other systems in English-Czech and Czech-English WMT2018 shared tasks,...
15	Multimodalita ve strojovém překladu / Multimodality in Machine Translation Libovický, Jindřich January 2019 Traditionally, most natural language processing tasks are solved within the language, relying on distributional properties of words. The representation learning abilities of deep learning have recently allowed the use of additional information sources by grounding the representations in the visual modality. One of the tasks that attempt to exploit visual information is multimodal machine translation: translation of image captions with access to the original image. The thesis summarizes joint processing of language and real-world images using deep learning. It gives an overview of the state of the art in multimodal machine translation and describes our original contribution to solving this task. We introduce methods of combining multiple inputs of possibly different modalities in recurrent and self-attentive sequence-to-sequence models and show results on multimodal machine translation and other tasks related to machine translation. Finally, we analyze how multimodality influences the semantic properties of the sentence representation learned by the networks and how that relates to translation quality.
16	Exploring the Usage of Neural Networks for Repairing Static Analysis Warnings / Utforsking av användningen av neurala nätverk för att reparera varningar för statisk analys Lohse, Vincent Paul January 2021 C# provides static analysis libraries for template-based code analysis and code fixing. These libraries have been used by the open-source community to generate numerous NuGet packages for different use cases. However, due to the unstructured vastness of these packages, it is difficult to find the ones required for a project, and creating new analyzers and fixers takes time and effort. Therefore, this thesis proposes a neural network which firstly imitates existing fixers and secondly extrapolates to fixes of unseen diagnostics. To do so, the state of the art of static analysis NuGet packages is examined and further used to generate a dataset with diagnostics and corresponding code fixes for 24,622 data points. Since many C# fixers apply formatting changes, all formatting is preserved in the dataset. Furthermore, since the fixers also apply identifier changes, the tokenization of the dataset is varied between splitting identifiers by camelcase and preserving them. The neural network uses a sequence-to-sequence learning approach with the Transformer model; it takes file context, diagnostic message and location as input and predicts a diff as output. It is capable of imitating 46.3% of the fixes, normalized by diagnostic type, and for data points with unseen diagnostics, it is able to extrapolate to 11.9% of normalized data points. For both experiments, splitting identifiers by camelcase produces the best results. Lastly, it is found that a higher proportion of formatting tokens in the input has a minimal positive impact on prediction success rates, whereas the proportion of formatting in the output has no impact on success rates. Automatic Program Repair Neural Machine Translation Static Analysis Transformer Model Formatting Software Engineering
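The two tokenization variants mentioned in the abstract above (splitting identifiers by camelcase versus preserving them, with all formatting kept) can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual tokenizer; the function names `split_camel_case` and `tokenize` and the regular expressions below are assumptions made only for illustration.

```python
import re

# Hypothetical boundary rule (an assumption, not taken from the thesis):
# split before an uppercase letter that follows a lowercase letter or digit,
# and before the last capital of an acronym that is followed by lowercase.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_camel_case(identifier: str) -> list[str]:
    """Split an identifier such as 'parseXMLFile' into its camelcase parts."""
    return [part for part in _CAMEL_BOUNDARY.split(identifier) if part]


def tokenize(code: str, split_identifiers: bool) -> list[str]:
    """Tokenize code losslessly: whitespace runs are kept as tokens."""
    raw = re.findall(r"[A-Za-z_]\w*|\s+|\S", code)
    if not split_identifiers:
        return raw
    tokens: list[str] = []
    for tok in raw:
        if tok[0].isalpha() or tok[0] == "_":
            # Identifier or keyword: optionally break it into sub-tokens.
            tokens.extend(split_camel_case(tok))
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    line = "int fooBar = 1;"
    print(tokenize(line, split_identifiers=False))
    print(tokenize(line, split_identifiers=True))
```

Because whitespace runs are emitted as tokens, joining the token list reproduces the input exactly, which mirrors the abstract's point that formatting is preserved in the dataset.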
17	An Initial Investigation of Neural Decompilation for WebAssembly / En Första Undersökning av Neural Dekompilering för WebAssembly Benali, Adam January 2022 WebAssembly is a new standard of the World Wide Web that is used as a compilation target and is meant to enable high-performance applications in the browser. As it becomes more popular, the need for corresponding decompilers increases, for instance for security reasons. However, building an accurate decompiler capable of restoring the original source code is a complicated task. Recently, Neural Machine Translation (NMT) has been proposed as an alternative to traditional decompilers, which involve a lot of manual and laborious work. We investigate the viability of Neural Machine Translation for decompiling WebAssembly binaries to C source code. We experiment with the state-of-the-art transformer and LSTM sequence-to-sequence (Seq2Seq) models with attention. We build a custom, randomly generated dataset of WebAssembly to C pairs of source code and use different metrics to quantitatively evaluate the performance of the models. Our implementation consists of several processing steps that have the WebAssembly input and the C output as endpoints. The results show that the transformer outperforms the LSTM-based neural model. Moreover, while the model restores the syntax and control-flow structure with up to 95% accuracy, it is incapable of recovering the data flow. The different benchmarks on which we run our evaluation indicate a drop in decompilation accuracy as the cyclomatic complexity and the nesting of the programs increase. Nevertheless, our approach has a lot of potential, encouraging its use in future work. Decompilation WebAssembly Reverse Engineering Neural Machine Translation Computer Sciences
18	Utvärdering av domänanpassning i maskinöversättningssystem för användning inom MyScania / Evaluation of domain customization in machine translation systems for use in MyScania Olofsson, Martin, Larsson, Jesper January 2022 The primary purpose of this report is to examine how well machine translation systems can perform in relation to Scania's requirements. The examination is primarily aimed at investigating the systems' capability for domain customization and the effect this has on their machine translations. Evaluation is done partly using multiple automatic metrics that in different ways measure correlation to existing text content from various services in the MyScania platform, and partly manually by translators experienced with Scania's language usage. The results showed that domain customization using one's own data generally increases the quality of machine translations. It is also noted that how well the machine translations perform varies considerably with factors such as language. Google AutoML, however, performs best in all the tested languages. This was shown both in the evaluation using automatic metrics and in the manual evaluation. The investigation also showed weaknesses in automatic metrics when used on their own, but found that they can contribute meaningful insights when complemented by human assessment. The investigation confirms that human assessment should always be used when possible. machine translation neural machine translation domain customization automatic metrics manual evaluation Computer Sciences
19	Neural maskinöversättning av gawarbati / Neural machine translation for Gawarbati Gillholm, Katarina January 2023 Recent neural models have led to huge improvements in machine translation, but performance is still suboptimal for languages without large parallel datasets, so-called low-resource languages. Gawarbati is a small, threatened low-resource language for which only 5000 parallel sentences are available. This thesis uses transfer learning and hyperparameters optimized for small datasets to explore possibilities and limitations for neural machine translation from Gawarbati to English. Transfer learning, where the parent model was first trained on Hindi-English parallel data, improved results by 1.8 BLEU and 1.3 chrF. Hyperparameters optimized for small datasets increased BLEU by 0.6 but decreased chrF by 1. Combining transfer learning and hyperparameters optimized for small datasets led to a decrease in performance of 0.5 BLEU and 2.2 chrF. The neural models are compared with, and outperform, word-based statistical machine translation and GPT-3. The best-performing model achieved only 2.8 BLEU and 19 chrF, which illustrates the limitations of machine translation for low-resource languages and the critical need for more data. / VR 2020-01500 Machine translation neural machine translation NMT low resource language Gawarbati transfer learning GPT
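Since the abstract above reports results in chrF, a rough sketch of how a character n-gram F-score of this kind is computed may help in reading the numbers. This is a simplified illustration and not the scorer used in the thesis: whitespace handling, the n-gram range (1 to 6) and beta = 2 are assumptions, and established tools differ in such details, so scores from this sketch are not comparable with published ones.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matching n-gram count
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta = 2, recall is weighted more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf("the cat sat on the mat", "the cat sits on the mat"))
```

Because the score is built from character n-grams rather than whole words, a partly correct word still earns credit, which is one reason chrF is often reported alongside BLEU for small datasets.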
20	Spelling Normalization of English Student Writings HONG, Yuchan January 2018 Spelling normalization is the task of normalizing non-standard words into standard words in texts. It reduces the number of out-of-vocabulary (OOV) words in texts for natural language processing (NLP) tasks such as information retrieval, machine translation, and opinion mining, and so improves the performance of various NLP applications on normalized texts. In this thesis, we explore different methods for spelling normalization of English student writings, including traditional Levenshtein edit distance comparison, phonetic similarity comparison, character-based Statistical Machine Translation (SMT) and character-based Neural Machine Translation (NMT) methods. An important improvement in our implementation is an approach that combines the Levenshtein edit distance and phonetic similarity methods with added components for frequency count and compound splitting; it is evaluated as the best approach, with 0.329% accuracy improvement and 63.63% error reduction on the original unnormalized test set. spelling normalization English student writings phonetic similarity comparison Levenshtein edit distance General Language Studies and Linguistics
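As an illustration of the Levenshtein-based part of the approach described above, the following minimal sketch picks the closest lexicon word and breaks ties by frequency. It is not the thesis's implementation: the phonetic similarity and compound-splitting components are omitted, and the `normalize` helper, the toy lexicon and the `max_distance` threshold are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(word: str, lexicon: dict[str, int], max_distance: int = 2) -> str:
    """Return the closest lexicon word; ties go to the more frequent word.

    `lexicon` maps standard words to corpus frequencies (hypothetical data).
    """
    if word in lexicon or not lexicon:
        return word
    distance, _, best = min(
        (levenshtein(word, cand), -freq, cand) for cand, freq in lexicon.items()
    )
    return best if distance <= max_distance else word


if __name__ == "__main__":
    toy_lexicon = {"receive": 50, "recipe": 20}
    print(normalize("recieve", toy_lexicon))
```

Here "recieve" is two edits away from both "receive" and "recipe", so the frequency count decides between them, which hints at why the abstract lists frequency as an added component.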
