211 |
Swedish-English Verb Frame Divergences in a Bilingual Head-driven Phrase Structure Grammar for Machine Translation / Skillnader i verbramar mellan svenska och engelska i en tvåspråkig HPSG-grammatik för maskinöversättning
Stymne, Sara January 2006
<p>In this thesis I have investigated verb frame divergences in a bilingual Head-driven Phrase Structure Grammar for machine translation. The purpose was threefold: (1) to describe and classify verb frame divergences (VFDs) between Swedish and English, (2) to implement a bilingual grammar covering many of the identified VFDs, and (3) to find out which cases of VFDs could be solved and implemented using a common semantic representation, or interlingua, for Swedish and English.</p><p>The implemented grammar, BiTSE, is a Head-driven Phrase Structure Grammar based on the LinGO Grammar Matrix, a language-independent grammar base. BiTSE is a bilingual grammar containing both Swedish and English. The semantic representation used is Minimal Recursion Semantics (MRS). It is language independent, so generating from it gives all equivalent sentences in both Swedish and English. Both the core of the languages and a subset of the identified VFDs are successfully implemented in BiTSE. For other VFDs, tentative solutions are discussed.</p><p>MRS has previously been proposed as suitable for semantic transfer machine translation. I have shown that VFDs can naturally be handled by an interlingual design in many cases, minimizing the need for transfer.</p><p>The main contributions of this thesis are: an inventory of English and Swedish verb frames and verb frame divergences; the bilingual grammar BiTSE; and a demonstration that it is possible in many cases to use MRS as an interlingua in machine translation.</p>
|
212 |
Continuous space models with neural networks in natural language processing
Le, Hai Son 20 December 2012
The purpose of language models is, in general, to capture and model regularities of language, and thereby the morphological, syntactic and distributional properties of word sequences in a given language. They play an important role in many successful applications of Natural Language Processing, such as Automatic Speech Recognition, Machine Translation and Information Extraction. The most successful approaches to date are based on the n-gram assumption and on adjusting statistics from the training data with smoothing and back-off techniques, notably the Kneser-Ney technique, introduced twenty years ago. In this way, language models predict a word based on its n-1 previous words. In spite of their prevalence, conventional n-gram based language models still suffer from several limitations that could intuitively be overcome by consulting human expert knowledge. One critical limitation is that, ignoring all linguistic properties, they treat each word as a discrete symbol with no relation to the others. Another is that, even with a huge amount of data, data sparsity always has an important impact, so the optimal value of n in the n-gram assumption is often 4 or 5, which is insufficient in practice. This kind of model is built from the counts of n-grams in the training data. The pertinence of these models is therefore conditioned only on the characteristics of the training text (its quantity, and how well it represents the content in terms of theme and date). Recently, one of the most successful attempts to learn word similarities directly has been to use distributed word representations in language modeling, where distributionally similar words, which have semantic and syntactic similarities, are expected to be represented as neighbors in a continuous space. These representations and the associated objective function (the likelihood of the training data) are jointly learned using a multi-layer neural network architecture.
In this way, word similarities are learned automatically. This approach has shown significant and consistent improvements when applied to automatic speech recognition and statistical machine translation tasks. A major difficulty with the continuous space neural network based approach remains the computational burden, which does not scale well to the massive corpora that are nowadays available. For this reason, the first contribution of this dissertation is the definition of a neural architecture based on a tree representation of the output vocabulary, namely the Structured OUtput Layer (SOUL), which makes such models well suited to large-scale frameworks. The SOUL model combines the neural network approach with the class-based approach. It achieves significant improvements on both state-of-the-art large-scale automatic speech recognition and statistical machine translation tasks. The second contribution is to provide several insightful analyses of the models' performance, their pros and cons, and their induced word space representation. Finally, the third contribution is the successful adoption of the continuous space neural network in a machine translation framework. New translation models are proposed and reported to achieve significant improvements over state-of-the-art baseline systems.
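The idea of combining a neural output layer with a class-based approach can be illustrated with a two-level factorization, P(word | history) = P(class | history) · P(word | class, history), so that each prediction normalizes over a handful of classes and then over one class's members rather than over the whole vocabulary. The sketch below is only a simplified illustration under stated assumptions: the toy vocabulary, the class assignment and the hard-coded scores (standing in for network outputs) are hypothetical, and the actual SOUL architecture uses a deeper tree than the single class level shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary partitioned into word classes.
CLASSES = {"animal": ["cat", "dog"], "verb": ["sat", "ran", "slept"]}


def word_probability(word, class_scores, word_scores):
    """P(word | history) = P(class | history) * P(word | class, history).

    class_scores and word_scores stand in for the outputs a neural
    network would produce for one particular history.
    """
    class_names = list(CLASSES)
    class_probs = softmax([class_scores[c] for c in class_names])
    for name, p_class in zip(class_names, class_probs):
        members = CLASSES[name]
        if word in members:
            # Normalize only over the members of this class.
            within = softmax([word_scores[w] for w in members])
            return p_class * within[members.index(word)]
    raise KeyError(word)


# Made-up scores for a single history.
class_scores = {"animal": 1.0, "verb": 0.0}
word_scores = {"cat": 0.5, "dog": 0.5, "sat": 2.0, "ran": 0.0, "slept": 0.0}

total = sum(
    word_probability(w, class_scores, word_scores)
    for members in CLASSES.values()
    for w in members
)
print(round(total, 6))
```

Because each level is normalized separately, the probabilities of all words still sum to one, while the cost of a single prediction depends on the number of classes plus the size of one class.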
|
213 |
Generierung von natürlichsprachlichen Texten aus semantischen Strukturen im Prozeß der maschinellen Übersetzung - Allgemeine Strukturen und Abbildungen
Rosenpflanzer, Lutz, Karl, Hans-Ulrich 14 December 2012
0 PREFACE
Several problems dominate the machine translation of natural language. One always has to deal with very large amounts of data. Even if only a short text is to be translated, the task is embedded in an extensive context, i.e., all knowledge about the source and target languages must be available, in as formalized a form as possible. If the input is spoken language, speech recognition and speech output tasks as well as hard real-time requirements are added. Even with modern software development concepts, the complexity of the problem is a challenge not to be underestimated for anyone attempting an implementation.
Approaches that consistently apply the working principles and methods of computer science usually present their results only as prototypes for a very small part of the language (such as a phrase, a sentence, or several example sentences) and conclude more or less inductively that the solution developed can also be applied successfully to the whole language, if only one has enough "lemmings" who, swarming out in all directions, could carry out the "remaining routine work" quickly and as busily as bees.
|
214 |
Intégration du contexte en traduction statistique à l’aide d’un perceptron à plusieurs couches
Patry, Alexandre 04 1900
Phrase-based statistical machine translation systems translate source sentences one phrase at a time, in several steps, conditioning the choice of each phrase on very little information. Bilingual phrase table scores are computed regardless of the context in which the phrases are used, and language models only look at the few words surrounding the target phrases.
In this thesis, we propose a novel model that predicts the words that should appear in a translation given the source sentence as a whole. Our model differs from previous work in its use of MLPs (multilayer perceptrons). Our interest in MLPs lies in their hidden layer, which encodes source sentences in a representation that is only loosely tied to words. A cursory evaluation showed that this hidden layer was able to cluster some sentences having similar translations even if they were formulated differently.
In a first set of experiments, our MLPs compared favorably to IBM1, a well-known model in statistical machine translation. In a second set of experiments, we embedded our MLPs in our English-to-French statistical machine translation system. Our MLPs improved translation quality over our baseline system and over a second system embedding an IBM1 model.
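As a rough illustration of predicting target words from the whole source sentence, the sketch below runs one forward pass of a tiny multilayer perceptron. Everything in it is an assumption made for the example rather than a detail of the thesis: the bag-of-words input encoding, the tanh hidden layer, the independent sigmoid outputs, the three-word vocabularies and the hand-written (untrained) weights are all hypothetical.

```python
import math

# Hypothetical toy vocabularies (not taken from the thesis).
SOURCE_VOCAB = ["the", "cat", "sleeps"]
TARGET_VOCAB = ["le", "chat", "dort"]

# Hand-written, untrained weights, purely for illustration.
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]    # 2 hidden units x 3 source words
B_HIDDEN = [0.0, 0.1]
W_OUT = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.9]]     # 3 target words x 2 hidden units
B_OUT = [0.0, 0.0, 0.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bag_of_words(sentence):
    """Encode the whole source sentence as a binary vector over SOURCE_VOCAB."""
    tokens = sentence.lower().split()
    return [1.0 if word in tokens else 0.0 for word in SOURCE_VOCAB]


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per unit."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_target_words(sentence):
    """Return a score in (0, 1) for each target word, given the full sentence."""
    x = bag_of_words(sentence)
    hidden = dense(x, W_HIDDEN, B_HIDDEN, math.tanh)  # sentence representation
    scores = dense(hidden, W_OUT, B_OUT, sigmoid)
    return dict(zip(TARGET_VOCAB, scores))


prediction = predict_target_words("The cat sleeps")
print(sorted(prediction))
```

The `hidden` vector plays the role of the sentence representation discussed above: two source sentences with different words can map to nearby hidden vectors and therefore to similar target-word predictions.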
|
215 |
Déploiement automatique d’une application de routage téléphonique d’une langue source vers une langue cible
Tremblay, Jérôme 08 1900
Statistical understanding models applied to dialog applications need a lot of training data. Often, a single application must support more than one language, as in countries that have more than one official language. In such applications, users' queries convey similar meanings but in different languages. This project presents techniques to automatically deploy a statistical understanding model from a source language to a target language. The goal is to reduce the training data needed and the time required to deploy an application in a new language.
First, an approach using machine translation techniques is presented. Then, an approach that uses a common semantic space to compare several languages is developed. These two methods are compared to assess their limits and feasibility. The contributions of this work are an improvement of a translation model using in-domain data and a novel technique for inferring a multilingual semantic space.
|
216 |
Kelių automatinio vertimo sistemų integracija / The integration of several automatic translation systems
Marin, Igor 23 July 2012
The Master’s thesis analyses machine translation systems, their principles of operation and the methods used in integrating these systems. The structure of popular statistical machine translation systems is described in detail, and the advantages and disadvantages of these and of traditional (rule-based) systems are presented. The existing machine translation systems for the Lithuanian-English language pair are presented along with their abilities and shortcomings. The Lithuanian and English languages are analysed from a linguistic perspective. The similarities and differences between these languages, as well as the difficulties arising in translating text from one language to the other, are discussed. Moreover, different machine translation evaluation methods, including the popular BLEU metric, are reviewed. Various architectures for integrating multiple machine translation systems, proposed by foreign authors, are presented and analysed. Confusion networks, which are used in integrating machine translation systems, are discussed. An original method of implementing a hybrid machine translation system is proposed. The hybrid system is implemented in practice. The translation results obtained from the created system and from the existing systems for the Lithuanian-English language pair are assessed and compared. Finally, after the theoretical review of machine translation systems and the implementation of the hybrid system, conclusions and recommendations are provided.
The thesis consists of 6 parts: introduction, analysis of machine translation systems, analysis of hybrid machine translation systems, creation of a hybrid machine translation system, conclusions, and list of references.
Thesis length... [to full text]
|
217 |
Algebraic decoder specification: coupling formal-language theory and statistical machine translation
Büchse, Matthias 28 January 2015
The specification of a decoder, i.e., a program that translates sentences from one natural language into another, is an intricate process, driven by the application and lacking a canonical methodology. The practical nature of decoder development inhibits the transfer of knowledge between theory and application, which is unfortunate because many contemporary decoders are in fact related to formal-language theory. This thesis proposes an algebraic framework where a decoder is specified by an expression built from a fixed set of operations. As yet, this framework accommodates contemporary syntax-based decoders, it spans two levels of abstraction, and, primarily, it encourages mutual stimulation between the theory of weighted tree automata and the application.
|
218 |
Combining machine learning and rule-based approaches in Spanish syntactic generation
Melero Nogués, Maria Teresa 02 June 2006
This thesis describes a Spanish Generation grammar which combines hand-written rules and Machine Learning techniques. This grammar belongs to a full-scale, commercial-quality Machine Translation system developed at Microsoft Research. The first part presents the grammar and the main linguistic strategies it embodies. The robustness demanded by everyday, real-world use of the MT system requires an extra effort from the Generator, which is addressed by adding a Pre-Generation layer that is able to fix the input to Generation without introducing ad hoc elements into the grammar rules. In the second part we explore the use of Decision Tree classifiers (DT) for automatically learning one of the operations that take place in the Pre-Generation component, namely lexical selection of the Spanish copula (i.e., ser or estar). We show that it is possible to infer from examples the contexts for this non-trivial linguistic phenomenon with high accuracy.
|
219 |
Attitydanalys av svenska produktomdömen – behövs språkspecifika verktyg? / Sentiment Analysis of Swedish Product Reviews – Are Language-specific Tools Necessary?
Glant, Oliver January 2018
Sentiment analysis of Swedish data is often performed by machine-translating it into English in order to use available English tools. This thesis compares a neural network trained on Swedish data with a corresponding one trained on English data. Two datasets were used: approximately 200,000 non-neutral Swedish reviews from the company Prisjakt Sverige AB, one of the largest annotated datasets used for Swedish sentiment analysis, and 1,000,000 non-neutral English reviews from Amazon.com.
Both networks were evaluated on 11,638 randomly selected Swedish reviews, in the original and in English machine translation. The test set had the same overrepresentation of positive reviews as the Swedish dataset (84% were positive). The results suggest that English tools can be used with machine translation for sentiment analysis of Swedish reviews without loss of classification ability. However, the English tool required approximately 33% more training data to achieve maximum performance. Evaluation on the unbalanced test set required extra consideration regarding statistical measures. The F1-measure turned out to be reliable only when calculated for the underrepresented class. It then showed a strong correlation with the Matthews correlation coefficient, which has previously been found to be more reliable. This warrants further investigation into whether the correlation holds for all class balances, which would simplify comparison between studies.
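The point about statistical measures on an unbalanced test set can be made concrete with a small calculation. The sketch below uses the standard definitions of the F1-measure and the Matthews correlation coefficient (MCC); the confusion-matrix counts are hypothetical, chosen only to mimic an 84% positive test set, and describe a degenerate classifier that labels every review as positive. They are not results from the thesis.

```python
import math


def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test set: 840 positive and 160 negative reviews,
# with a classifier that predicts "positive" for everything.
tp, fn = 840, 0      # positive reviews: all labelled positive
fp, tn = 160, 0      # negative reviews: all wrongly labelled positive

f1_positive = f1(tp, fp, fn)
# For the negative class the roles swap: its hits are tn, its false
# positives are fn, and its misses are fp.
f1_negative = f1(tn, fn, fp)
score_mcc = mcc(tp, tn, fp, fn)

print(round(f1_positive, 3), f1_negative, score_mcc)
```

In this made-up case the F1 for the majority class looks high even though the classifier has learned nothing, whereas the F1 for the underrepresented class and the MCC both fall to zero, which is the kind of agreement between the two measures that the abstract describes.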
|
220 |
The mat sat on the cat: investigating structure in the evaluation of order in machine translation
McCaffery, Martin January 2017
We present a multifaceted investigation into the relevance of word order in machine translation. We introduce two tools, DTED and DERP, each using dependency structure to detect differences between the structures of machine-produced translations and human-produced references. DTED applies the principle of Tree Edit Distance to calculate edit operations required to convert one structure into another. Four variants of DTED have been produced, differing in the importance they place on words which match between the two sentences. DERP represents a more detailed procedure, making use of the dependency relations between words when evaluating the disparities between paths connecting matching nodes. In order to empirically evaluate DTED and DERP, and as a standalone contribution, we have produced WOJ-DB, a database of human judgments. Containing scores relating to translation adequacy and more specifically to word order quality, this is intended to support investigations into a wide range of translation phenomena. We report an internal evaluation of the information in WOJ-DB, then use it to evaluate variants of DTED and DERP, both to determine their relative merit and their strength relative to third-party baselines. We present our conclusions about the importance of structure to the tools and their relevance to word order specifically, then propose further related avenues of research suggested or enabled by our work.
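The principle of Tree Edit Distance on which DTED builds can be sketched with the generic unit-cost algorithm for ordered, labelled trees, which counts the node insertions, deletions and relabellings needed to turn one tree into another. This is a minimal illustration, not DTED itself: the cost scheme, the tuple-based tree encoding and the two hand-built dependency trees are assumptions made for the example, and the word-matching variants described above are not modelled.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Nested tuples are hashable, so results
# can be memoized.


def leaf(label):
    return (label, ())


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        return forest_distance(f, g[:-1] + g[-1][1]) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        return forest_distance(f[:-1] + f[-1][1], g) + 1
    v, w = f[-1], g[-1]
    return min(
        forest_distance(f[:-1] + v[1], g) + 1,           # delete v
        forest_distance(f, g[:-1] + w[1]) + 1,           # insert w
        forest_distance(v[1], w[1])                      # match v with w,
        + forest_distance(f[:-1], g[:-1])                # align the rest,
        + (0 if v[0] == w[0] else 1),                    # relabel if needed
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Hand-built, simplified dependency trees for two sentences.
reference = ("sat", (("cat", (leaf("the"),)), ("on", (("mat", (leaf("the"),)),))))
hypothesis = ("sat", (("mat", (leaf("the"),)), ("on", (("cat", (leaf("the"),)),))))

distance = tree_edit_distance(reference, hypothesis)
print(distance)
```

Here the two trees have the same shape and differ only in which noun sits in which position, so the distance comes from relabelling nodes rather than from inserting or deleting them.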
|