191 |
On improving natural language processing through phrase-based and one-to-one syntactic algorithms. Meyer, Christopher Henry. January 1900.
Master of Science / Department of Computing and Information Sciences / William H. Hsu / Machine Translation (MT) is the practice of using computational methods to convert words from one natural language to another. Several approaches have been created since MT’s inception in the 1950s and, with the vast increase in computational resources since then, have continued to evolve and improve. In this thesis I summarize several branches of MT theory and introduce newly developed software applications, parsing techniques to improve Japanese-to-English text translation, and a new key algorithm to correct translation errors when converting from Japanese kanji to English. The overall translation improvement is measured using the BLEU metric (an objective, numerical standard in Machine Translation quality analysis). The baseline translation system was built by combining Giza++, the Thot Phrase-Based SMT toolkit, the SRILM toolkit, and the Pharaoh decoder. The input and output parsing applications were created as intermediaries to improve the baseline MT system and to eliminate artificially high improvement metrics. This baseline was measured with and without the additional parsing provided by the thesis software applications, and also with and without the thesis kanji correction utility.
The new algorithm corrected many contextual definition mistakes that are common when converting Japanese to English text. By training the new kanji correction utility on an existing dictionary, identifying source text in Japanese with a high number of possible translations, and checking the baseline translation against other translation possibilities, I was able to increase the translation performance of the baseline system from minimum normalized BLEU scores of .0273 to maximum normalized scores of .081.
The preliminary phase of improving Japanese-to-English translation focused on correcting segmentation mistakes that occur when parsing Japanese text into meaningful tokens. The resulting initial increase is artificially high and not indicative of future potential, since the starting baseline score was so low, but it was needed to establish a reasonable baseline.
The final test results confirmed that a significant, measurable improvement had been achieved, both by improving the initial segmentation of the Japanese text through parsing the input corpora and by correcting kanji translations after the Pharaoh decoding process had completed.
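As a rough illustration of the kanji-correction idea described above, the following Python sketch re-ranks dictionary senses of an ambiguous source token by the language-model score of the resulting sentence. All data, names and scores here are toy stand-ins invented for the example, not the thesis's trained dictionary or models.

```python
from math import log

# Toy bigram language model: log-probabilities for a few English bigrams.
# In the thesis, the ambiguity check runs against a real LM and a trained
# dictionary; the data here is purely illustrative.
BIGRAM_LOGPROB = {
    ("bright", "student"): log(0.30),
    ("light", "student"): log(0.01),
}
UNIGRAM_LOGPROB = {"bright": log(0.02), "light": log(0.05), "student": log(0.01)}

def sentence_score(tokens):
    """Score a token sequence with a bigram model, backing off to unigrams."""
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += BIGRAM_LOGPROB.get((prev, cur), UNIGRAM_LOGPROB.get(cur, log(1e-6)))
    return score

def correct_ambiguous_token(baseline_tokens, position, dictionary_senses):
    """Replace the token at `position` with whichever dictionary sense of the
    source kanji yields the highest-scoring sentence (the baseline choice wins
    ties, so safe translations are left untouched)."""
    best_tokens, best_score = baseline_tokens, sentence_score(baseline_tokens)
    for sense in dictionary_senses:
        candidate = baseline_tokens[:position] + [sense] + baseline_tokens[position + 1:]
        cand_score = sentence_score(candidate)
        if cand_score > best_score:
            best_tokens, best_score = candidate, cand_score
    return best_tokens

# 明 can translate as "bright" or "light"; the LM prefers "bright student".
print(correct_ambiguous_token(["a", "light", "student"], 1, ["bright", "light"]))
```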
|
192 |
Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction. Pettersson, Eva. January 2016.
Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
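The following Python sketch illustrates the shape of such a pipeline under simplifying assumptions: spelling variation is handled by a small hand-picked set of character rewrite rules and a modern lexicon (the thesis learns normalisation from data), and the tagger is a stub standing in for an off-the-shelf modern NLP tool.

```python
# Illustrative character-level rewrite rules for historical Swedish
# (e.g. "hwad" -> "vad"); a hand-picked set that only demonstrates
# the mechanism.
REWRITE_RULES = [("hw", "v"), ("dh", "d"), ("ff", "f"), ("æ", "ä")]
MODERN_LEXICON = {"vad", "och", "av", "handla"}

def candidates(word):
    """Generate spelling candidates by applying each rule wherever it matches."""
    forms = {word}
    for old, new in REWRITE_RULES:
        forms |= {f.replace(old, new) for f in list(forms)}
    return forms

def normalise(word):
    """Pick a candidate found in the modern lexicon, else keep the input."""
    in_lexicon = candidates(word) & MODERN_LEXICON
    return min(in_lexicon) if in_lexicon else word

def pipeline(historical_tokens, tagger):
    """Normalise to modern spelling, then run a tagger trained on modern data."""
    modern = [normalise(t) for t in historical_tokens]
    return list(zip(historical_tokens, modern, tagger(modern)))

# Stand-in for an off-the-shelf modern tagger.
dummy_tagger = lambda tokens: ["TAG"] * len(tokens)
print(pipeline(["hwad", "och"], dummy_tagger))
```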
|
193 |
以範例為基礎之英漢TIMSS試題輔助翻譯 / Using Example-based Translation Techniques for Computer Assisted Translation of TIMSS Test Items. 張智傑, Chang, Chih Chieh. Unknown Date.
This thesis applies example-based machine translation techniques, using bilingual English-Chinese structural correspondences to assist the translation of single English sentences into Chinese. Translation examples are stored in a special structure, the bilingual structured string tree correspondence (BSSTC), which comprises a parse tree of the source sentence, the target sentence string, and the correspondence between the source-language tree and the target-language string. A database of such examples guides the reordering of the source sentence; dictionary translation follows, with target words selected using statistical Chinese-English word alignment and a language model, and missing measure words are finally inserted to produce the suggested translation. The system targets the test items of the 2003 Trends in International Mathematics and Science Study (TIMSS), with the aim of improving translation consistency and efficiency. Translation quality was evaluated with the NIST and BLEU metrics and compared against Google Translate and Yahoo! online translation. After reordering selected word sequences and inserting measure words in the default translations, the current system achieves higher translation quality than the previous implementation of our research group, but its overall quality still lags behind Google Translate and Yahoo!.
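A minimal Python sketch of the reorder-translate-insert pipeline described above is given below. The example database, dictionary, POS patterns and measure-word table are invented toy data; real BSSTC structures carry full parse trees and alignments rather than flat POS patterns.

```python
# Minimal sketch of example-based reordering plus measure-word insertion.
EXAMPLES = {
    # Source POS pattern -> permutation giving the target-side order.
    ("NN", "IN", "NN"): (2, 1, 0),   # "length of rope" -> "繩子 的 長度"
}
DICTIONARY = {"length": "長度", "of": "的", "rope": "繩子", "three": "三", "apples": "蘋果"}
MEASURE_WORDS = {"蘋果": "個"}      # noun -> its measure word
NUMERALS = {"三"}

def translate(words, pos_tags):
    # 1) reorder the source according to a matching translation example
    order = EXAMPLES.get(tuple(pos_tags), tuple(range(len(words))))
    reordered = [words[i] for i in order]
    # 2) dictionary translation (real word selection uses alignment + LM scores)
    target = [DICTIONARY.get(w, w) for w in reordered]
    # 3) insert a measure word between a numeral and a noun that needs one
    output = []
    for i, token in enumerate(target):
        output.append(token)
        nxt = target[i + 1] if i + 1 < len(target) else None
        if token in NUMERALS and nxt in MEASURE_WORDS:
            output.append(MEASURE_WORDS[nxt])
    return "".join(output)

print(translate(["length", "of", "rope"], ["NN", "IN", "NN"]))  # 繩子的長度
print(translate(["three", "apples"], ["CD", "NNS"]))            # 三個蘋果
```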
|
194 |
Probabilistic inference for phrase-based machine translation : a sampling approach. Arun, Abhishek. January 2011.
Recent advances in statistical machine translation (SMT) have used dynamic programming (DP) based beam search methods for approximate inference within probabilistic translation models. Despite their success, these methods compromise the probabilistic interpretation of the underlying model, thus limiting the application of probabilistically defined decision rules during training and decoding. As an alternative, in this thesis, we propose a novel Monte Carlo sampling approach for theoretically sound approximate probabilistic inference within these models. The distribution we are interested in is the conditional distribution of a log-linear translation model; however, there is often no tractable way of computing the normalisation term of the model. Instead, a Gibbs sampling approach for phrase-based machine translation models is developed which obviates the need to compute this term yet produces samples from the required distribution. We establish that the sampler effectively explores the distribution defined by a phrase-based model by showing that it converges in a reasonable amount of time to the desired distribution, irrespective of initialisation. Empirical evidence is provided to confirm that the sampler can provide accurate estimates of expectations of functions of interest. The mix of high-probability and low-probability derivations obtained through sampling is shown to provide a more accurate estimate of expectations than merely using the n most probable derivations. Subsequently, we show that the sampler provides a tractable solution for finding the maximum probability translation in the model. We also present a unified approach to approximating two additional intractable problems: minimum risk training and minimum Bayes risk decoding. Key to our approach is the use of the sampler, which allows us to explore the entire probability distribution and maintain a strict probabilistic formulation throughout the translation pipeline. For these tasks, sampling allies the simplicity of n-best list approaches with the extended view of the distribution that lattice-based approaches benefit from, while avoiding the biases associated with beam search. Our approach is theoretically well-motivated and can give better and more stable results than current state-of-the-art methods.
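To make the local-resampling idea concrete, here is a minimal Python sketch of a Gibbs sampler over a toy log-linear model with a fixed segmentation: each segment's translation is resampled conditioned on the rest, so only a cheap local normaliser is ever computed. The phrase table and scores are invented; the thesis's sampler also resamples segmentations and alignments.

```python
import math
import random
from collections import Counter

random.seed(0)

# Toy phrase table: each source segment has candidate translations with
# feature scores.
PHRASE_TABLE = {
    "je": {"i": 0.9, "me": 0.4},
    "vous aime": {"love you": 1.0, "like you": 0.6},
}

def score(choice):
    """Unnormalised log-linear model score of a full derivation."""
    return sum(PHRASE_TABLE[src][tgt] for src, tgt in choice.items())

def gibbs_sample(segments, iters=5000):
    """Resample one segment's translation at a time, conditioned on the rest.
    The local normaliser is cheap even though the global one is intractable."""
    state = {s: random.choice(list(PHRASE_TABLE[s])) for s in segments}
    samples = Counter()
    for _ in range(iters):
        for seg in segments:
            options = list(PHRASE_TABLE[seg])
            weights = []
            for opt in options:
                state[seg] = opt
                weights.append(math.exp(score(state)))
            state[seg] = random.choices(options, weights=weights)[0]
        samples[" ".join(state[s] for s in segments)] += 1
    return samples

samples = gibbs_sample(["je", "vous aime"])
total = sum(samples.values())
for translation, count in samples.most_common():
    print(f"p({translation!r}) ~ {count / total:.3f}")  # Monte Carlo estimate
```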
|
195 |
Stream-based statistical machine translation. Levenberg, Abby D. January 2011.
We investigate a new approach to SMT system training within the streaming model of computation. We develop and test incrementally retrainable models which, given an incoming stream of new data, can efficiently incorporate the stream data online. A naive approach would use an unbounded amount of space; instead, our online SMT system can incorporate information from unbounded incoming streams while maintaining constant space and time. Crucially, we are able to match (or even exceed) the translation performance of comparable systems which are batch retrained and use unbounded space. Our approach is particularly suited to situations where there are arbitrarily large amounts of new training material that we wish to incorporate efficiently and in small space. The novel contributions of this thesis are:
1. An online, randomised language model that can model unbounded input streams in constant space and time.
2. An incrementally retrainable translation model for both phrase-based and grammar-based systems. The model presented is efficient enough to incorporate novel parallel text at the single-sentence level.
3. Strategies for updating our stream-based language model and translation model, demonstrating how such components can be successfully used in a streaming translation setting, both within a single streaming environment and in the novel situation of having to translate multiple streams.
4. Demonstration that recent data from the stream is beneficial to translation performance.
Our stream-based SMT system is efficient for tackling massive volumes of new training data and offers up new ways of thinking about translating web data and dealing with other natural language streams.
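The bounded-memory idea behind such a stream model can be sketched as follows. This toy bigram model uses a space-saving eviction heuristic to keep its table at a fixed size however much text streams past; the thesis instead uses randomised, hash-based structures, so this is an illustration of the constraint, not of the actual data structure.

```python
class BoundedBigramLM:
    """Constant-space stream LM sketch: bigram counts with bounded memory."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.counts = {}

    def update(self, tokens):
        """Fold one incoming sentence from the stream into the model."""
        for bigram in zip(tokens, tokens[1:]):
            if bigram in self.counts:
                self.counts[bigram] += 1
            elif len(self.counts) < self.capacity:
                self.counts[bigram] = 1
            else:
                # Evict the weakest entry and take over its count + 1,
                # keeping the table size fixed (space-saving update).
                weakest = min(self.counts, key=self.counts.get)
                count = self.counts.pop(weakest)
                self.counts[bigram] = count + 1

    def prob(self, prev, word):
        """Add-one-smoothed conditional estimate from the bounded counts."""
        context = sum(c for (p, _), c in self.counts.items() if p == prev)
        return (self.counts.get((prev, word), 0) + 1) / (context + self.capacity)

lm = BoundedBigramLM(capacity=100)
for sentence in ["the cat sat", "the cat ran", "a dog ran"]:  # stand-in stream
    lm.update(sentence.split())
print(lm.prob("the", "cat"))
```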
|
196 |
Intégration du contexte en traduction statistique à l’aide d’un perceptron à plusieurs couches / Integrating context into statistical machine translation with a multilayer perceptron. Patry, Alexandre. 04 1900.
Phrase-based statistical machine translation systems translate source sentences one phrase at a time, conditioning the choice of each phrase on very little information. Bilingual phrase table scores are computed regardless of the context in which the phrases are used, and language models only look at the few words surrounding the target phrase.
In this thesis, we propose a novel model that considers the source sentence as a whole when selecting each target word. Our context-integration model differs from previous work in its use of MLPs (multilayer perceptrons). An interesting property of MLPs is their hidden layer, which encodes the sentences to be translated in a representation that is only loosely tied to words. A preliminary evaluation of this alternative representation showed that it is able to cluster some source sentences with similar translations even when they are formulated differently.
We first compared the predictions of our MLPs favourably to those of IBM1, a model commonly used in statistical machine translation. We then integrated our MLPs into our English-to-French statistical machine translation system. Our MLPs improved the translations of our baseline system and of a second reference system that embedded an IBM1 model.
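A minimal sketch of this kind of context model appears below: the whole source sentence, encoded as a bag of words, passes through one hidden layer whose output scores every target-vocabulary word. Vocabularies and weights are toy stand-ins (the weights are random rather than trained), so the probabilities are meaningless except to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies; a trained model would learn its weights on bilingual
# data so that the hidden layer clusters similar source sentences.
SRC_VOCAB = {"je": 0, "vous": 1, "aime": 2, "la": 3, "maison": 4}
TGT_VOCAB = ["i", "you", "love", "the", "house"]

W_hidden = rng.normal(0, 0.1, (8, len(SRC_VOCAB)))   # hidden layer, 8 units
W_output = rng.normal(0, 0.1, (len(TGT_VOCAB), 8))   # one score per target word

def target_word_probs(source_tokens):
    x = np.zeros(len(SRC_VOCAB))
    for tok in source_tokens:                 # whole-sentence bag of words
        x[SRC_VOCAB[tok]] += 1.0
    h = np.tanh(W_hidden @ x)                 # hidden representation of the sentence
    z = W_output @ h
    return np.exp(z) / np.exp(z).sum()        # softmax over the target vocabulary

probs = target_word_probs(["je", "vous", "aime"])
for word, p in zip(TGT_VOCAB, probs):
    print(f"p({word} | source) = {p:.3f}")
```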
|
197 |
Amélioration a posteriori de la traduction automatique par métaheuristique / A posteriori improvement of machine translation using metaheuristics. Lavoie-Courchesne, Sébastien. 03 1900.
Statistical machine translation is a field in great demand, and machines are still far from producing human-quality results. The main method used is a linear, segment-by-segment translation of a sentence, which prevents changing parts of the sentence that have already been translated. The research in this thesis builds on the approach of Langlais, Patry and Gotti (2007), which attempts to correct a completed translation by modifying segments according to a function to be optimized. As a first step, the exploration of new features, such as an inverse language model and a collocation model, brings a new dimension to the function being optimized. As a second step, the use of different metaheuristics, such as greedy and randomized greedy algorithms, allows a deeper exploration of the search space and yields a greater improvement of the objective function.
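The following Python sketch shows a randomized greedy search of the kind described above: random restarts, each followed by greedy single-segment substitutions that are kept only when they improve the objective. The segment alternatives and the objective function are toy stand-ins for the real feature combination (language model, inverse language model, collocation model).

```python
import random

random.seed(1)

# Toy alternatives per segment position of a completed translation.
ALTERNATIVES = [["the house blue", "the blue house"], ["is beautiful", "be beautiful"]]

def objective(segments):
    """Higher is better; rewards fluent toy phrases as a stand-in objective."""
    good = {"the blue house", "is beautiful"}
    return sum(seg in good for seg in segments)

def randomized_greedy(n_restarts=10):
    best, best_score = None, float("-inf")
    for _ in range(n_restarts):
        # Random starting translation, then greedy one-segment improvements.
        current = [random.choice(opts) for opts in ALTERNATIVES]
        improved = True
        while improved:
            improved = False
            for i, opts in enumerate(ALTERNATIVES):
                for opt in opts:
                    candidate = current[:i] + [opt] + current[i + 1:]
                    if objective(candidate) > objective(current):
                        current, improved = candidate, True
        if objective(current) > best_score:
            best, best_score = current, objective(current)
    return " ".join(best), best_score

print(randomized_greedy())  # ('the blue house is beautiful', 2)
```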
|
198 |
Probabilistic modelling of morphologically rich languages. Botha, Jan Abraham. January 2014.
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
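The additive intuition behind the distributed model, a word's representation built from the vectors of its morphemes, can be sketched as follows. The morpheme vectors are random stand-ins and the greedy splitter is a toy; the thesis learns both the vectors and, in later chapters, the segmentations themselves.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random stand-in morpheme vectors; a rare compound gets a usable
# representation by summing the vectors of its known morphemes.
MORPHEME_VECS = {m: rng.normal(size=16) for m in ["haus", "tür", "schlüssel"]}

def decompose(word):
    """Greedy left-to-right split into known morphemes (toy splitter)."""
    parts, rest = [], word
    while rest:
        match = next((m for m in sorted(MORPHEME_VECS, key=len, reverse=True)
                      if rest.startswith(m)), None)
        if match is None:
            return [word]          # give up: treat the word as a single unit
        parts.append(match)
        rest = rest[len(match):]
    return parts

def word_vector(word):
    """Compose a word vector additively from its morpheme vectors."""
    return sum(MORPHEME_VECS.get(p, np.zeros(16)) for p in decompose(word))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "haustürschlüssel" (front-door key) decomposes into shared morphemes, so
# its composed vector stays close to morphologically related words.
print(decompose("haustürschlüssel"))                     # ['haus', 'tür', 'schlüssel']
print(cosine(word_vector("haustür"), word_vector("haus")))
```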
|
199 |
Systèmes de compréhension et de traduction de la parole : vers une approche unifiée dans le cadre de la portabilité multilingue des systèmes de dialogue / Spoken language understanding and translation systems : a unified approach in a multilingual dialogue systems portability context. Jabaian, Bassam. 04 December 2012.
The generalisation of human-machine dialogue systems increases the need for rapid development of the various components of these systems. Dialogue systems can be designed for different application domains and in different languages. The need for fast production of systems for new languages remains an open and crucial issue which requires effective solutions. Our work focuses on the spoken language understanding module and proposes approaches for fast, low-cost language portability of this module. Statistical methods have shown good performance for building spoken language understanding modules that semantically tag dialogue turns; however, these methods require large corpora for training, and collecting such corpora is costly in time and human expertise. In this thesis, we propose several approaches to port an understanding system from one language to another using machine translation techniques. A first line of work applies machine translation at several stages of the porting process in order to reduce the cost of producing new training data. The experimental results show that the use of machine translation makes it possible to obtain efficient systems with minimal human effort. This thesis therefore addresses both machine translation and spoken language understanding. We conducted a thorough comparison between the methods used for each task and propose a joint decoding of translation and understanding based on a discriminative method, which both translates a sentence and assigns its semantic labels. This decoding is achieved by a graph-based approach which composes a translation graph with an understanding graph. This representation can be generalized to allow rich transmission of information between the components of the dialogue system.
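The sketch below illustrates the joint-decoding idea on a toy scale: a translation graph proposes scored word options, an understanding model scores semantic tags, and the decoder picks the jointly best (translation, tagging) pair. For brevity it enumerates paths by brute force, whereas the thesis composes the two graphs; all scores and labels are invented.

```python
import itertools

# Toy translation graph: scored word options per position.
TRANSLATION_GRAPH = [
    [("i", 0.9), ("me", 0.3)],
    [("want", 0.8), ("wish", 0.5)],
    [("a", 0.9)],
    [("flight", 0.9), ("fly", 0.4)],
]
# Toy understanding graph: (word, semantic tag) -> score.
TAG_SCORES = {
    ("i", "O"): 0.9, ("me", "O"): 0.6,
    ("want", "intent"): 0.8, ("wish", "intent"): 0.5,
    ("a", "O"): 0.9,
    ("flight", "object"): 0.9, ("fly", "object"): 0.3,
}

def joint_decode():
    """Brute-force search for the best combined translation + tagging."""
    best = (float("-inf"), None, None)
    for words in itertools.product(*[[w for w, _ in opts] for opts in TRANSLATION_GRAPH]):
        trans_score = sum(dict(opts)[w] for w, opts in zip(words, TRANSLATION_GRAPH))
        tags, tag_score = [], 0.0
        for w in words:
            # Best tag per word under the understanding model.
            t, s = max(((t, s) for (w2, t), s in TAG_SCORES.items() if w2 == w),
                       key=lambda x: x[1])
            tags.append(t)
            tag_score += s
        total = trans_score + tag_score   # combined graph-composition score
        if total > best[0]:
            best = (total, words, tags)
    return best

score, words, tags = joint_decode()
print(" ".join(words), "->", list(zip(words, tags)), f"score={score:.1f}")
```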
|
200 |
Machine translation of proper names from English and French into Vietnamese : an error analysis and some proposed solutions / Traduction automatique des noms propres de l’anglais et du français vers le vietnamien : analyse des erreurs et quelques solutions. Phan Thi Thanh, Thao. 11 March 2014.
Machine translation (MT) has increasingly become an indispensable tool for decoding the meaning of a text from a source language into a target language in our current information and knowledge era. In particular, MT of proper names (PN) plays a crucial role in providing the specific and precise identification of persons, places, organizations and artefacts across languages. Despite a large number of studies and significant achievements in named entity recognition in the NLP community around the world, there has been almost no research on PN machine translation (PNMT) for the Vietnamese language. Due to the different conventions for writing, transliterating, transcribing and translating PNs from a variety of languages, including English, French, Russian and Chinese, into Vietnamese, PNMT from those languages into Vietnamese remains a challenging and problematic issue. This study focuses on the problems of English-Vietnamese and French-Vietnamese PNMT arising from current MT engines. It first proposes a corpus-based PN classification, then a detailed PNMT error analysis, and concludes with pre-processing solutions to improve MT quality.
Through the analysis and classification of PNMT errors from the two English-Vietnamese and French-Vietnamese parallel corpora of texts with PNs, we propose solutions concerning two major issues: (1) corpus annotation for preparing the pre-processing databases, and (2) design of the pre-processing program to be used on annotated corpora to reduce PNMT errors and enhance the quality of MT systems, including Google, Vietgle, Bing and EVTran. The efficacy of different annotation methods for the English and French PN corpora, and the PNMT error rates before and after using the pre-processing program on the two annotated corpora, are compared and discussed in this study. They show that the pre-processing solution significantly reduces PNMT errors and contributes to the improvement of MT systems for the Vietnamese language.
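One common form of such pre-processing can be sketched as follows: proper names annotated in the source are masked with placeholders so the MT engine passes them through untouched, then restored from a PN dictionary afterwards. The annotation format, dictionary entries and stub engine are assumptions made for the example, not the thesis's actual program.

```python
import re

# Toy PN dictionary mapping source names to their Vietnamese forms.
PN_DICTIONARY = {"London": "Luân Đôn", "Paris": "Pa-ri"}

def preprocess(annotated_sentence):
    """Swap annotated PNs (<pn>...</pn>) for indexed placeholders."""
    names = re.findall(r"<pn>(.*?)</pn>", annotated_sentence)
    text = annotated_sentence
    for i, name in enumerate(names):
        text = text.replace(f"<pn>{name}</pn>", f"PN{i}", 1)
    return text, names

def postprocess(translated, names):
    """Re-insert each PN using the dictionary (fall back to the original)."""
    for i, name in enumerate(names):
        translated = translated.replace(f"PN{i}", PN_DICTIONARY.get(name, name))
    return translated

def fake_mt_engine(text):
    """Stand-in for an MT engine; placeholders pass through untouched."""
    return {"I live in PN0 .": "Tôi sống ở PN0 ."}.get(text, text)

text, names = preprocess("I live in <pn>London</pn> .")
print(postprocess(fake_mt_engine(text), names))  # Tôi sống ở Luân Đôn .
```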
|