1 |
Structured classification for multilingual natural language processing
Blunsom, Philip (Unknown Date)
This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet the labels we would like to assign often exhibit complex internal structure, such as labelling a sentence with its parse tree, and there may be an exponential number of them to choose from. Structured classification seeks to exploit the structure of the labels in order to allow both generalisation across labels which differ by only a small amount, and tractable searches over all possible labels. In this thesis we focus on the application of conditional random field (CRF) models (Lafferty et al., 2001). These models assign an undirected graphical structure to the labels of the classification task and leverage dynamic programming algorithms to efficiently identify the optimal label for a given input. We develop a range of models for two multilingual NLP applications: word alignment for statistical machine translation (SMT), and multilingual supertagging for highly lexicalised grammars.
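As a rough illustration of the dynamic programming mentioned in the abstract above, the following is a minimal sketch of Viterbi decoding for a linear-chain model. It is not taken from the thesis: the function name `viterbi`, the score dictionaries and the toy labels are all assumptions, and a real CRF would derive these scores from learned feature weights.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: Sequence[str],
) -> Tuple[List[str], float]:
    """Return the highest-scoring label sequence and its score.

    emissions[t][y]     -- hypothetical score of label y at position t
    transitions[(a, b)] -- hypothetical score of label a followed by label b
    Missing entries are treated as a score of 0.0.
    """
    if not emissions:
        return [], 0.0

    # best[t][y] = score of the best label sequence ending in y at position t
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []  # back[t-1][y] = best predecessor of y at t

    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + transitions.get((prev, y), 0.0)
                + emissions[t].get(y, 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

The search visits each position once and compares every pair of adjacent labels, so its cost grows linearly with the input length rather than with the exponential number of candidate label sequences.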
|
2 |
Experimentální překladač z češtiny do slovenštiny / Czech-Slovak Machine Translation
Zachar, Lukáš (January 2012)
This thesis describes the ideas and theories behind machine translation, introduces the existing machine translation system Moses, and uses it to build a system that can learn to translate from Czech into Slovak.
|
3 |
Automatické vytváření slovníků z paralelních korpusů / Automatic dictionary acquisition from parallel corpora
Popelka, Jan (January 2011)
In this work, an extensible word-alignment framework is implemented from scratch. It is based on a discriminative method that combines a wide range of lexical association measures and other features, and it requires only a small amount of manually word-aligned data to optimize the model parameters. The optimal alignment is found as a minimum-weight edge cover, and selected suboptimal alignments are used to estimate the confidence of each alignment link. The feature combination is tuned over many experiments according to the evaluation results, which are compared with GIZA++. The best trained model is used to word-align a large Czech-English parallel corpus, and a bilingual lexicon is extracted from the links of highest confidence. Single-word translation equivalents are sorted by their significance. Lexicons of different sizes are extracted by taking the top N translations. The precision of the lexicons is evaluated both automatically and manually, by judging random samples.
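To make the idea of a lexical association measure and a top-N lexicon concrete, here is a minimal sketch that uses the Dice coefficient over sentence-level co-occurrence. It is a simplification and not the thesis's method, which combines many measures discriminatively and aligns by minimum-weight edge cover; the function names `dice_scores` and `top_n_lexicon` and the toy corpus are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def dice_scores(corpus: List[SentencePair]) -> Dict[Tuple[str, str], float]:
    """Dice coefficient for every source/target word pair that co-occurs.

    Counts are per sentence pair: a word is counted once per sentence,
    however often it appears in it.
    """
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src, tgt in corpus:
        src_types, tgt_types = set(src), set(tgt)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def top_n_lexicon(
    scores: Dict[Tuple[str, str], float], n: int
) -> List[Tuple[str, str]]:
    """Return the n highest-scoring word pairs (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n]]


# Toy Czech-English corpus, for illustration only.
toy_corpus = [
    (["pes", "pije"], ["dog", "drinks"]),
    (["pes", "jede"], ["dog", "rides"]),
    (["kos", "pije"], ["blackbird", "drinks"]),
]
toy_scores = dice_scores(toy_corpus)
toy_lexicon = top_n_lexicon(toy_scores, 4)
```

Varying `n` in `top_n_lexicon` gives lexicons of different sizes, trading coverage against precision.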
|
4 |
Tracing Translation Universals and Translator Development by Word Aligning a Harry Potter Corpus
Helgegren, Sofia (January 2005)
For the purpose of this descriptive translation study, a translation corpus was built from roughly the first 20,000 words of each of the first four Harry Potter books by J.K. Rowling, and their respective translations into Swedish. I*Link, a new type of word alignment tool, was used to align the samples on a word level and to investigate and analyse the aligned corpus. The purpose of the study was threefold: to investigate manifestations of translation universals, to search for evidence of translator development and to study the efficiency of different strategies for using the alignment tools. The results show that all three translation universals were manifested in the corpus, both on a general pattern level and on a more specific lexical level. Additionally, a clear pattern of translator development was discovered, showing that there are differences between the four different samples. The tendency is that the translations become further removed from the original texts, and this difference occurs homogeneously and sequentially. In the word alignment, four different ways of using the tools were tested, and one strategy was found to be more efficient than the others. This strategy uses dynamic resources from previous alignment sessions as input to I*Trix, an automatic alignment tool, and the output file is manually post-edited in I*Link. In conclusion, the study shows how new tools and methods can be used in descriptive translation studies to extract information that is not readily obtainable with traditional tools and methods.
|
5 |
Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing
Tiedemann, Jörg (January 2003)
The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing. Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup. Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques. A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb). Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.
|
6 |
Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing
Tiedemann, Jörg (January 2003)
The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing. Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup. Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques. A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb). Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.
|
7 |
Bayesian Models for Multilingual Word Alignment
Östling, Robert (January 2015)
In this thesis I explore Bayesian models for word alignment, how they can be improved through joint annotation transfer, and how they can be extended to parallel texts in more than two languages. In addition to these general methodological developments, I apply the algorithms to problems from sign language research and linguistic typology. In the first part of the thesis, I show how Bayesian alignment models estimated with Gibbs sampling are more accurate than previous methods for a range of different languages, particularly for languages with few digital resources available—which is unfortunately the state of the vast majority of languages today. Furthermore, I explore how different variations to the models and learning algorithms affect alignment accuracy. Then, I show how part-of-speech annotation transfer can be performed jointly with word alignment to improve word alignment accuracy. I apply these models to help annotate the Swedish Sign Language Corpus (SSLC) with part-of-speech tags, and to investigate patterns of polysemy across the languages of the world. Finally, I present a model for multilingual word alignment which learns an intermediate representation of the text. This model is then used with a massively parallel corpus containing translations of the New Testament, to explore word order features in 1001 languages.
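The Gibbs sampling mentioned in the abstract above can be sketched with a toy collapsed sampler for a simplified alignment model in the style of IBM Model 1, with a symmetric Dirichlet prior over translation distributions. This is an assumption-laden illustration rather than the thesis's models: the function name `gibbs_align`, the prior value `alpha`, the absence of a NULL word and the requirement that every source sentence be non-empty are all simplifications introduced here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def gibbs_align(
    corpus: List[SentencePair],
    iterations: int = 50,
    alpha: float = 0.01,
    seed: int = 0,
) -> List[List[int]]:
    """Sample one alignment per sentence pair.

    Returns links where links[k][j] is the index of the source word
    aligned to target word j in sentence pair k.
    """
    rng = random.Random(seed)
    vocab_size = len({t for _, tgt in corpus for t in tgt})

    pair_count: Dict[Tuple[str, str], int] = defaultdict(int)
    src_total: Dict[str, int] = defaultdict(int)

    # Start from random alignments and collect the resulting counts.
    alignments: List[List[int]] = []
    for src, tgt in corpus:
        links = [rng.randrange(len(src)) for _ in tgt]
        for j, i in enumerate(links):
            pair_count[(src[i], tgt[j])] += 1
            src_total[src[i]] += 1
        alignments.append(links)

    for _ in range(iterations):
        for (src, tgt), links in zip(corpus, alignments):
            for j, t in enumerate(tgt):
                # Remove the current link from the counts ...
                old = src[links[j]]
                pair_count[(old, t)] -= 1
                src_total[old] -= 1
                # ... and resample it given all other links.
                weights = [
                    (pair_count[(s, t)] + alpha)
                    / (src_total[s] + alpha * vocab_size)
                    for s in src
                ]
                i = rng.choices(range(len(src)), weights=weights)[0]
                links[j] = i
                pair_count[(src[i], t)] += 1
                src_total[src[i]] += 1
    return alignments
```

Because the output is a random sample, a practical system would typically aggregate several samples instead of trusting a single one; the fixed `seed` only makes this sketch repeatable.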
|
8 |
Tracing Translation Universals and Translator Development by Word Aligning a Harry Potter Corpus
Helgegren, Sofia (January 2005)
For the purpose of this descriptive translation study, a translation corpus was built from roughly the first 20,000 words of each of the first four Harry Potter books by J.K. Rowling, and their respective translations into Swedish. I*Link, a new type of word alignment tool, was used to align the samples on a word level and to investigate and analyse the aligned corpus. The purpose of the study was threefold: to investigate manifestations of translation universals, to search for evidence of translator development and to study the efficiency of different strategies for using the alignment tools. The results show that all three translation universals were manifested in the corpus, both on a general pattern level and on a more specific lexical level. Additionally, a clear pattern of translator development was discovered, showing that there are differences between the four different samples. The tendency is that the translations become further removed from the original texts, and this difference occurs homogeneously and sequentially. In the word alignment, four different ways of using the tools were tested, and one strategy was found to be more efficient than the others. This strategy uses dynamic resources from previous alignment sessions as input to I*Trix, an automatic alignment tool, and the output file is manually post-edited in I*Link. In conclusion, the study shows how new tools and methods can be used in descriptive translation studies to extract information that is not readily obtainable with traditional tools and methods.
|
9 |
Utveckling av ett verktyg för länkning och bedömning av översättningar / Development of a tool for linking and assessing translations
Eriksson, Joel (January 2015)
Today there are many systems for assessing and interpreting translations of texts. Some systems link parts of a source text to a translation, and there are also techniques that assess translations to give a measure of how good they are. One example of such a technique is the Token Equivalence Method (TEM). However, there are few programs, if any, that use both linking and assessment in a way that would make them useful in, for example, language education. This work develops just such a program. The program can segment parallel texts and link them to each other fully automatically via connected systems. To increase usability, the program also visualizes the linking and allows both the segmentation and the linking to be edited. The linking is then used to compute and display parts of TEM, giving a measure of the quality of the translation.
|
10 |
On-demand Development of Statistical Machine Translation Systems / Développement à la demande des systèmes de traduction automatique statistiques
Gong, Li (25 November 2014)
Statistical Machine Translation (SMT) produces results that make it a preferred choice in most machine-assisted translation scenarios. However, developing such high-performance systems involves costly processing of very large-scale data. New data are constantly made available, while SMT systems built in the standard way are static, so incorporating new data into an existing SMT system forces developers to retrain it from scratch. In addition, the adaptation of SMT systems is typically based on an available held-out development set and is performed once and for all. In this thesis, we propose an on-demand framework that tackles these three problems jointly: it makes it possible to develop SMT systems on a per-need basis with incremental updates, and to adapt existing systems to each individual input text. The first main contribution of this thesis is a new on-demand word alignment method that aligns training sentence pairs in isolation. This property allows SMT systems to compute information on a per-need basis and to seamlessly incorporate newly available data into an existing SMT system without retraining the whole system. The second main contribution is the integration of contextual sampling strategies that select translation examples from large-scale corpora based on their similarity to the input text, so as to build adapted phrase tables.
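The idea of selecting translation examples by their similarity to the input text can be illustrated with a minimal sketch that ranks sentence pairs by n-gram overlap with the input. The scoring rule, the function names `ngrams` and `sample_similar`, and the toy corpus are assumptions made for illustration; the thesis's actual sampling strategies are not specified in the abstract.

```python
from typing import List, Set, Tuple

SentencePair = Tuple[List[str], List[str]]


def ngrams(tokens: List[str], max_n: int = 3) -> Set[Tuple[str, ...]]:
    """All n-grams of the token list for n = 1 .. max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def sample_similar(
    corpus: List[SentencePair],
    input_tokens: List[str],
    k: int = 2,
    max_n: int = 3,
) -> List[SentencePair]:
    """Return up to k sentence pairs whose source side shares the most
    n-grams with the input text; pairs with no overlap are skipped."""
    query = ngrams(input_tokens, max_n)
    scored = []
    for idx, (src, _tgt) in enumerate(corpus):
        overlap = len(query & ngrams(src, max_n))
        if overlap:
            scored.append((overlap, idx))
    # Highest overlap first; earlier corpus position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [corpus[idx] for _, idx in scored[:k]]


# Toy English-French corpus, for illustration only.
toy_corpus = [
    ("the cat sleeps".split(), "le chat dort".split()),
    ("a dog barks".split(), "un chien aboie".split()),
    ("the cat eats fish".split(), "le chat mange du poisson".split()),
]
selected = sample_similar(toy_corpus, "the cat eats".split(), k=2)
```

Only the selected pairs would then be aligned and used to build a phrase table for this particular input, which is what makes the per-input adaptation affordable.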
|