61 |
Compound Processing for Phrase-Based Statistical Machine TranslationStymne, Sara January 2009 (has links)
In this thesis I explore how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. Both German and Swedish generally use closed compounds, which are written as one word without spaces or other indicators of word boundaries. Compounding is both common and productive, which makes it problematic for PBSMT, mainly due to sparse data problems. The adopted strategy for compound processing is to split compounds into their component parts before training and translation. For translation into Swedish and German the parts are merged after translation. I investigate the effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German. I also apply these methods to a different language pair, English--Swedish. Overall the studies show that compound processing is useful, especially for translation from English into German or Swedish. But there are improvements for translation into English as well, such as a reduction of unknown words. I show that for translation between English and German different splitting algorithms work best for different translation directions. I also design and evaluate a novel merging algorithm based on part-of-speech matching, which outperforms previous methods for compound merging, showing the need for information that is carried through the translation process, rather than only external knowledge sources such as word lists. Most of the methods for compound processing were originally developed for German. I show that these methods can be applied to Swedish as well, with similar results.
|
62 |
Target-Dominant Chinese-English Machine TranslationSu, Dan 23 November 2004 (has links) (PDF)
Information exchange is increasing rapidly with the advent of globalization. As the language spoken by the most people in today's world, Chinese will play an important role in information exchange in the future. Therefore, we need an efficient and practical means to access the increasingly large volume of Chinese data. This thesis describes a target-dominant Chinese-English machine translation system, which can translate a given Chinese news sentence into English. We conjecture that we can improve the state of the art of MT using a TDMT approach. This system has participated in the NIST (National Institute of Standards and Technology) 2004 machine translation competition. Experimental results on Penn Chinese Treebank corpus show that a machine translation system adopting a target-dominant approach is promising.
|
63 |
Utvärdering av domänanpassning i maskinöversättningssystem för användning inom MyScania / Evaluation of domain customization in machine translation systems for use in MyScaniaOlofsson, Martin, Larsson, Jesper January 2022 (has links)
Denna rapport syftar primärt till att undersöka hur väl system för maskinöversättning kan prestera i relation till Scanias kravbild. Undersökningen riktar sig främst till att undersöka systemens förmåga till domänanpassning och vilken effekt det har på dess maskinöversättningar. Utvärdering görs dels med automatiska utvärderingsmetoder som på olika sätt mäter korrelation till existerande textinnehåll från diverse tjänster i samlingsplattformen MyScania, men även manuellt av översättare med erfarenhet inom Scanias språkbruk. Resultatet av denna undersökning visade att domänanpassning med egna data generellt ökar kvaliteten av maskinöversättningar. Det noteras även hur väl maskinöversättningarna presterar varierar mycket på faktorer som exempelvis språk. Google AutoML lyckas däremot prestera bäst i alla de testade språken. Detta visades vid både automatisk utvärdering och manuell utvärdering. Undersökningen visade även svagheter i automatisk utvärderingsmetrik vid fristående användning men samtidigt att det kan bidra med meningsfulla insikter när det kompletteras med mänsklig bedömning. Undersökningen bekräftar att mänsklig bedömning alltid bör användas om det är möjligt. / This report’s primary purpose is to examine how well systems for machine translation can perform in relation to what is sought after by Scania. This examination is primarily aimed at investigating the systems capability for domain customization and what effects these have on the results of machine translations. Evaluation is done partly using multiple automatic metrics that in different ways measure correlation to existing translations within MyScania, combined with manual evaluation done by translators experienced with Scania’s language usage. The results of this examination showed that domain customization using own data generally increases the quality of machine translations. It is noted that how the machine translations perform is affected by many factors such as languages, Google AutoML however succeeds to perform the best in all the tested languages. This is proven both in evaluation using automatic metrics and manual evaluation. This investigation also showed weaknesses in automatic metrics in stand-alone use but that they can contribute with meaningful knowledge when complemented by manual evaluation. This investigation confirms that manual evaluation should always be used when possible.
|
64 |
Neural maskinöversättning av gawarbati / Neural machine translation for GawarbatiGillholm, Katarina January 2023 (has links)
Nya neurala modeller har lett till stora framsteg inom maskinöversättning, men fungerar fortfarande sämre på språk som saknar stora mängder parallella data, så kallade lågresursspråk. Gawarbati är ett litet, hotat lågresursspråk där endast 5000 parallella meningar finns tillgängligt. Denna uppsats använder överföringsinlärning och hyperparametrar optimerade för små datamängder för att undersöka möjligheter och begränsningar för neural maskinöversättning från gawarbati till engelska. Genom att använda överföringsinlärning där en föräldramodell först tränades på hindi-engelska förbättrades översättningar med 1.8 BLEU och 1.3 chrF. Hyperparametrar optimerade för små datamängder ökade BLEU med 0.6 men minskade chrF med 1. Att kombinera överföringsinlärning och hyperparametrar optimerade för små datamängder försämrade resultatet med 0.5 BLEU och 2.2 chrF. De neurala modellerna jämförs med och presterar bättre än ordbaserad statistisk maskinöversättning och GPT-3. Den bäst presterande modellen uppnådde endast 2.8 BLEU och 19 chrF, vilket belyser begränsningarna av maskinöversättning på lågresursspråk samt det kritiska behovet av mer data. / Recent neural models have led to huge improvements in machine translation, but performance is still suboptimal for languages without large parallel datasets, so called low resource languages. Gawarbati is a small, threatened low resource language with only 5000 parallel sentences. This thesis uses transfer learning and hyperparameters optimized for small datasets to explore possibilities and limitations for neural machine translation from Gawarbati to English. Transfer learning, where the parent model was trained on parallel data between Hindi and English, improved results by 1.8 BLEU and 1.3 chrF. Hyperparameters optimized for small datasets increased BLEU by 0.6 but decreased chrF by 1. Combining transfer learning and hyperparameters optimized for small datasets led to a decrease in performance by 0.5 BLEU and 2.2 chrF. The neural models outperform a word based statistical machine translation and GPT-3. The highest performing model only achieved 2.8 BLEU and 19 chrF, which illustrates the limitations of machine translation for low resource languages and the critical need for more data. / VR 2020-01500
|
65 |
Exact sampling and optimisation in statistical machine translationAziz, Wilker Ferreira January 2014 (has links)
In Statistical Machine Translation (SMT), inference needs to be performed over a high-complexity discrete distribution de ned by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation { the task of searching for the optimum translation { or sampling { the task of nding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. For inference tasks other than optimisation, rather than nding a single optimum, one is really interested in obtaining a set of probabilistic samples from the distribution. This is the case in training where one wishes to obtain unbiased estimates of expectations in order to t the parameters of a model. Samples are also necessary in consensus decoding where one chooses from a sample of likely translations the one that minimises a loss function. Due to the additional computational challenges posed by sampling, n-best lists, a by-product of optimisation, are typically used as a biased approximation to true probabilistic samples. A more direct procedure is to attempt to directly draw samples from the underlying distribution rather than rely on n-best list approximations. Markov Chain Monte Carlo (MCMC) methods, such as Gibbs sampling, o er a way to overcome the tractability issues in sampling, however their convergence properties are hard to assess. That is, it is di cult to know when, if ever, an MCMC sampler is producing samples that are compatible iii with the goal distribution. Rejection sampling, a Monte Carlo (MC) method, is more fundamental and natural, it o ers strong guarantees, such as unbiased samples, but is typically hard to design for distributions of the kind addressed in SMT, rendering an intractable method. A recent technique that stresses a uni ed view between the two types of inference tasks discussed here | optimisation and sampling | is the OS approach. OS can be seen as a cross between Adaptive Rejection Sampling (an MC method) and A optimisation. In this view the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally re ned to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level. This thesis introduces an approach to exact optimisation and exact sampling in SMT by addressing the tractability issues associated with the intersection between the translation hypergraph and the language model. The two forms of inference are handled in a uni ed framework based on the OS approach. In short, an intractable goal distribution, over which one wishes to perform inference, is upperbounded by tractable proposal distributions. A proposal represents a relaxed version of the complete space of weighted translation derivations, where relaxation happens with respect to the incorporation of the language model. These proposals give an optimistic view on the true model and allow for easier and faster search using standard dynamic programming techniques. In the OS approach, such proposals are used to perform a form of adaptive rejection sampling. In rejection sampling, samples are drawn from a proposal distribution and accepted or rejected as a function of the mismatch between the proposal and the goal. The technique is adaptive in that rejected samples are used to motivate a re nement of the upperbound proposal that brings it closer to the goal, improving the rate of acceptance. Optimisation can be connected to an extreme form of sampling, thus the framework introduced here suits both exact optimisation and exact iv sampling. Exact optimisation means that the global maximum is found with a certi cate of optimality. Exact sampling means that unbiased samples are independently drawn from the goal distribution. We show that by using this approach exact inference is feasible using only a fraction of the time and space that would be required by a full intersection, without recourse to pruning techniques that only provide approximate solutions. We also show that the vast majority of the entries (n-grams) in a language model can be summarised by shorter and optimistic entries. This means that the computational complexity of our approach is less sensitive to the order of the language model distribution than a full intersection would be. Particularly in the case of sampling, we show that it is possible to draw exact samples compatible with distributions which incorporate a high-order language model component from proxy distributions that are much simpler. In this thesis, exact inference is performed in the context of both hierarchical and phrase-based models of translation, the latter characterising a problem that is NP-complete in nature.
|
66 |
New data-driven approaches to text simplificationŠtajner, Sanja January 2016 (has links)
No description available.
|
67 |
Model adaptation techniques in machine translationShah, Kashif 29 June 2012 (has links) (PDF)
Nowadays several indicators suggest that the statistical approach to machinetranslation is the most promising. It allows fast development of systems for anylanguage pair provided that sufficient training data is available.Statistical Machine Translation (SMT) systems use parallel texts ‐ also called bitexts ‐ astraining material for creation of the translation model and monolingual corpora fortarget language modeling.The performance of an SMT system heavily depends upon the quality and quantity ofavailable data. In order to train the translation model, the parallel texts is collected fromvarious sources and domains. These corpora are usually concatenated, word alignmentsare calculated and phrases are extracted.However, parallel data is quite inhomogeneous in many practical applications withrespect to several factors like data source, alignment quality, appropriateness to thetask, etc. This means that the corpora are not weighted according to their importance tothe domain of the translation task. Therefore, it is the domain of the training resourcesthat influences the translations that are selected among several choices. This is incontrast to the training of the language model for which well‐known techniques areused to weight the various sources of texts.We have proposed novel methods to automatically weight the heterogeneous data toadapt the translation model.In a first approach, this is achieved with a resampling technique. A weight to eachbitexts is assigned to select the proportion of data from that corpus. The alignmentscoming from each bitexts are resampled based on these weights. The weights of thecorpora are directly optimized on the development data using a numerical method.Moreover, an alignment score of each aligned sentence pair is used as confidencemeasurement.In an extended work, we obtain such a weighting by resampling alignments usingweights that decrease with the temporal distance of bitexts to the test set. By thesemeans, we can use all the available bitexts and still put an emphasis on the most recentone. The main idea of our approach is to use a parametric form or meta‐weights for theweighting of the different parts of the bitexts. This ensures that our approach has onlyfew parameters to optimize.In another work, we have proposed a generic framework which takes into account thecorpus and sentence level "goodness scores" during the calculation of the phrase‐tablewhich results into better distribution of probability mass of the individual phrase pairs.
|
68 |
Automatic post-editing of phrase-based machine translation outputs / Automatic post-editing of phrase-based machine translation outputsRosa, Rudolf January 2013 (has links)
We present Depfix, a system for automatic post-editing of phrase-based English-to-Czech machine trans- lation outputs, based on linguistic knowledge. First, we analyzed the types of errors that a typical machine translation system makes. Then, we created a set of rules and a statistical component that correct errors that are common or serious and can have a potential to be corrected by our approach. We use a range of natural language processing tools to provide us with analyses of the input sentences. Moreover, we reimple- mented the dependency parser and adapted it in several ways to parsing of statistical machine translation outputs. We performed both automatic and manual evaluations which confirmed that our system improves the quality of the translations.
|
69 |
Měření kvality strojového překladu / Measures of Machine Translation QualityMacháček, Matouš January 2014 (has links)
Title: Measures of Machine Translation Quality Author: Matouš Macháček Department: Institute of Formal and Applied Linguistics Supervisor: RNDr. Ondřej Bojar, Ph.D. Abstract: We explore both manual and automatic methods of machine trans- lation evaluation. We propose a manual evaluation method in which anno- tators rank only translations of short segments instead of whole sentences. This results in easier and more efficient annotation. We have conducted an annotation experiment and evaluated a set of MT systems using this method. The obtained results are very close to the official WMT14 evaluation results. We also use the collected database of annotations to automatically evalu- ate new, unseen systems and to tune parameters of a statistical machine translation system. The evaluation of unseen systems, however, does not work and we analyze the reasons. To explore the automatic methods, we organized Metrics Shared Task held during the Workshop of Statistical Ma- chine Translation in years 2013 and 2014. We report the results of the last shared task, discuss various metaevaluation methods and analyze some of the participating metrics. Keywords: machine translation, evaluation, automatic metrics, annotation
|
70 |
Indução de léxicos bilíngües e regras para a tradução automática / Induction of translation lexicons and transfer rules for machine translationCaseli, Helena de Medeiros 21 May 2007 (has links)
A Tradução Automática (TA) -- tradução de uma língua natural (fonte) para outra (alvo) por meio de programas de computador -- é uma tarefa árdua devido, principalmente, à necessidade de um conhecimento lingüístico aprofundado das duas (ou mais) línguas envolvidas para a construção de recursos, como gramáticas de tradução, dicionários bilíngües etc. A escassez de recursos lingüísticos, e mesmo a dificuldade em produzi-los, geralmente são fatores limitantes na atuação dos sistemas de TA, restringindo-os, por exemplo, quanto ao domínio de aplicação. Neste contexto, diversos métodos vêm sendo propostos com o intuito de gerar, automaticamente, conhecimento lingüístico a partir dos recursos multilíngües e, assim, tornar a construção de tradutores automáticos menos trabalhosa. O projeto ReTraTos, apresentado neste documento, é uma dessas propostas e visa à indução automática de léxicos bilíngües e de regras de tradução a partir de corpora paralelos etiquetados morfossintaticamente e alinhados lexicalmente para os pares de idiomas português--espanhol e português--inglês. O sistema proposto para a indução de regras de tradução apresenta uma abordagem inovadora na qual os exemplos de tradução são divididos em blocos de alinhamento e a indução é realizada para cada bloco, separadamente. Outro fator inovador do sistema de indução é uma filtragem mais elaborada das regras induzidas. Além dos sistemas de indução de léxicos bilíngües e de regras de tradução, implementou-se também um módulo de tradução automática para permitir a validação dos recursos induzidos. Os léxicos bilíngües foram avaliados intrinsecamente e os resultados obtidos estão de acordo com os relatados na literatura para essa área. As regras de tradução foram avaliadas direta e indiretamente por meio do módulo de TA e sua utilização trouxe um ganho na tradução palavra-a-palavra em todos os sentidos (fonte--alvo e alvo--fonte) para a tradução dos idiomas em estudo. As traduções geradas com os recursos induzidos no ReTraTos também foram comparadas às geradas por sistemas comerciais, apresentando melhores resultados para o par de línguas português--espanhol do que para o par português--inglês. / Machine Translation (MT) -- the translation of a natural (source) language into another (target) by means of computer programs -- is a hard task, mainly due to the need of deep linguistic knowledge about the two (or more) languages required to build resources such as translation grammars, bilingual dictionaries, etc. The scarcity of linguistic resources or even the difficulty to build them often limits the use of MT systems, for example, to certain application domains. In this context, several methods have been proposed aiming at generating linguistic knowledge automatically from multilingual resources, so that building translation tools becomes less hard. The ReTraTos project presented in this document is one of these proposals and aims at inducing translation lexicons and transfer rules automatically from PoS-tagged and lexically aligned translation examples for Portuguese--Spanish and Portuguese--English language pairs. The rule induction system brings forth a new approach, in which translation examples are split into alignment blocks and induction is performed for each type of block separately. Another new feature of this system is a more elaborate strategy for filtering the induced rules. Besides the translation lexicon and the transfer rule induction systems, we also implemented a MT module for validating the induced resources. The induced translation lexicons were evaluated intrinsically and the results obtained agree with those reported on the literature. The induced translation rules were evaluated directly and indirectly by the MT module, and improved the word-by-word translation in both directions (source--target and target--source) for the languages under study. The target sentences obtained by the induced resources were also compared to those generated by commercial systems, showing better results for Portuguese--Spanish than for Portuguese--English.
|
Page generated in 0.1161 seconds