171 |
A generic and open framework for multiword expressions treatment: from acquisition to applications. Ramisch, Carlos Eduardo, January 2012.
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness and heterogeneity, as well as limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated, and comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled along four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., the nature and size of the corpora, the language and type of MWE, the depth of analysis, and the available resources. The second main contribution of this thesis is the application-oriented evaluation of our proposed methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the process and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation of the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of past, ongoing and future work.
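To make the acquisition step concrete: the candidate-extraction stage of such a framework typically combines n-gram counting with lexical association measures. The following Python sketch illustrates the general technique only (it is not the mwetoolkit's actual interface; the corpus file name and frequency threshold are hypothetical), ranking adjacent word pairs by pointwise mutual information:

```python
import math
from collections import Counter

def mwe_candidates(tokens, min_count=3):
    """Rank adjacent word pairs by pointwise mutual information (PMI),
    a classic association measure for MWE candidate extraction."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = float(len(tokens))
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # frequency threshold: rare pairs give unreliable scores
        pmi = math.log2((count / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: -item[1])

# Hypothetical corpus file; any tokenised monolingual text works.
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
for pair, score in mwe_candidates(tokens)[:20]:
    print(" ".join(pair), round(score, 2))
```

On real corpora, institutionalised combinations such as bus stop tend to surface near the top of such a ranking, while free combinations score low.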
|
172 |
Aperfeiçoamento de um tradutor automático Português-Inglês: tempos verbais / Development of a Portuguese-to-English machine translation system: tenses. Lucia Helena Rozario da Silva, 03 August 2010.
Esta dissertação apresenta o aperfeiçoamento de um sistema de tradução automática português-inglês. Nosso objetivo principal é criar regras de transferência estrutural entre o par de línguas português e inglês e avaliar, através do uso da métrica de avaliação METEOR, o desempenho do sistema. Para isto, utilizamos um corpus teste criado especialmente para esta pesquisa. Tendo como ponto de partida a relevância de uma correta tradução para os tempos verbais de uma sentença, este trabalho priorizou a criação de regras que tratassem a transferência entre os tempos verbais do português brasileiro para o inglês americano. Devido ao fato de os verbos em português estarem distribuídos por três conjugações, criamos um corpus para cada uma dessas conjugações. O objetivo da criação desses corpora é verificar a aplicação das regras de transferência estrutural entre os tempos verbais em todas as três classes de conjugação. Após a criação dos corpora, mapeamos os tempos verbais em português nos modos indicativo, subjuntivo e imperativo para os tempos verbais do inglês. Em seguida, iniciamos a construção das regras de transferência estrutural entre os tempos verbais mapeados. Ao final da construção das regras, submetemos os corpora, obedecendo às três classes de conjugação, à métrica de avaliação automática METEOR. Os resultados da avaliação do sistema após a inserção das regras apresentaram uma regressão quando comparados à avaliação do sistema no estágio inicial da pesquisa. Detectamos, através de análises dos resultados, que a métrica de avaliação automática METEOR não foi sensível às modificações feitas no sistema, embora as regras criadas sigam a gramática tradicional da língua portuguesa e estejam sendo aplicadas a todas as três classes de conjugação. Apresentamos em detalhes o conjunto de regras sintáticas e os corpora utilizados neste estudo, e que acreditamos serem de utilidade geral para quaisquer sistemas de tradução automática entre o português brasileiro e o inglês americano. Outra contribuição deste trabalho está em discutir os valores apresentados pela métrica METEOR e sugerir que novos ajustes sejam feitos a esses parâmetros utilizados pela métrica. / This dissertation presents the improvement of a Portuguese-to-English machine translation system. Our main objectives are to create structural transfer rules between this pair of languages and to evaluate the performance of the system using the METEOR evaluation metric. To this end, we built a test corpus especially for this research. Taking the importance of correctly translating the verb tenses of a sentence as a starting point, this work prioritised the creation of rules handling the transfer of verb tenses from Brazilian Portuguese to American English. Since verbs in Portuguese are distributed over three conjugations, we created one corpus for each of these conjugations. The objective was to verify the application of the structural transfer rules between verb tenses in each conjugation class. After creating these corpora, we mapped the Portuguese verb tenses in the indicative, subjunctive and imperative moods to the English tenses. Next, we constructed structural transfer rules for these mapped verb tenses. After constructing the rules, we evaluated our corpora, organised by conjugation class, with the METEOR evaluation metric. The evaluation results showed a regression after the insertion of these transfer rules, compared to the evaluation of the system at the initial stage of the research.
We detected that the METEOR evaluation metric was not sensitive to these modifications made to the system, even though the rules were linguistically sound and were applied correctly to the sentences. We present in detail the set of transfer rules and the corpora used in this study, which we believe are general enough to be useful in any rule-based Portuguese-to-English machine translation system. Another contribution of this work lies in the discussion of the values produced by the METEOR metric. We suggest adjustments to its parameters in order to make it more sensitive to sentence variations such as those introduced by our rules.
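For readers unfamiliar with the metric discussed above, here is a minimal sketch of METEOR's core sentence-level computation, under simplifying assumptions: only exact unigram matches are counted (the full metric also matches stems and synonyms), and difflib's matching blocks only approximate the chunk count used in the fragmentation penalty.

```python
from difflib import SequenceMatcher

def simple_meteor(hypothesis, reference):
    """Simplified METEOR: exact unigram matches, recall-weighted F-mean
    and a fragmentation penalty over contiguous matched chunks."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    blocks = [b for b in SequenceMatcher(None, hyp, ref).get_matching_blocks() if b.size]
    matches = sum(b.size for b in blocks)  # number of matched unigrams
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-heavy F
    penalty = 0.5 * (len(blocks) / matches) ** 3  # fewer, longer chunks = better
    return f_mean * (1 - penalty)

# A single substituted word changes the score only slightly, which hints
# at why tense-level corrections may barely move the metric.
print(simple_meteor("ela tinha cantado a musica", "ela havia cantado a musica"))
```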
|
173 |
Requisitos para a modelagem de padrões de cunhagem e construções semi-produtivas no constructicon da FrameNet Brasil com foco no fomento ao desenvolvimento de tradutores automáticos. Tavares, Tatiane Silva, 28 September 2018.
Esta pesquisa tem por objetivo investigar as contribuições que a FrameNet e o Constructicon podem oferecer aos sistemas de Tradução por Máquina (TM) ao revisitar a literatura sobre Gramática das Construções e Padrões de Cunhagem (KAY, 2013). A hipótese é a de que a utilização da base de dados de uma FrameNet, que oferece representações computacionais das estruturas cognitivas essenciais na construção do sentido (os frames), e de um Constructicon, o qual integra a informação sobre a gramática de uma língua, pode auxiliar o processamento de línguas naturais pelo computador e, consequentemente, auxiliar os sistemas de Tradução por Máquina. O segundo recorte deste trabalho refere-se à revisão da Gramática das Construções, especialmente em relação ao tratamento de padrões de cunhagem. Segundo a abordagem de Kay (2005, 2013), deve-se considerar como construção apenas a quantidade mínima de informação que o falante precisa ter para que seja capaz de entender e produzir sentenças da língua. Nesta perspectiva, assume-se que as construções de uma língua sejam estes padrões mais gerais e produtivos, os quais licenciam as mais diversas sentenças compreensíveis pelo falante. Por isso buscamos reanalisar a estrutura de quantificação indefinida: i) mar de gente, ii) oceano de calúnias, iii) enxurrada de notícias, já investigada em dissertação de mestrado (TAVARES, 2014), a fim de que se possa discutir a validade da abordagem de Kay para a modelagem computacional do padrão de quantificação. A análise dos dados é guiada pelos pressupostos da Semântica de Frames (FILLMORE, 1982, 1985; PETRUCK, 1996) e pela metodologia da FrameNet, a qual emprega a análise semântica e sintática dos objetos de investigação, pois, para as tarefas de Processamento de Línguas Naturais, é necessário que se levem em conta as regularidades da língua para que esta seja processada computacionalmente. A conclusão da análise aponta para o fato de que a modelagem da estrutura de quantificação indefinida como uma rede de padrões de cunhagem no Constructicon da FrameNet Brasil, definidos a partir de restrições soft, pode trazer um ganho de generalidade na análise e a possibilidade de que estruturas inovadoras, cunhadas por analogia, sejam igualmente reconhecíveis pelo modelo computacional resultante, o que representa um avanço em relação aos processos tradicionalmente empregados na hibridização de sistemas de tradução por máquina. / This work aims to investigate the contributions that the FrameNet and the Constructicon could offer to Machine Translation (MT) systems by revisiting the literature on Construction Grammar and Coinage Patterns (KAY, 2013). The hypothesis is that using the FrameNet database, which offers computational representations of the cognitive structures essential for meaning construction (frames), and a Constructicon database, which integrates information about the grammar of a language, can support natural language processing by the computer and, consequently, assist Machine Translation systems. The second focus of this work is a review of Construction Grammar, especially the treatment of coinage patterns. According to Kay's approach (2005, 2013), only the minimal amount of information that a speaker needs in order to understand and produce sentences of the language should be considered a construction. From this perspective, it is assumed that the constructions of a language are these more general and productive patterns, which license the wide range of sentences comprehensible to the speaker. For this reason we reanalyse the indefinite quantification structure: i) mar de gente, ii) oceano de calúnias, iii) enxurrada de notícias, already investigated in a master's dissertation (TAVARES, 2014), in order to discuss the validity of Kay's approach for the computational modeling of the quantification pattern. The analysis is guided by Frame Semantics principles (FILLMORE, 1982, 1985; PETRUCK, 1996) and by the FrameNet methodology, which applies semantic and syntactic analysis to the objects of this research, because Natural Language Processing tasks demand an account of the language's regularities so that it can be processed computationally. The conclusion of the analysis points to the fact that modeling the indefinite quantification structure as a network of coinage patterns in the FrameNet Brasil Constructicon, defined in terms of soft constraints, can bring a gain in generality to the analysis and the possibility that innovative structures, coined by analogy, are equally recognizable by the resulting computational model, which represents an advance over the processes traditionally employed in the hybridization of machine translation systems.
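As a rough illustration of what modelling a coinage pattern with soft constraints might look like computationally (a toy sketch: the frame labels, weights and lexicon below are invented for the example; a real implementation would query the FrameNet Brasil database):

```python
# Hypothetical frame assignments for a toy lexicon.
FRAMES = {
    "mar": {"Natural_expanse"}, "oceano": {"Natural_expanse"},
    "enxurrada": {"Mass_motion"}, "gente": {"People"},
    "calúnias": {"Communication"}, "notícias": {"Communication"},
}

# Soft constraints with weights: violating one lowers the score instead of
# ruling the coinage out, so analogical coinages remain recognizable.
CONSTRAINTS = [
    (0.6, lambda n1, n2: bool(FRAMES.get(n1, set()) & {"Natural_expanse", "Mass_motion"})),
    (0.4, lambda n1, n2: n2 in FRAMES),  # N2 should evoke some known frame
]

def coinage_score(phrase):
    """Score an 'N1 de N2' string against the indefinite-quantification pattern."""
    n1, de, n2 = phrase.split()
    assert de == "de"
    return sum(weight for weight, holds in CONSTRAINTS if holds(n1, n2))

for p in ["mar de gente", "enxurrada de notícias", "mesa de gente"]:
    print(p, coinage_score(p))
```

The point of the soft-constraint design is visible in the last test case: an implausible coinage still receives a partial score rather than being rejected outright.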
|
174 |
'Consider' and its Swedish equivalents in relation to machine translation. Andersson, Karin, January 2007.
This study describes the English verb 'consider' and the characteristics of some of its senses. An investigation of this kind may be useful, since a machine translation program, SYSTRAN, has invariably translated 'consider' with the Swedish verbs 'betrakta' (Eng: 'view', 'regard') and 'anse' (Eng: 'regard'). This handling of 'consider' is not satisfactory in all contexts. Since 'consider' is a cogitative verb, it is fascinating to observe that both the theory of semantic primes and universals and conceptual semantics are concerned with cogitation in various ways. Anna Wierzbicka, one of the advocates of semantic primes and universals, argues that THINK should be considered a semantic prime. Moreover, one of the prime concerns of conceptual semantics is to describe how thoughts are constructed by virtue of, e.g., linguistic components, perception and experience. In order to define and clarify the distinctions between the different senses, we have taken advantage of the theory of mental spaces. This thesis has been structured in accordance with the meanings of 'consider' indicated in WordNet. Accordingly, the senses that 'consider' represents have been organised into the following groups: 'Observation'; 'Opinion', together with its sub-group 'Likelihood'; and 'Cogitation', followed by its sub-group 'Attention/Consideration'. A concordance tool, http://www.nla.se/culler, provided us with 90 literary quotations, which were collected in a corpus. These citations were then distributed among the groups mentioned above and translated into Swedish by SYSTRAN. Furthermore, the meanings of 'consider' have also been related to the senses recorded by the FrameNet scholars, where 'consider' is regarded as a verb of 'Cogitation' and 'Categorization'. Once the study was completed, it could be inferred that certain senses are connected to specific syntactic constructions. In other cases, however, the distinctions between various meanings can only be explained by means of semantics. To conclude, it appears likely that implementation is easier when a specific syntactic construction can be tied to a particular sense, as may be the case for some meanings of 'consider'. Machine translation is presumably a much more laborious task if one is governed solely by semantic conditions.
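A minimal sketch of the kind of syntax-driven sense selection the study suggests might look as follows; the regular-expression cues and the Swedish verb choices are illustrative assumptions based on the sense groups above, not SYSTRAN's actual transfer rules:

```python
import re

# Illustrative cue-to-sense rules, tried in order; first match wins.
RULES = [
    (re.compile(r"\bconsider\w*\b.*\bas\b"), ("Opinion", "betrakta ... som")),
    (re.compile(r"\bconsider\w*\s+that\b"), ("Opinion/Likelihood", "anse att")),
    (re.compile(r"\bconsider\w*\s+\w+ing\b"), ("Cogitation", "överväga")),
    (re.compile(r"\bconsider\w*\b"), ("Attention/Consideration", "ta ... i beaktande")),
]

def choose_translation(sentence):
    """Pick a sense label and Swedish rendering from the first matching cue."""
    s = sentence.lower()
    for pattern, choice in RULES:
        if pattern.search(s):
            return choice
    return None

for s in ["We consider him as a friend.",
          "I consider that the plan will succeed.",
          "She considered moving abroad."]:
    print(s, "->", choose_translation(s))
```

This mirrors the thesis's conclusion: senses tied to a syntactic construction are easy to implement, while purely semantic distinctions fall through to a default.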
|
175 |
Data-driven natural language generation using statistical machine translation and discriminative learning / L'approche discriminante à la génération de la parole. Manishina, Elena, 05 February 2016.
L'humanité a longtemps été passionnée par la création de machines intellectuelles qui peuvent librement interagir avec nous dans notre langue. Tous les systèmes modernes qui communiquent directement avec l'utilisateur partagent une caractéristique commune : ils ont un système de dialogue à la base. Aujourd'hui pratiquement tous les composants d'un système de dialogue ont adopté des méthodes statistiques et les utilisent largement comme leurs modèles de base. Jusqu'à récemment, la génération de langage naturel (GLN) utilisait pour la plupart des patrons/modèles codés manuellement, qui représentaient des phrases types mappées à des réalisations sémantiques particulières. C'était le cas jusqu'à ce que les approches statistiques aient envahi la communauté de recherche en systèmes de dialogue. Dans cette thèse, nous suivons cette ligne de recherche et présentons une nouvelle approche à la génération de la langue naturelle. Au cours de notre travail, nous nous concentrons sur deux aspects importants du développement des systèmes de génération : construire un générateur performant et diversifier sa production. Les deux idées principales que nous défendons ici sont les suivantes : d'abord, la tâche de GLN peut être vue comme la traduction entre une langue naturelle et une représentation formelle de sens ; en second lieu, l'extension du corpus, qui impliquait traditionnellement des paraphrases définies manuellement et des règles spécialisées, peut être effectuée automatiquement en utilisant des méthodes d'extraction de synonymes et de paraphrases bien connues et largement utilisées. En ce qui concerne notre première idée, nous étudions la possibilité d'utiliser le cadre de la traduction automatique basé sur des modèles n-grammes ; nous explorons également le potentiel de l'apprentissage discriminant (notamment les champs aléatoires markoviens) appliqué à la GLN ; nous construisons un système de génération qui permet l'inclusion et la combinaison de différents modèles et qui utilise un cadre de décodage efficace (automate à états finis). En ce qui concerne le second objectif, qui est l'extension du corpus, nous proposons d'élargir la taille du vocabulaire et l'ensemble des structures syntaxiques disponibles via l'intégration de synonymes et de paraphrases. À notre connaissance, il n'y a pas eu de tentatives d'augmenter la taille du vocabulaire d'un système de GLN en incorporant des synonymes. À ce jour, la plupart des études sur l'extension du corpus visent les paraphrases et recourent au crowdsourcing pour les obtenir, ce qui nécessite une validation supplémentaire effectuée par les développeurs du système. Nous montrons que l'extension du corpus au moyen de l'extraction automatique de paraphrases et de la validation automatique est tout aussi efficace, tout en étant moins coûteuse en termes de temps de développement et de ressources. Au cours d'expériences intermédiaires, nos modèles ont montré une meilleure performance que celle obtenue par le modèle de référence basé sur les syntagmes et se sont révélés être plus robustes, pour le traitement des combinaisons inconnues de concepts, que le générateur à base de règles. L'évaluation humaine finale a prouvé que les modèles représentent une alternative solide au générateur à base de règles. / Humanity has long been passionate about creating intellectual machines that can freely communicate with us in our language. Most modern systems communicating directly with the user share one common feature: they have a dialog system (DS) at their base. As of today almost all DS components have embraced statistical methods and widely use them as their core models. Until recently the Natural Language Generation (NLG) component of a dialog system used primarily hand-coded generation templates, which represented model phrases in a natural language mapped to a particular semantic content. Today data-driven models are making their way into the NLG domain. In this thesis, we follow this new line of research and present several novel data-driven approaches to natural language generation. In our work we focus on two important aspects of NLG system development: building an efficient generator and diversifying its output. The two key ideas that we defend here are the following: first, the task of NLG can be regarded as translation between a natural language and a formal meaning representation, and can therefore be performed using statistical machine translation techniques; and second, corpus extension and diversification, which traditionally involved manual paraphrasing and rule crafting, can be performed automatically using well-known and widely used synonym and paraphrase extraction methods. Concerning our first idea, we investigate the possibility of using the NGRAM translation framework and explore the potential of discriminative learning, notably Conditional Random Fields (CRF) models, as applied to NLG; we build a generation pipeline which allows for the inclusion and combination of different generation models (NGRAM and CRF) and which uses an efficient decoding framework (finite-state transducers' best-path search). Regarding the second objective, namely corpus extension, we propose to enlarge the system's vocabulary and the set of available syntactic structures by integrating automatically obtained synonyms and paraphrases into the training corpus. To our knowledge, there have been no attempts to increase the size of the system vocabulary by incorporating synonyms. To date most studies on corpus extension have focused on paraphrasing and resorted to crowd-sourcing in order to obtain paraphrases, which then required additional manual validation, often performed by system developers. We show that automatic corpus extension by means of paraphrase extraction and automatic validation is just as effective as crowd-sourcing, while being less costly in terms of development time and resources. During intermediate experiments our generation models showed a significantly better performance than the phrase-based baseline model and appeared to be more robust in handling unknown combinations of concepts than the current in-house rule-based generator. The final human evaluation confirmed that our data-driven NLG models are a viable alternative to rule-based generators.
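As an illustration of the synonym-based corpus extension idea, here is a minimal sketch assuming NLTK with its WordNet data installed (nltk.download("wordnet")); the validation step, which the thesis performs automatically with extraction and scoring methods, is stubbed out here to keep the example short:

```python
from nltk.corpus import wordnet as wn

def synonym_variants(sentence, target_word):
    """Generate training-corpus variants by swapping in WordNet synonyms."""
    synonyms = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(target_word)
        for lemma in synset.lemmas()
    } - {target_word}
    candidates = [sentence.replace(target_word, s) for s in sorted(synonyms)]
    return [c for c in candidates if is_valid(c)]

def is_valid(candidate):
    # Stub: a real pipeline would score the candidate with a language
    # model and keep only fluent, meaning-preserving variants.
    return True

# Restaurant-domain example typical of dialog-system NLG training data.
print(synonym_variants("the restaurant is in the cheap price range", "cheap"))
```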
|
176 |
Model adaptation techniques in machine translation / Techniques d'adaptation en traduction automatique. Shah, Kashif, 29 June 2012.
L’approche statistique pour la traduction automatique semble être aujourd’hui l’approche la plus prometteuse. Cette approche permet de développer rapidement un système de traduction pour une nouvelle paire de langues lorsque les données d'apprentissage disponibles sont suffisamment conséquentes. Les systèmes de traduction automatique statistique (Statistical Machine Translation, SMT) utilisent des textes parallèles, aussi appelés bitextes, comme support d'apprentissage pour créer les modèles de traduction. Ils utilisent également des corpus monolingues afin de modéliser la langue cible. Les performances d'un système de traduction automatique statistique dépendent essentiellement de la qualité et de la quantité des données disponibles. Pour l'apprentissage d'un modèle de traduction, les textes parallèles sont collectés depuis différentes sources, dans différents domaines. Ces corpus sont habituellement concaténés et les phrases sont extraites suite à un processus d'alignement des mots. Néanmoins, les données parallèles sont assez hétérogènes et les performances des systèmes de traduction automatique dépendent généralement du contexte applicatif. Les performances varient la plupart du temps en fonction de la source des données d’apprentissage, de la qualité de l'alignement et de la cohérence des données avec la tâche. Les traductions, sélectionnées parmi différentes hypothèses, sont directement influencées par le domaine duquel sont récupérées les données d'apprentissage. C'est en contradiction avec l'apprentissage des modèles de langage, pour lesquels des techniques bien connues sont utilisées pour pondérer les différentes sources de données. Il apparaît donc essentiel de pondérer les corpus d’apprentissage en fonction de leur importance dans le domaine de la tâche de traduction. Nous avons proposé de nouvelles méthodes permettant de pondérer automatiquement les données hétérogènes afin d'adapter le modèle de traduction. Dans une première approche, cette pondération automatique est réalisée à l'aide d'une technique de ré-échantillonnage. Un poids est assigné à chaque bitexte en fonction de la proportion de données du corpus. Les alignements de chaque bitexte sont par la suite ré-échantillonnés en fonction de ces poids. Le poids attribué aux corpus est optimisé sur les données de développement en utilisant une méthode numérique. De plus, un score d'alignement relatif à chaque paire de phrases alignées est utilisé comme mesure de confiance. Dans un travail approfondi, nous pondérons en ré-échantillonnant des alignements, en utilisant des poids qui diminuent en fonction de la distance temporelle entre les bitextes et les données de test. Nous pouvons, de cette manière, utiliser tous les bitextes disponibles tout en mettant l'accent sur le plus récent. L'idée principale de notre approche est d'utiliser une forme paramétrique, ou des méta-poids, pour pondérer les différentes parties des bitextes. De cette manière, seuls quelques paramètres doivent être optimisés. Nous avons également proposé un cadre de travail générique qui, lors du calcul de la table de traduction, ne prend en compte que les corpus et les phrases réalisant les meilleurs scores. Cette approche permet une meilleure distribution des masses de probabilités sur les paires de phrases individuelles. Nous avons présenté les résultats de nos expériences dans différentes campagnes d'évaluation internationales, telles que IWSLT, NIST, OpenMT et WMT, sur les paires de langues anglais/arabe et français/arabe. Nous avons ainsi montré une amélioration significative de la qualité des traductions proposées. / Nowadays several indicators suggest that the statistical approach to machine translation is the most promising. It allows fast development of systems for any language pair provided that sufficient training data is available. Statistical Machine Translation (SMT) systems use parallel texts, also called bitexts, as training material for the creation of the translation model, and monolingual corpora for target language modeling. The performance of an SMT system heavily depends upon the quality and quantity of available data. In order to train the translation model, the parallel texts are collected from various sources and domains. These corpora are usually concatenated, word alignments are calculated and phrases are extracted. However, parallel data is quite inhomogeneous in many practical applications with respect to several factors like data source, alignment quality, appropriateness to the task, etc. This means that the corpora are not weighted according to their importance to the domain of the translation task. Therefore, it is the domain of the training resources that influences the translations that are selected among several choices. This is in contrast to the training of the language model, for which well-known techniques are used to weight the various sources of texts. We have proposed novel methods to automatically weight the heterogeneous data in order to adapt the translation model. In a first approach, this is achieved with a resampling technique. A weight is assigned to each bitext to select the proportion of data from that corpus. The alignments coming from each bitext are resampled based on these weights. The weights of the corpora are directly optimized on the development data using a numerical method. Moreover, an alignment score of each aligned sentence pair is used as a confidence measure. In an extended work, we obtain such a weighting by resampling alignments using weights that decrease with the temporal distance of bitexts to the test set. By these means, we can use all the available bitexts and still put an emphasis on the most recent ones. The main idea of our approach is to use a parametric form, or meta-weights, for the weighting of the different parts of the bitexts. This ensures that our approach has only a few parameters to optimize. In another work, we have proposed a generic framework which takes into account corpus- and sentence-level "goodness scores" during the calculation of the phrase table, which results in a better distribution of the probability mass over the individual phrase pairs.
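The temporal weighting scheme described above can be sketched in a few lines of Python; the decay rate below stands in for the kind of meta-parameter the thesis optimizes on development data, and the sentence pairs and ages are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_bitext(sentence_pairs, ages_in_years, decay=0.5, sample_size=100_000):
    """Resample aligned sentence pairs with weights that decay exponentially
    with the temporal distance between each bitext and the test set."""
    weights = np.exp(-decay * np.asarray(ages_in_years, dtype=float))
    probs = weights / weights.sum()  # normalize to a sampling distribution
    idx = rng.choice(len(sentence_pairs), size=sample_size, replace=True, p=probs)
    return [sentence_pairs[i] for i in idx]

pairs = [("maison bleue", "blue house"), ("accord signé", "signed agreement")]
ages = [8.0, 0.5]  # an older corpus versus recent news data
print(resample_bitext(pairs, ages, sample_size=10))
```

All bitexts remain usable, but recent material dominates the resampled training set, which is exactly the trade-off the parametric form controls.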
|
177 |
Sélection de corpus en traduction automatique statistique / Efficient corpus selection for statistical machine translation. Abdul Rauf, Sadaf, 17 January 2012.
Dans notre monde de communications au niveau international, la traduction automatique est devenue une technologie clef incontournable. Plusieurs approches existent, mais depuis quelques années ladite traduction automatique statistique est considérée comme la plus prometteuse. Dans cette approche, toutes les connaissances sont extraites automatiquement à partir d'exemples de traductions, appelés textes parallèles, et des données monolingues en langue cible. La traduction automatique statistique est un processus guidé par les données. Ceci est communément avancé comme un grand avantage des approches statistiques puisque l'intervention d'êtres humains bilingues n'est pas nécessaire, mais peut se retourner en un problème lorsque ces données nécessaires au développement du système ne sont pas disponibles, sont de taille insuffisante ou d'un genre qui ne convient pas. Les recherches présentées dans cette thèse sont une tentative pour surmonter un des obstacles au déploiement massif de systèmes de traduction automatique statistique : le manque de corpus parallèles. Un corpus parallèle est une collection de phrases en langues source et cible qui sont alignées au niveau de la phrase. La plupart des corpus parallèles existants ont été produits par des traducteurs professionnels. Ceci est une tâche coûteuse, en termes d'argent, de ressources humaines et de temps. Dans la première partie de cette thèse, nous avons travaillé sur l'utilisation de corpus comparables pour améliorer les systèmes de traduction statistique. Un corpus comparable est une collection de données en plusieurs langues, collectées indépendamment, mais qui contiennent souvent des parties qui sont des traductions mutuelles. La taille et la qualité des contenus parallèles peuvent varier considérablement d'un corpus comparable à un autre, en fonction de divers facteurs, notamment la méthode de construction du corpus. Dans tous les cas, il n'est pas aisé d'identifier automatiquement des parties parallèles. Dans le cadre de cette thèse, nous avons développé une telle approche qui est entièrement basée sur des outils librement disponibles. L'idée principale de notre approche est l'utilisation d'un système de traduction automatique statistique pour traduire toutes les phrases en langue source du corpus comparable. Chacune de ces traductions est ensuite utilisée en tant que requête afin de trouver des phrases potentiellement parallèles. Cette recherche est effectuée à l'aide d'un outil de recherche d'information. Dans une deuxième étape, les phrases obtenues sont comparées aux traductions automatiques afin de déterminer si elles sont effectivement parallèles à la phrase correspondante en langue source. Plusieurs critères ont été évalués, tels que le taux d'erreur de mots ou le « translation edit rate » (TER). Nous avons effectué une analyse expérimentale très détaillée afin de démontrer l'intérêt de notre approche. Les corpus comparables utilisés se situent dans le domaine des actualités, plus précisément, des dépêches d'actualités des agences de presse telles que « Agence France-Presse (AFP) », « Associated Press » ou « Xinhua News ». Ces agences publient quotidiennement des actualités en plusieurs langues. Nous avons pu extraire des textes parallèles à partir de grandes collections de plus de trois cents millions de mots pour les paires de langues français/anglais et arabe/anglais. Ces textes parallèles ont permis d'améliorer significativement nos systèmes de traduction statistique.
Nous présentons également une comparaison théorique du modèle développé dans cette thèse avec une autre approche présentée dans la littérature. Diverses extensions sont également étudiées : l'extraction automatique de mots inconnus et la création d'un dictionnaire, la détection et la suppression d'informations supplémentaires, etc. Dans la deuxième partie de cette thèse, nous avons examiné la possibilité d'utiliser des données monolingues afin d'améliorer le modèle de traduction d'un système statistique... / In our world of international communications, machine translation has become a key, indispensable technology. Several approaches exist, but in recent years the so-called Statistical Machine Translation (SMT) is considered the most promising. In this approach, knowledge is automatically extracted from examples of translations, called parallel texts, and from monolingual data in the target language. Statistical machine translation is a data-driven process. This is commonly put forward as a great advantage of statistical approaches, since no human intervention is required, but this can also turn into a problem when the necessary development data are not available, are too small or the domain is not appropriate. The research presented in this thesis is an attempt to overcome one of the barriers to the massive deployment of statistical machine translation systems: the lack of parallel corpora. A parallel corpus is a collection of sentences in the source and target languages that are aligned at the sentence level. Most existing parallel corpora were produced by professional translators. This is an expensive task in terms of money, human resources and time. This thesis provides methods to overcome this need by exploiting the easily available, huge comparable and monolingual data collections. We present two effective architectures to achieve this. In the first part of this thesis, we worked on the use of comparable corpora to improve statistical machine translation systems. A comparable corpus is a collection of texts in multiple languages, collected independently, but often containing parts that are mutual translations. The size and quality of the parallel contents may vary considerably from one comparable corpus to another, depending on various factors, including the method of construction of the corpus. In any case, it is not easy to automatically identify the parallel parts. As part of this thesis, we developed such an approach, which is entirely based on freely available tools. The main idea of our approach is to use a statistical machine translation system to translate all sentences of the source-language comparable corpus into the target language. Each of these translations is then used as a query to identify potentially parallel sentences. This search is carried out using an information retrieval toolkit. In a second step, the retrieved sentences are compared to the automatic translation to determine whether they are parallel to the corresponding sentence in the source language. Several criteria were evaluated, such as the word error rate or the translation edit rate (TER) and TERp. We conducted a very detailed experimental analysis to demonstrate the interest of our approach. We worked on comparable corpora from the news domain, more specifically, multilingual news agencies such as "Agence France-Presse (AFP)", "Associated Press" or "Xinhua News". These agencies publish daily news in several languages.
We were able to extract parallel texts from large collections of over three hundred million words for the French-English and Arabic-English language pairs. These parallel texts significantly improved our statistical translation systems. We also present a theoretical comparison of the model developed in this thesis with another approach presented in the literature. Various extensions are also discussed: automatic extraction of unknown words and the creation of a dictionary, detection and suppression of extra information, etc. In the second part of this thesis, we examined the possibility of using monolingual data to improve the translation model of a statistical system. The idea here is to replace parallel data by monolingual source or target language data. This research is thus placed in the context of unsupervised learning, since the missing translations are produced by an automatic translation system and, after various filtering steps, reinjected into the system...
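The filtering step of this mining pipeline can be sketched as follows; translate() and search() are hypothetical stand-ins for the SMT system and the information retrieval toolkit used in the thesis, and word-level edit distance serves as a rough stand-in for TER, which additionally allows block shifts:

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (a rough stand-in for TER)."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def mine_parallel(source_sentences, translate, search, max_ter=0.4):
    """Keep retrieved target sentences close enough to the MT of the source."""
    mined = []
    for src in source_sentences:
        query = translate(src)                # SMT translation of the source
        for cand in search(query, top_n=5):   # candidates from the target side
            ter = word_edit_distance(query, cand) / max(len(cand.split()), 1)
            if ter <= max_ter:
                mined.append((src, cand))
    return mined
```

The threshold max_ter plays the role of the filtering criterion evaluated in the thesis: lower values yield cleaner but smaller mined corpora.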
|
178 |
Graph Models For Query Focused Text Summarization And Assessment Of Machine Translation Using Stopwords. Rama, B, 06 1900 (PDF).
Text summarization is the task of generating a shortened version of the original text in which its core ideas are retained. In this work, we focus on query focused summarization: the task is to generate, from a set of documents, a summary that answers the query. Query focused summarization is a hard task because it expects the summary to be biased towards the query while, at the same time, preserving the important concepts of the original documents with a high degree of novelty.
Graph-based ranking algorithms that use a biased random-surfer model, such as Topic-sensitive LexRank, have been applied to query focused summarization. In our work, we propose a look-ahead version of Topic-sensitive LexRank: we incorporate a look-ahead option in the random walk model and show that it helps to generate better quality summaries.
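For reference, the baseline that the proposed look-ahead version extends can be written as a biased PageRank iteration; the look-ahead modification itself is not reproduced here, and the toy similarity values below stand in for TF-IDF cosine similarities between sentences and between each sentence and the query:

```python
import numpy as np

def topic_sensitive_lexrank(sim, query_sim, damping=0.85, iters=100):
    """Biased random walk over a sentence-similarity graph: with probability
    (1 - damping) the surfer jumps to query-relevant sentences."""
    n = sim.shape[0]
    W = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    bias = query_sim / query_sim.sum()        # jump distribution from the query
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = (1 - damping) * bias + damping * (W.T @ p)
    return p  # rank sentences by p, highest first

sim = np.array([[1.0, 0.3, 0.1],
                [0.3, 1.0, 0.5],
                [0.1, 0.5, 1.0]])
query_sim = np.array([0.2, 0.7, 0.1])
print(topic_sensitive_lexrank(sim, query_sim))
```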
Next, we consider the assessment of machine translation. Assessing a machine translation output is important for establishing benchmarks for translation quality. An obvious way to assess the quality of machine translation is through the perception of human subjects. Though highly reliable, this approach is not scalable and is time consuming. Hence mechanisms have been devised to automate the assessment process. All such assessment methods are essentially a study of correlations between human and machine translations.
In this work, we present a scalable approach to assessing the quality of machine translation that borrows features from the study of writing styles, popularly known as Stylometry. Towards this, we quantify the characteristic styles of individual machine translators and compare them with those of human-generated text. The translator whose style is closest to the human style is deemed to produce the higher quality translation. We show that our approach is scalable and does not require actual source text translations for evaluation.
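A minimal sketch of the stopword-based stylometric comparison might look as follows; the file names are hypothetical, the stopword list is truncated for brevity, and Euclidean distance merely stands in for whatever style-distance measure the thesis actually uses:

```python
from collections import Counter

STOPWORDS = ["the", "of", "and", "to", "in", "a", "is", "that", "it", "for"]

def stopword_profile(text):
    """Relative frequencies of function words: a crude stylistic fingerprint."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in STOPWORDS)
    total = max(sum(counts.values()), 1)
    return [counts[w] / total for w in STOPWORDS]

def distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Hypothetical corpora: one human-written text, two MT system outputs.
human = stopword_profile(open("human_text.txt", encoding="utf-8").read())
systems = {"mt_a": "mt_a_output.txt", "mt_b": "mt_b_output.txt"}
ranking = sorted(systems,
                 key=lambda s: distance(stopword_profile(open(systems[s], encoding="utf-8").read()), human))
print("closest to human style:", ranking[0])  # deemed the higher-quality system
```

Note that no reference translations of the source text are needed, which is what makes the approach scalable.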
|
179 |
Enseigner la traduction humaine en s'inspirant de la traduction automatique / Teaching human translation taking inspiration from machine translation / Insegnare la traduzione umana ispirandosi alla traduzione automatica. Cennamo, Ilaria, 15 May 2015.
Notre projet de recherche concerne l’étude de l’interaction homme-machine (H-M) en situation d’enseignement/apprentissage de la traduction de l’italien au français. Notre thèse est centrée notamment sur l’analyse de l’utilité pédagogique issue de l’intégration d’un traducteur automatique basé sur des règles dans un contexte d’apprentissage de la traduction de niveau Master, auprès de l’université de Gênes. Existerait-il une possibilité d’interaction entre la pensée traductionnelle humaine et la pensée traductionnelle machine qui puisse s’avérer efficace dans un contexte de pédagogie de la traduction ? Notre projet de recherche vise à répondre à cette question à travers la mise en place d’une expérimentation pédagogique qui s’appuie sur l’interaction entre l’apprenti traducteur humain et le système Apertium. L’hypothèse émise est qu’une telle interaction entre l’apprenti humain et notre prototype de traducteur automatique puisse favoriser la réflexion méta-traductionnelle chez l’apprenti humain, en encourageant sa prise de conscience des nombreux facteurs impliqués dans l’activité traduisante, et en contribuant à son apprentissage de la traduction au niveau de la systématisation de ses connaissances traductionnelles. / Our project aims at studying human-machine (H-M) interaction in the context of Italian to French translation teaching and learning, at a master's degree level in translation and interpretation. More precisely, our focus is on the pedagogical usefulness of the H-M interaction put in place through the integration of a rule-based machine translator, namely the system Apertium, in a prototypical version. Can this interaction between machine translation and human translation strategies represent a useful pedagogical tool for translation training? Our hypothesis is that the H-M interaction taking place between human translation learners and our machine translation prototype can encourage the learners' meta-translational reflection. This process would help them become aware of all the factors involved in translating, and would allow the systematisation of their translation knowledge.
|
180 |
Système de traduction automatique français-chinois dans le domaine de la sécurité globale / French-Chinese machine translation system for global security. Jin, Gan, 19 February 2015.
Dans ce mémoire, nous présentons, outre les résultats de recherche en vue d’un système de traduction automatique français-chinois, les apports théoriques à partir de la théorie SyGULAC et de la théorie micro-systémique avec ses calculs, ainsi que les méthodologies élaborées tendant à une application sûre et fiable dans le cadre de la traduction automatique. L’application porte sur des domaines de sécurité critique tels que l’aéronautique, la médecine, la sécurité civile. Tout d’abord, un état de l’art du domaine de la traduction automatique, en Chine et en France, est utile pour commencer la lecture. Les faiblesses des systèmes actuels, mises en évidence à travers des tests que nous réalisons, prouvent l’intérêt de cette recherche. Nous donnons les raisons pour lesquelles nous avons choisi la théorie micro-systémique et la théorie SyGULAC. Nous expliquons ensuite les problématiques rencontrées au cours de notre recherche. L’ambigüité, obstacle majeur pour la compréhensibilité et la traductibilité d’un texte, se situe à tous les niveaux de la langue : syntaxique, morphologique, lexical, nominal ou encore verbal. L’identification des unités d’une phrase est aussi une étape préalable à la compréhension globale, que ce soit pour un être humain ou pour un système de traduction. Nous dressons un état des lieux de la divergence entre la langue française et la langue chinoise en vue de réaliser un système de traduction automatique. Nous essayons d’observer la structure aux niveaux verbal, nominal et lexical, de comprendre leurs liens et leurs interactions. Nous définissons également les obstacles qui entravent la réalisation de cette recherche, d’un point de vue théorique mais aussi en étudiant notre corpus concret. Le formalisme pour lequel nous avons opté part d’une étude approfondie de la langue utilisée dans les protocoles de sécurité. Une langue ne se prête au traitement automatique que si elle est formalisée. De ce fait, nous avons procédé à l’analyse de plusieurs corpus bilingues français/chinois mais aussi monolingues émanant d’organismes de sécurité civile. Le but est de dégager les particularités linguistiques (lexicales, syntaxiques, ...) qui caractérisent la langue de la sécurité en général et de recenser toutes les structures syntaxiques qu’utilise cette langue. Après avoir présenté la formalisation de notre système, nous montrons les processus de reconnaissance, de transfert et de génération. / In this thesis, in addition to our research results towards a French-Chinese machine translation system, we present the theoretical contributions of the SyGULAC theory and of the micro-systemic theory with its calculations, as well as the methodologies developed, aimed at a secure and reliable application in the context of machine translation. The application covers critical safety areas such as aerospace, medicine and civil security. After presenting the state of the art in the field of machine translation in China and France, the reasons for choosing the micro-systemic theory and the SyGULAC theory are explained. Then we describe the problems encountered during our research. Ambiguity, the major obstacle to the understandability and translatability of a text, is present at all levels of language: syntactic, morphological, lexical, nominal and verbal. The identification of the units of a sentence is also a preliminary step towards global understanding, whether for human beings or for a translation system.
We present an inventory of the divergences between the French and the Chinese languages with a view to building a machine translation system. We try to observe the structure at the verbal, nominal and lexical levels, in order to understand their interconnections and interactions. We also define the obstacles to this research, from a theoretical point of view but also by studying our corpus. The chosen formalism starts from a thorough study of the language used in security protocols. A language lends itself to automatic processing only if it is formalized. Therefore, an analysis of several French/Chinese bilingual corpora, but also monolingual ones, from civil security agencies, was conducted. The goal is to identify and present the linguistic characteristics (lexical, syntactic, ...) which characterize the language of security in general, and to catalogue all the syntactic structures used by this language. After presenting the formalization of our system, we describe the recognition, transfer and generation processes.
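As a toy illustration of the recognition, transfer and generation stages described above (the micro-lexicon below is invented for the example and bears no relation to the actual SyGULAC rules or the system's coverage):

```python
# Toy French-to-Chinese transfer over a civil-security phrase.
LEXICON = {"en cas d'incendie": "如遇火灾", "évacuez": "请撤离", "le bâtiment": "建筑物"}

def translate_fr_zh(sentence):
    # Recognition: segment the sentence into known (multiword) units,
    # matching the longest units first.
    units, rest = [], sentence.lower().rstrip(".")
    for fr in sorted(LEXICON, key=len, reverse=True):
        if fr in rest:
            units.append(fr)
            rest = rest.replace(fr, " ")
    # Transfer: map each French unit to its Chinese equivalent,
    # preserving the source order of the units.
    zh_units = [LEXICON[u] for u in sorted(units, key=sentence.lower().find)]
    # Generation: concatenate (written Chinese uses no word spaces).
    return "".join(zh_units) + "。"

print(translate_fr_zh("En cas d'incendie, évacuez le bâtiment."))
```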
|