  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
801

Applying Coreference Resolution for Usage in Dialog Systems

Rolih, Gabi January 2018
Using references in language is a major part of communication, and understanding them is not a challenge for humans. Recent years have seen increased usage of dialog systems that interact with humans in natural language to assist them in various tasks, but even the most sophisticated systems still struggle with understanding references. In this thesis, we adapt a coreference resolution system for usage in dialog systems and try to understand what is needed for an efficient understanding of references in dialog systems. We annotate a portion of logs from a customer service system and perform an analysis of the most common coreferring expressions appearing in this type of data. This analysis shows that most coreferring expressions are nominal and pronominal, and they usually appear within two sentences of each other. We implement Stanford's Multi-Pass Sieve with some adaptations and dialog-specific changes and integrate it into a dialog system framework. The preprocessing pipeline makes use of already existing NLP tools, while some new ones are added, such as a chunker, a head-finding algorithm and a NER-like system. To analyze both user input and output of the system, we deploy two separate coreference resolution systems that interact with each other. An evaluation is performed on the system and its separate parts using the five most common evaluation metrics. The system does not achieve state-of-the-art numbers, but that is expected given its domain-specific nature. Some parts of the system do not have any effect on the performance, while the dialog-specific changes contribute to it greatly. An error analysis is conducted and reveals some problems with the implementation, but more importantly, it shows how the system could be further improved by using other types of knowledge and dialog-specific features.
802

Complex Word Identification for Swedish

Smolenska, Greta January 2018
Complex Word Identification (CWI) is the task of identifying complex words in text data, and it is often viewed as a subtask of Automatic Text Simplification (ATS), where the main task is making a complex text simpler. The ways in which a text should be simplified depend on the target readers, such as second language learners or people with reading disabilities. In this thesis, we focus on Complex Word Identification for Swedish. First, in addition to exploring existing resources, we collect a new dataset for Swedish CWI. We continue by building several classifiers of Swedish simple and complex words. We then use the findings to analyze the characteristics of lexical complexity in Swedish and English. Our method for collecting training data based on second language learning material achieved positive evaluation scores and resulted in a new dataset for Swedish CWI. Additionally, the complex word classifiers we built have an accuracy at least as good as similar systems for English. Finally, the analysis of the selected features confirms the findings of previous studies and reveals some interesting characteristics of lexical complexity.
803

Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia / Utilisation de ressources dans une langue proche pour la reconnaissance automatique de la parole pour les langues peu dotées de Malaisie

Samson Juan, Sarah Flora 09 July 2015
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages have recently become extinct. One of the methods to save languages is by documenting them, but it is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods could improve performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a locally dominant language which is close to Iban, for developing Iban ASR under different resource constraints. We have proposed several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary. We developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban. It was based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models we explored two acoustic modelling techniques, the Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks for adapting out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR. We proposed a fine merging strategy for obtaining an optimal multi-accent SGMM. In addition, we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for cross-lingual strategy is effective when training data is very limited.
804

Le traitement automatique de l’arabe dialectalisé : aspects méthodologiques et algorithmiques / Automatic processing of dialectal Arabic : methodological and algorithmic aspects

Saadane, Houda 14 December 2015
The author did not provide a French abstract. / The author did not provide an English abstract.
805

Deep generative models for natural language processing

Miao, Yishu January 2017
Deep generative models are essential to Natural Language Processing (NLP) due to their outstanding ability to use unlabelled data, to incorporate abundant linguistic features, and to learn interpretable dependencies among data. As the structure becomes deeper and more complex, having an effective and efficient inference method becomes increasingly important. In this thesis, neural variational inference is applied to carry out inference for deep generative models. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. The powerful neural networks are able to approximate complicated non-linear distributions and open up possibilities for more interesting and complicated generative models. Therefore, we develop the potential of neural variational inference and apply it to a variety of models for NLP with continuous or discrete latent variables. This thesis is divided into three parts. Part I introduces a generic variational inference framework for generative and conditional models of text. For continuous or discrete latent variables, we apply a continuous reparameterisation trick or the REINFORCE algorithm to build low-variance gradient estimators. To further explore Bayesian non-parametrics in deep neural networks, we propose a family of neural networks that parameterise categorical distributions with continuous latent variables. Using the stick-breaking construction, an unbounded categorical distribution is incorporated into our deep generative models which can be optimised by stochastic gradient back-propagation with a continuous reparameterisation. Part II explores continuous latent variable models for NLP. Chapter 3 discusses the Neural Variational Document Model (NVDM): an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. In Chapter 4, the neural topic models modify the neural document models by parameterising categorical distributions with continuous latent variables, where the topics are explicitly modelled by discrete latent variables. The models are further extended to neural unbounded topic models with the help of the stick-breaking construction, and a truncation-free variational inference method is proposed based on a Recurrent Stick-breaking construction (RSB). Chapter 5 describes the Neural Answer Selection Model (NASM) for learning a latent stochastic attention mechanism to model the semantics of question-answer pairs and predict their relatedness. Part III discusses discrete latent variable models. Chapter 6 introduces latent sentence compression models. The Auto-encoding Sentence Compression Model (ASC), as a discrete variational auto-encoder, generates a sentence by a sequence of discrete latent variables representing explicit words. The Forced Attention Sentence Compression Model (FSC) incorporates a combined pointer network biased towards the usage of words from the source sentence, which significantly improves the performance when jointly trained with the ASC model in a semi-supervised learning fashion. Chapter 7 describes the Latent Intention Dialogue Models (LIDM) that employ a discrete latent variable to learn underlying dialogue intentions. Additionally, the latent intentions can be interpreted as actions guiding the generation of machine responses, which could be further refined autonomously by reinforcement learning. Finally, Chapter 8 summarizes our findings and directions for future work.
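
The stick-breaking construction mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's Recurrent Stick-breaking (RSB) model: the function name `stick_breaking` and the Beta(1, 2) draws in the demo are assumptions made for illustration. The sketch only shows how break fractions v_k in [0, 1] become categorical weights pi_k = v_k * prod_{j<k} (1 - v_j), with the unassigned mass returned separately.

```python
import random


def stick_breaking(fractions):
    """Convert break fractions v_k into weights pi_k = v_k * prod_{j<k} (1 - v_j).

    Hypothetical helper for illustration; returns (weights, remaining_mass).
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for v in fractions:
        if not 0.0 <= v <= 1.0:
            raise ValueError("each fraction must lie in [0, 1]")
        weights.append(v * remaining)  # break off a share of what is left
        remaining *= 1.0 - v           # shrink the remaining stick
    return weights, remaining


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed prior for the demo only: fractions drawn from Beta(1, 2).
    fractions = [rng.betavariate(1.0, 2.0) for _ in range(5)]
    weights, leftover = stick_breaking(fractions)
    print(weights, leftover)
```

Because the weights plus the leftover mass always sum to one, the number of components does not have to be fixed in advance; more breaks can be added until the leftover is negligible.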
806

Automatic movie analysis and summarisation

Gorinski, Philip John January 2018
Automatic movie analysis is the task of applying Machine Learning methods to screenplays, movie scripts, and motion pictures to facilitate or enable various tasks throughout the entirety of a movie’s life-cycle. From helping with making informed decisions about a new movie script with respect to aspects such as its originality, similarity to other movies, or even commercial viability, all the way to offering consumers new and interesting ways of viewing the final movie, many stages in the life-cycle of a movie stand to benefit from Machine Learning techniques that promise to reduce human effort, time, or both. Within this field of automatic movie analysis, this thesis addresses the task of summarising the content of screenplays, enabling users at any stage to gain a broad understanding of a movie from greatly reduced data. The contributions of this thesis are four-fold: (i) We introduce ScriptBase, a new large-scale data set of original movie scripts, annotated with additional meta-information such as genre and plot tags, cast information, and log- and tag-lines. To our knowledge, ScriptBase is the largest data set of its kind, containing scripts and information for almost 1,000 Hollywood movies. (ii) We present a dynamic summarisation model for the screenplay domain, which allows for extraction of highly informative and important scenes from movie scripts. The extracted summaries allow for the content of the original script to stay largely intact and provide the user with its important parts, while greatly reducing the script-reading time. (iii) We extend our summarisation model to capture additional modalities beyond the screenplay text. The model is rendered multi-modal by introducing visual information obtained from the actual movie and by extracting scenes from the movie, allowing users to generate visual summaries of motion pictures. (iv) We devise a novel end-to-end neural network model for generating natural language screenplay overviews.
This model enables the user to generate short descriptive and informative texts that capture certain aspects of a movie script, such as its genres, approximate content, or style, allowing them to gain a fast, high-level understanding of the screenplay. Multiple automatic and human evaluations were carried out to assess the performance of our models, demonstrating that they are well-suited for the tasks set out in this thesis, outperforming strong baselines. Furthermore, the ScriptBase data set has started to gain traction, and is currently used by a number of other researchers in the field to tackle various tasks relating to screenplays and their analysis.
807

L’Informatique au service des sciences du langage : la conception d’un programme étudiant le parler arabe libanais blanc / Computer science at the service of language sciences : the design of a program studying Arabic Lebanese white speech

El Hage, Antoine 25 January 2017
At a time when computer science has invaded all aspects of our daily life, it is natural to see the computing field participating in work in the human and social sciences, and more particularly in linguistics, where the need to develop computer software is becoming more and more pressing with the growing volume of analyzed corpora. Hence our thesis, which consists in elaborating a program, EPL, that studies white Lebanese Arabic speech. Starting from a corpus built from two TV programs that were recorded and then transcribed in Arabic letters, the EPL program, developed with Access software, allowed us to extract words and collocations, and to carry out a linguistic analysis at the lexical, phonetic, syntactic and collocational levels. The EPL's functioning as well as its development code are described in detail in a separate computing part. Substantial appendices conclude the thesis and gather the results of the work of a whole team of researchers from many different specialties.
808

Traitements formels et sémantiques des échanges et des documents textuels liés à des activités collaboratives / Formal and semantic processing of textual exchanges and documents related to collaborative activities

Kalitvianski, Ruslan 20 March 2018
This thesis addresses the problem of extracting meaning from texts and textual flows, produced in our case during collaborative processes. More specifically, we are interested in work-related emails and in textual documents that are objects of collaboration, with a first application to educational documents. The motivation is to help users access useful information more quickly; we therefore seek to locate that information in the texts. Thus, we are interested in the tasks referred to in emails, and in the fragments of educational documents that concern the topics users are interested in. Two corpora, one of emails and one of educational documents, mainly in French, have been created. This was essential because there is virtually no previous work on this type of data in French. Our first theoretical contribution is a generic model of the structure of these data. We use it to specify the formal processing of documents, a prerequisite for semantic processing. We demonstrate the difficulty of the problem of segmentation, normalization and structuring of documents in different source formats, and present the SEGNORM tool, the first software contribution of this thesis. SEGNORM segments and normalizes documents (in plain or tagged text), recursively and in units of configurable size. In the case of emails, it segments messages containing quoted messages into individual messages, while keeping the information about the chaining between the intertwined fragments. It also analyzes the metadata of the messages to reconstruct discussion threads, and recovers from the quotations the messages for which the source file is not available. We then discuss the semantic processing of these documents. We propose an (ontological) model of the notion of task, then describe the annotation of a corpus of several hundred messages originating from the professional context of VISEO and GETALP. We then present the second software contribution of this thesis: a tool for locating tasks and extracting their attributes (temporal constraints, assignees, etc.). This tool, based on a combination of an expert approach and machine learning, is evaluated according to the classic criteria of precision, recall and F-measure, as well as according to quality of use. Finally, we present our work on the MACAU-CHAMILO platform, the third software contribution, which supports learning through (1) the structuring of educational documents according to two ontologies (form and content), and (2) multilingual access to initially monolingual content. This is therefore again about structuring along the two axes, form and meaning. (1) The ontology of forms makes it possible to annotate document fragments with concepts such as theorem, proof, example, with levels of difficulty and abstraction, and with relations such as elaboration_of and illustration_of. The domain ontology models the formal objects of computer science, and more precisely the notions of computational complexity. This makes it possible to suggest to users fragments useful for understanding computer science notions perceived as abstract or difficult. (2) The multilingual access aspect was motivated by the observation that our universities welcome a large number of foreign students, who often have difficulty understanding our courses because of the language barrier. We proposed an approach to multilingualize educational content with the help of foreign students, by online post-editing of automatic pre-translations, followed, if necessary, by incremental improvement of these post-edits. (Our experiments have shown that multilingual versions of documents can be produced quickly and without cost.) This work resulted in a corpus of more than 500 standard pages (250 words/page) of educational content post-edited into Chinese.
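
The evaluation criteria named in this abstract (precision, recall and F-measure) can be sketched in a few lines of Python. This is an illustrative sketch only, not the thesis's evaluation code: the function name `precision_recall_f1` and the task identifiers in the demo are hypothetical, and exact set matching between detected and annotated tasks is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a gold-standard set.

    Hypothetical helper: items count as correct only on exact match.
    Returns (precision, recall, F1), using 0.0 where a ratio is undefined.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    detected = {"task-1", "task-2", "task-3", "task-4"}  # made-up system output
    annotated = {"task-1", "task-2", "task-5"}           # made-up gold annotations
    print(precision_recall_f1(detected, annotated))
```

In this toy run two of the four detected tasks are correct and two of the three annotated tasks are found, so precision is 1/2 and recall is 2/3.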
809

Alinhamento léxico utilizando técnicas híbridas discriminativas e de pós-processamento / Text alignment

Schreiner, Paulo January 2010
Lexical alignment is an essential task for modern empirical machine translation techniques. The unsupervised generative approach is being replaced by a supervised, discriminative one that considerably facilitates the inclusion of linguistic knowledge from several sources. Given this context, the present work describes a series of discriminative lexical aligners that incorporate post-processing heuristics with the goal of improving the quality of the alignments of multiword expressions, which are one of the major challenges in natural language processing today. The evaluation is conducted using a gold standard obtained by annotating a parallel corpus of movie subtitles. The proposed aligners show an alignment quality that is superior both to our baseline and to a state-of-the-art generative aligner (Giza++), for the general case as well as for the expressions that are the focus of this work.
810

Leitura, tradução e medidas de complexidade textual em contos da literatura para leitores com letramento básico

Pasqualini, Bianca Franco January 2012
This work analyzes textual complexity and readability patterns from a computational perspective, situating the problem through the description of original and translated texts, based on theoretical postulates from Translation Studies, Corpus Linguistics and Natural Language Processing. We investigated the hypothesis that there are translations of English-language literature made in Brazil that tend to generate more complex texts than their originals, taking as a parameter the typical Brazilian reader, whose reading skills are at a basic level according to official data. To test this hypothesis, we processed, using the Coh-Metrix and Coh-Metrix-Port tools, a set of literary short stories by various authors in English and their translations into Brazilian Portuguese, and, as a contrast, a set of short stories by Brazilian authors from the same period and their translations into English. The Coh-Metrix and Coh-Metrix-Port tools calculate cohesion, coherence and textual intelligibility parameters at different linguistic levels, and the metrics studied were those that are linguistically and grammatically equivalent between the two languages. We also carried out a statistical test (Student's t-test) for each metric, and between translations, to assess whether the difference between the mean results is significant. Finally, we introduced Computational Linguistics methods such as Machine Learning to deepen the analysis. The results indicate that translations into Portuguese are more complex than their source texts in some of the measures analyzed, and that they are not suitable for readers with basic reading skills. In addition, the Flesch readability index proved to be the most discriminating measure between texts translated from English into Brazilian Portuguese and texts originally written in Portuguese. We conclude that it is important to: a) review the equivalence of complexity metrics between the Coh-Metrix systems for English and Portuguese; b) propose specific metrics for the languages studied; and c) expand the adequacy criteria beyond the lexical level.
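
As an illustration of the kind of readability measure discussed in this abstract, here is a minimal Python sketch of the Flesch Reading Ease score. The constants are the standard ones for English; adaptations for Portuguese use different constants, which are not reproduced here. The function name and the counts in the demo are hypothetical, and this is not the Coh-Metrix or Coh-Metrix-Port implementation.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease with the standard English constants.

    Hypothetical helper: the caller supplies the word, sentence and
    syllable counts. Higher scores indicate text that is easier to read.
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Made-up counts: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
```

Taking counts as input keeps the sketch independent of any particular tokenizer or syllable counter, both of which are language-specific.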
