Spelling suggestions: "subject:"text simplification"" "subject:"text implification""
1 |
Machine Translation and Text Simplification EvaluationTapkanova, Elmira 01 January 2016 (has links)
Machine translation translates a text from one language to another, while text simplification converts a text from its original form to a simpler one, usually in the same language. This survey paper discusses the evaluation (manual and automatic) of both fields, providing an overview of existing metrics along with their strengths and weaknesses. The first chapter takes an in-depth look at machine translation evaluation metrics, namely BLEU, NIST, AMBER, LEPOR, MP4IBM1, TER, MMS, METEOR, TESLA, RTE, and HTER. The second chapter focuses more generally on text simplification, starting with a discussion of the theoretical underpinnings of the field (i.e what ``simple'' means). Then, an overview of automatic evaluation metrics, namely BLEU and Flesch-Kincaid, is given, along with common approaches to text simplification. The paper concludes with a discussion of the future trajectory of both fields.
|
2 |
Improved Automatic Text Simplification by Manual Training / Förbättrad automatisk textförenkling genom manuell träningRennes, Evelina January 2015 (has links)
The purpose of this thesis was the further development of a rule set used in an automatic text simplification system, and the exploration of whether it is possible to improve the performance of a rule based text simplification system by manual training. A first rule set was developed from a thor- ough literature review, and the rule refinement was performed by manually adapting the first rule set to a set of training texts. When there was no more change added to the set of rules, the training was considered to be completed, and the two sets were applied to a test set, for evaluation. This thesis evaluated the performance of a text simplification system as a clas- sification task, by the use of objective metrics: precision and recall. The comparison of the rule sets revealed a clear improvement of the system, since precision increased from 45% to 82%, and recall increased from 37% to 53%. Both recall and precision was improved after training for the ma- jority of the rules, with a few exceptions. All rule types resulted in a higher score on correctness for R2. Automatic text simplification systems target- ing real life readers need to account for qualitative aspects, which has not been considered in this thesis. Future evaluation should, in addition to quantitative metrics such as precision, recall, and complexity metrics, also account for the experience of the reader.
|
3 |
New data-driven approaches to text simplificationŠtajner, Sanja January 2016 (has links)
No description available.
|
4 |
Moving Beyond Readability Metrics for Health-Related Text SimplificationKauchak, David, Leroy, Gondy January 2016 (has links)
Limited health literacy is a barrier to understanding health information. Simplifying text can reduce this barrier and possibly other known disparities in health. Unfortunately, few tools exist to simplify text with demonstrated impact on comprehension. By leveraging modern data sources integrated with natural language processing algorithms, we are developing the first semi-automated text simplification tool. We present two main contributions. First, we introduce our evidence-based development strategy for designing effective text simplification software and summarize initial, promising results. Second, we present a new study examining existing readability formulas, which are the most commonly used tools for text simplification in healthcare. We compare syllable count, the proxy for word difficulty used by most readability formulas, with our new metric ‘term familiarity’ and find that syllable count measures how difficult words ‘appear’ to be, but not their actual difficulty. In contrast, term familiarity can be used to measure actual difficulty.
|
5 |
New data-driven approaches to text simplificationŠtajner, Sanja January 2015 (has links)
Many texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This motivated the need for text simplification (TS) which transforms texts into their simpler variants. Given that this is still a relatively new research area, many challenges are still remaining. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them. We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform the existing similar systems. Our experiments in adaptation of those methods to different text genres and target populations report promising results, thus offering one possible solution for dealing with the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They indicate more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual v MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS. It does not require a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction. The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content. Finally, this thesis addresses another important issue in TS which is how to automatically evaluate the performance of TS systems given that access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when enriched with human evaluation of grammaticality and preservation of meaning.
|
6 |
Extração de termos de manuais técnicos de produtos tecnológicos: uma aplicação em Sistemas de Adaptação Textual / Term extraction from technological products instruction manuals: an application in textual adaptation systemsMuniz, Fernando Aurélio Martins 28 April 2011 (has links)
No Brasil, cerca de 68% da população é classificada como leitores com baixos níveis de alfabetização, isto é, possuem o nível de alfabetização rudimentar (21%) ou básico (47%), segundo dados do INAF (2009). O projeto PorSimples utilizou as duas abordagens de Adaptação Textual, a Simplificação e a Elaboração, para ajudar leitores com baixo nível de alfabetização a compreender documentos disponíveis na Web em português do Brasil, principalmente textos jornalísticos. Esta pesquisa de mestrado também se dedicou às duas abordagens acima, mas o foco foi o gênero de textos instrucionais. Em tarefas que exigem o uso de documentação técnica, a qualidade da documentação é um ponto crítico, pois caso a documentação seja imprecisa, incompleta ou muito complexa, o custo da tarefa ou até mesmo o risco de acidentes aumenta muito. Manuais de instrução possuem duas relações procedimentais básicas: a relação gera generation (quando uma ação gera automaticamente uma ação ), e a relação habilita enablement (quando a realização de uma ação permite a realização da ação , mas o agente precisa fazer algo a mais para garantir que irá ocorrer). O projeto aqui descrito, intitulado NorMan, estudou como as relações procedimentais gera e habilita são realizadas em manuais de instruções, dando base para a criação do sistema NorMan Extractor, que implementa um método de extração de termos dedicado ao gênero de textos instrucionais, especificamente aos manuais técnicos. Também foi proposta a adaptação do sistema de autoria de textos simplificados criado no projeto PorSimples o SIMPLIFICA para atender o gênero de textos instrucional. O SIMPLIFICA adaptado usa a lista de candidatos a termo, gerada pelo sistema NorMan Extractor, com duas funções: (a) para auxiliar na identificação de palavras que não devem ser simplificadas pelo método de simplificação léxica baseado em sinônimos, e (b) para gerar uma elaboração léxica para facilitar o entendimento do texto / In Brazil, 68% of the population can be classified as low-literacy readers, i.e., people at the rudimentary (21%) and basic (47%) literacy levels, according to the National Indicator of Functional Literacy (INAF, 2009). The PorSimples project used the two approaches of Textual Adaptation, Simplification and Elaboration, to help readers with low-literacy levels to understand Brazilian Portuguese documents on the Web, mainly newspaper articles. In this research we also used the two approaches above, but the focus was the genre of instructional texts. In tasks requiring the use of technical documentation, the quality of documentation is a critical point, because if the documentation is inaccurate, incomplete or too complex, the cost of the task or even the risk of accidents is greatly increased. Instructions manuals have two basic procedural relationships: the relation generation (by performing one of the actions (), the other () will automatically occur), and the relation enablement (when enables , then the agent needs to do something more than to guarantee that will be done). The project presented here, entitled NorMan, investigated the realization of the relationships between procedural actions in instruction manuals, providing the basis for creating an automatic term extraction method devoted to the genre of instructional texts, specifically technical manuals. We also proposed an adaptation of the authoring system of simplified texts created in the project PorSimples - the SIMPLIFICA - to deals with the genre of instrumental texts. The new SIMPLIFICA uses the list of term candidates, generated by the proposed method, with two functions: (a) to assist in the identification of words that should not be simplified by the lexical simplification method based on synonyms, and (b) to generate a lexical elaboration to facilitate the comprehension of the text
|
7 |
Corpop : um corpus de referência do português popular escrito do BrasilPasqualini, Bianca Franco January 2018 (has links)
Esta tese propõe um corpus do Português popular brasileiro escrito, denominado CorPop, com textos selecionados com base no nível de letramento médio dos leitores do país. As bases teórico-metodológicas do CorPop são interdisciplinares e inserem-se no âmbito dos Estudos da Linguagem e disciplinas afins, como Estudos do Léxico e Linguística de Corpus, Linguística Textual e Psicolinguística, dialogando também com estudos de Processamento de Língua Natural. Desse modo, esta investigação abriga-se na Linha de Pesquisa Lexicografia, Terminologia e Tradução: Relações Textuais do PPG-Letras-UFRGS, e nosso recorte, por isso, tende ao destaque para o Léxico. O desenvolvimento do CorPop deu-se através da compilação de dados sobre o nível de letramento dos leitores brasileiros e das características que poderiam compor um padrão de simplicidade textual em um corpus de textos adequados a esses leitores. Tais dados foram coletados das pesquisas do Indicador de Alfabetismo Funcional (INAF) e Retratos da Leitura no Brasil, além de um questionário com leitores. Os textos selecionados para o CorPop são (1) textos do jornalismo popular do Projeto PorPopular (jornal Diário Gaúcho), consumido maciçamente pelas classes C e D, que é o leitor médio brasileiro; (2) textos e autores mais lidos pelos respondentes das últimas edições da pesquisa Retratos da Leitura no Brasil; (3) coleção “É Só o Começo” (adaptação de clássicos da literatura brasileira para leitores com baixo letramento, adaptação esta realizada por linguistas); (4) textos do jornal Boca de Rua, produzido por pessoas em situação de rua, com baixa escolaridade e baixo letramento; e (5) textos do Diário da Causa Operária, imprensa operária brasileira produzida também por pessoas dentro da faixa média de letramento do país. Realizamos, após a coleta, preparação e processamento dos textos do corpus, uma série de experimentos com a lista bruta de frequências e com a lista de frequências lematizada do CorPop. Os resultados obtidos mostram aplicações promissoras do CorPop em diversas tarefas linguísticas, desde simplificação de textos até uso como vocabulário controlado para redação de paráfrases definitórias em dicionários e comprovam que um corpus pequeno pode ter a mesma validade que um corpus de grandes proporções. / This thesis proposes a corpus of Brazilian popular Portuguese written, called CorPop, with texts selected based on the average level of literacy of the country 's readers. CorPop's theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related disciplines, such as Corpus Lexicon and Linguistics Studies, Textual Linguistics and Psycholinguistics, and also dialogues with Natural Language Processing studies. Thus, this research is housed in the Lexicography, Terminology and Translation Research Line: Textual Relations of PPG-Letras-UFRGS, and our cut, therefore, tends to highlight the Lexicon. The development of CorPop took place through the compilation of data about the level of literacy of Brazilian readers and the characteristics that could compose a standard of textual simplicity in a corpus of texts suitable for these readers. These data were collected from the surveys of the Indicator of Functional Literacy (INAF) and Reading Portraits in Brazil, as well as a questionnaire with readers. The texts selected for CorPop are (1) texts of the popular journalism of the PorPopular Project (newspaper Diário Gaúcho), massively consumed by the C and D classes, which is the average Brazilian reader; (2) texts and authors most read by the respondents of the last editions of the research Retratos da Leitura no Brasil; (3) collection "É Só o Começo" (adaptation of classics from Brazilian literature to readers with low literacy, adaptation by linguists); (4) texts of the newspaper Boca de Rua, produced by street people, with low schooling and low literacy; and (5) texts of the Diário da Causa Operária, the Brazilian working press produced also by people within the average literacy range of the country. After the collection, preparation and processing of the texts of the corpus, a series of experiments with the crude list of frequencies and the list of frequencies typed in CorPop. The results obtained show promising applications of CorPop in several linguistic tasks, such as text simplification and use as controlled vocabulary for writing definitions in dictionaries. Also, CorPop proves that a small corpus can have the same validity as a corpus of large proportions.
|
8 |
Is Simple Wikipedia simple? : – A study of readability and guidelinesIsaksson, Fabian January 2018 (has links)
Creating easy-to-read text is an issue that has traditionally been solved with manual work. But with advancing research in natural language processing, automatic systems for text simplification are being developed. These systems often need training data that is parallel aligned. For several years, simple Wikipedia has been the main source for this data. In the current study, several readability measures has been tested on a popular simplification corpus. A selection of guidelines from simple Wikipedia has also been operationalized and tested. The results imply that the following of guidelines are not greater in simple Wikipedia than in standard Wikipedia. There are however differences in the readability measures. The syntactical structures of simple Wikipedia seems to be less complex than those of standard Wikipedia. A continuation of this study would be to examine other readability measures and evaluate the guidelines not covered within the current work.
|
9 |
Corpop : um corpus de referência do português popular escrito do BrasilPasqualini, Bianca Franco January 2018 (has links)
Esta tese propõe um corpus do Português popular brasileiro escrito, denominado CorPop, com textos selecionados com base no nível de letramento médio dos leitores do país. As bases teórico-metodológicas do CorPop são interdisciplinares e inserem-se no âmbito dos Estudos da Linguagem e disciplinas afins, como Estudos do Léxico e Linguística de Corpus, Linguística Textual e Psicolinguística, dialogando também com estudos de Processamento de Língua Natural. Desse modo, esta investigação abriga-se na Linha de Pesquisa Lexicografia, Terminologia e Tradução: Relações Textuais do PPG-Letras-UFRGS, e nosso recorte, por isso, tende ao destaque para o Léxico. O desenvolvimento do CorPop deu-se através da compilação de dados sobre o nível de letramento dos leitores brasileiros e das características que poderiam compor um padrão de simplicidade textual em um corpus de textos adequados a esses leitores. Tais dados foram coletados das pesquisas do Indicador de Alfabetismo Funcional (INAF) e Retratos da Leitura no Brasil, além de um questionário com leitores. Os textos selecionados para o CorPop são (1) textos do jornalismo popular do Projeto PorPopular (jornal Diário Gaúcho), consumido maciçamente pelas classes C e D, que é o leitor médio brasileiro; (2) textos e autores mais lidos pelos respondentes das últimas edições da pesquisa Retratos da Leitura no Brasil; (3) coleção “É Só o Começo” (adaptação de clássicos da literatura brasileira para leitores com baixo letramento, adaptação esta realizada por linguistas); (4) textos do jornal Boca de Rua, produzido por pessoas em situação de rua, com baixa escolaridade e baixo letramento; e (5) textos do Diário da Causa Operária, imprensa operária brasileira produzida também por pessoas dentro da faixa média de letramento do país. Realizamos, após a coleta, preparação e processamento dos textos do corpus, uma série de experimentos com a lista bruta de frequências e com a lista de frequências lematizada do CorPop. Os resultados obtidos mostram aplicações promissoras do CorPop em diversas tarefas linguísticas, desde simplificação de textos até uso como vocabulário controlado para redação de paráfrases definitórias em dicionários e comprovam que um corpus pequeno pode ter a mesma validade que um corpus de grandes proporções. / This thesis proposes a corpus of Brazilian popular Portuguese written, called CorPop, with texts selected based on the average level of literacy of the country 's readers. CorPop's theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related disciplines, such as Corpus Lexicon and Linguistics Studies, Textual Linguistics and Psycholinguistics, and also dialogues with Natural Language Processing studies. Thus, this research is housed in the Lexicography, Terminology and Translation Research Line: Textual Relations of PPG-Letras-UFRGS, and our cut, therefore, tends to highlight the Lexicon. The development of CorPop took place through the compilation of data about the level of literacy of Brazilian readers and the characteristics that could compose a standard of textual simplicity in a corpus of texts suitable for these readers. These data were collected from the surveys of the Indicator of Functional Literacy (INAF) and Reading Portraits in Brazil, as well as a questionnaire with readers. The texts selected for CorPop are (1) texts of the popular journalism of the PorPopular Project (newspaper Diário Gaúcho), massively consumed by the C and D classes, which is the average Brazilian reader; (2) texts and authors most read by the respondents of the last editions of the research Retratos da Leitura no Brasil; (3) collection "É Só o Começo" (adaptation of classics from Brazilian literature to readers with low literacy, adaptation by linguists); (4) texts of the newspaper Boca de Rua, produced by street people, with low schooling and low literacy; and (5) texts of the Diário da Causa Operária, the Brazilian working press produced also by people within the average literacy range of the country. After the collection, preparation and processing of the texts of the corpus, a series of experiments with the crude list of frequencies and the list of frequencies typed in CorPop. The results obtained show promising applications of CorPop in several linguistic tasks, such as text simplification and use as controlled vocabulary for writing definitions in dictionaries. Also, CorPop proves that a small corpus can have the same validity as a corpus of large proportions.
|
10 |
Corpop : um corpus de referência do português popular escrito do BrasilPasqualini, Bianca Franco January 2018 (has links)
Esta tese propõe um corpus do Português popular brasileiro escrito, denominado CorPop, com textos selecionados com base no nível de letramento médio dos leitores do país. As bases teórico-metodológicas do CorPop são interdisciplinares e inserem-se no âmbito dos Estudos da Linguagem e disciplinas afins, como Estudos do Léxico e Linguística de Corpus, Linguística Textual e Psicolinguística, dialogando também com estudos de Processamento de Língua Natural. Desse modo, esta investigação abriga-se na Linha de Pesquisa Lexicografia, Terminologia e Tradução: Relações Textuais do PPG-Letras-UFRGS, e nosso recorte, por isso, tende ao destaque para o Léxico. O desenvolvimento do CorPop deu-se através da compilação de dados sobre o nível de letramento dos leitores brasileiros e das características que poderiam compor um padrão de simplicidade textual em um corpus de textos adequados a esses leitores. Tais dados foram coletados das pesquisas do Indicador de Alfabetismo Funcional (INAF) e Retratos da Leitura no Brasil, além de um questionário com leitores. Os textos selecionados para o CorPop são (1) textos do jornalismo popular do Projeto PorPopular (jornal Diário Gaúcho), consumido maciçamente pelas classes C e D, que é o leitor médio brasileiro; (2) textos e autores mais lidos pelos respondentes das últimas edições da pesquisa Retratos da Leitura no Brasil; (3) coleção “É Só o Começo” (adaptação de clássicos da literatura brasileira para leitores com baixo letramento, adaptação esta realizada por linguistas); (4) textos do jornal Boca de Rua, produzido por pessoas em situação de rua, com baixa escolaridade e baixo letramento; e (5) textos do Diário da Causa Operária, imprensa operária brasileira produzida também por pessoas dentro da faixa média de letramento do país. Realizamos, após a coleta, preparação e processamento dos textos do corpus, uma série de experimentos com a lista bruta de frequências e com a lista de frequências lematizada do CorPop. Os resultados obtidos mostram aplicações promissoras do CorPop em diversas tarefas linguísticas, desde simplificação de textos até uso como vocabulário controlado para redação de paráfrases definitórias em dicionários e comprovam que um corpus pequeno pode ter a mesma validade que um corpus de grandes proporções. / This thesis proposes a corpus of Brazilian popular Portuguese written, called CorPop, with texts selected based on the average level of literacy of the country 's readers. CorPop's theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related disciplines, such as Corpus Lexicon and Linguistics Studies, Textual Linguistics and Psycholinguistics, and also dialogues with Natural Language Processing studies. Thus, this research is housed in the Lexicography, Terminology and Translation Research Line: Textual Relations of PPG-Letras-UFRGS, and our cut, therefore, tends to highlight the Lexicon. The development of CorPop took place through the compilation of data about the level of literacy of Brazilian readers and the characteristics that could compose a standard of textual simplicity in a corpus of texts suitable for these readers. These data were collected from the surveys of the Indicator of Functional Literacy (INAF) and Reading Portraits in Brazil, as well as a questionnaire with readers. The texts selected for CorPop are (1) texts of the popular journalism of the PorPopular Project (newspaper Diário Gaúcho), massively consumed by the C and D classes, which is the average Brazilian reader; (2) texts and authors most read by the respondents of the last editions of the research Retratos da Leitura no Brasil; (3) collection "É Só o Começo" (adaptation of classics from Brazilian literature to readers with low literacy, adaptation by linguists); (4) texts of the newspaper Boca de Rua, produced by street people, with low schooling and low literacy; and (5) texts of the Diário da Causa Operária, the Brazilian working press produced also by people within the average literacy range of the country. After the collection, preparation and processing of the texts of the corpus, a series of experiments with the crude list of frequencies and the list of frequencies typed in CorPop. The results obtained show promising applications of CorPop in several linguistic tasks, such as text simplification and use as controlled vocabulary for writing definitions in dictionaries. Also, CorPop proves that a small corpus can have the same validity as a corpus of large proportions.
|
Page generated in 0.1022 seconds