11 | Depending on VR: Rule-based Text Simplification Based on Dependency Relations. Johansson, Vida, January 2017.
The amount of text that is written and made available increases all the time. However, it is not readily accessible to everyone. The goal of the research presented in this thesis was to develop a system for automatic text simplification based on dependency relations, develop a set of simplification rules for the system, and evaluate the performance of the system. The system was built on a previous tool and was extended to ensure that it could perform the operations required by the rules included in the rule set. The rule set was developed by manually adapting the rules to a set of training texts. The evaluation method used was a classification task with both objective measures (precision and recall) and a subjective measure (correctness). The performance of the system was compared to that of a system based on constituency relations. The results showed that the current system scored higher on both precision (96% compared to 82%) and recall (86% compared to 53%), indicating that the syntactic information provided by dependency relations is sufficient to perform text simplification. Further evaluation should account for how helpful the text simplification produced by the current system is for target readers.
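As a rough illustration of the kind of operation such a rule system performs, the sketch below splits a relative clause off into its own sentence using a dependency parse. It is a minimal example assuming spaCy and its English model; it is not the tool or rule set developed in the thesis.

```python
# Minimal sketch of one dependency-based simplification rule: splitting off a
# relative clause. Illustrative only; assumes spaCy, not the thesis's system.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def split_relative_clause(sentence: str) -> list[str]:
    """If the sentence contains a relative clause, split it into two sentences."""
    doc = nlp(sentence)
    for token in doc:
        if token.dep_ == "relcl":                 # relative clause modifier
            head = token.head                     # the noun the clause modifies
            clause = list(token.subtree)          # clause tokens, in document order
            main = " ".join(t.text for t in doc
                            if t not in clause and t.dep_ != "punct")
            # Reuse the head noun as the subject, dropping the relative pronoun.
            rest = " ".join(t.text for t in clause if t.tag_ not in ("WDT", "WP"))
            return [main + ".", f"{head.text.capitalize()} {rest}."]
    return [sentence]

print(split_relative_clause("The bill, which passed yesterday, changes the tax code."))
```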
12 | Zjednodušování textu v češtině / Text simplification in Czech. Burešová, Karolína, January 2017.
This thesis deals with text simplification in Czech, in particular with lexical simplification. Several strategies for complex word identification, substitution generation and substitution ranking are implemented and evaluated. Substitution generation is attempted both in a dictionary-based manner and in an embedding-based manner. Some experiments involving people are also presented; they aim at gaining insight into perceived simplicity/complexity and its factors. The experiments conducted and evaluated include sentence pair comparison and manual text simplification. Both the evaluation results of the various strategies and the outcomes of the experiments involving humans are described, and some future work is suggested.
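A minimal sketch of the embedding-based substitution generation and frequency-based ranking strategies mentioned above is given below. The pretrained vector file and the use of gensim and the wordfreq package are assumptions for illustration, not the thesis's actual resources.

```python
# Sketch: propose substitution candidates from word embeddings and rank them by
# corpus frequency (more frequent = assumed simpler). Illustrative assumptions only.
from gensim.models import KeyedVectors
from wordfreq import zipf_frequency

# Hypothetical path to pretrained Czech word vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("cs_vectors.vec")

def generate_substitutions(complex_word: str, topn: int = 10) -> list[str]:
    """Propose nearby words in embedding space as candidate substitutions."""
    if complex_word not in vectors:
        return []
    return [word for word, _ in vectors.most_similar(complex_word, topn=topn)]

def rank_by_simplicity(candidates: list[str]) -> list[str]:
    """Rank candidates by Zipf frequency in Czech: higher frequency first."""
    return sorted(candidates, key=lambda w: zipf_frequency(w, "cs"), reverse=True)

candidates = generate_substitutions("komplikovaný")
print(rank_by_simplicity(candidates))
```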
13 | Extração de termos de manuais técnicos de produtos tecnológicos: uma aplicação em Sistemas de Adaptação Textual / Term extraction from technological products instruction manuals: an application in textual adaptation systems. Fernando Aurélio Martins Muniz, 28 April 2011.
In Brazil, 68% of the population can be classified as low-literacy readers, i.e., people at the rudimentary (21%) or basic (47%) literacy level, according to the National Indicator of Functional Literacy (INAF, 2009). The PorSimples project used the two approaches of Textual Adaptation, Simplification and Elaboration, to help readers with low literacy levels understand Brazilian Portuguese documents on the Web, mainly newspaper articles. In this research we also used the two approaches above, but the focus was the genre of instructional texts. In tasks requiring the use of technical documentation, the quality of the documentation is a critical point: if the documentation is inaccurate, incomplete or too complex, the cost of the task, or even the risk of accidents, increases greatly. Instruction manuals have two basic procedural relationships: the generation relation (performing one action automatically brings about another) and the enablement relation (one action makes another possible, but the agent needs to do something more to guarantee that it will occur). The project presented here, entitled NorMan, investigated how these procedural relationships are realized in instruction manuals, providing the basis for the NorMan Extractor system, which implements a term extraction method devoted to the genre of instructional texts, specifically technical manuals. We also proposed an adaptation of the authoring system for simplified texts created in the PorSimples project, SIMPLIFICA, to deal with the genre of instructional texts.
The adapted SIMPLIFICA uses the list of term candidates generated by the NorMan Extractor with two functions: (a) to assist in the identification of words that should not be simplified by the synonym-based lexical simplification method, and (b) to generate a lexical elaboration to facilitate the comprehension of the text.
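A rough sketch of frequency-based term-candidate extraction from an instruction manual is shown below. It is illustrative only and is not the NorMan Extractor method; the spaCy Portuguese model named in it is an assumption for the example.

```python
# Sketch: collect contiguous noun/adjective spans from a manual and rank them by
# frequency as term candidates. Not the NorMan Extractor; for illustration only.
from collections import Counter
import spacy

nlp = spacy.load("pt_core_news_sm")  # requires: python -m spacy download pt_core_news_sm

def term_candidates(text: str, top_k: int = 20) -> list[tuple[str, int]]:
    """Count contiguous NOUN/PROPN/ADJ spans and return the most frequent ones."""
    doc = nlp(text)
    counts: Counter[str] = Counter()
    span: list[str] = []
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN", "ADJ"):
            span.append(token.text.lower())
        else:
            if span:
                counts[" ".join(span)] += 1
            span = []
    if span:
        counts[" ".join(span)] += 1
    return counts.most_common(top_k)

print(term_candidates("Pressione o botão de energia. O botão de energia liga o aparelho."))
```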
14 | Enhancing Text Readability Using Deep Learning Techniques. Alkaldi, Wejdan, 20 July 2022.
In the information era, reading becomes more important to keep up with the growing amount of knowledge. The ability to read a document varies from person to person depending on their skills and knowledge. It also depends on the readability level of the text and whether it matches the reader's level. In this thesis, we propose a system that uses state-of-the-art machine learning and deep learning techniques to classify and simplify a text while taking the reader's reading level into consideration. The system classifies any text to its equivalent readability level. If the text's readability level is higher than the reader's level, i.e. too difficult to read, the system performs text simplification to meet the desired readability level. The classification and simplification models are trained on data annotated with readability levels from the Newsela corpus. The trained simplification model operates at the sentence level to simplify a given text to match a specific readability level. Moreover, the trained classification model is used to classify additional unlabelled sentences from the Wikipedia Corpus and the Mechanical Turk Corpus in order to enrich the text simplification dataset. The augmented dataset is then used to improve the quality of the simplified sentences. The system generates simplified versions of a text based on the desired readability levels. This can help people with low literacy to read and understand any documents they need. It can also be beneficial to educators who assist readers with different reading levels.
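The classify-then-simplify flow described above could look roughly like the sketch below, which uses a readability formula as a stand-in for the trained classifier. The textstat and NLTK calls are real, but simplify_sentence is a hypothetical placeholder, not the Newsela-trained model from the thesis.

```python
# Sketch of the classify-then-simplify control flow. A readability formula stands in
# for the trained classifier; simplify_sentence is a hypothetical placeholder.
import textstat
from nltk import sent_tokenize   # requires: nltk.download("punkt")

def simplify_sentence(sentence: str, target_grade: float) -> str:
    """Placeholder for a trained sentence-level simplification model."""
    return sentence  # a real system would rewrite the sentence here

def adapt_to_reader(text: str, reader_grade: float) -> str:
    """Keep sentences at or below the reader's grade; rewrite the ones above it."""
    adapted = []
    for sentence in sent_tokenize(text):
        if textstat.flesch_kincaid_grade(sentence) > reader_grade:
            sentence = simplify_sentence(sentence, target_grade=reader_grade)
        adapted.append(sentence)
    return " ".join(adapted)

print(adapt_to_reader("Reading matters. Comprehension presupposes adequate readability.", 6.0))
```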
15 | Complex Word Identification for Swedish. Smolenska, Greta, January 2018.
Complex Word Identification (CWI) is the task of identifying complex words in text data. It is often viewed as a subtask of Automatic Text Simplification (ATS), where the main task is making a complex text simpler. How a text should be simplified depends on the target readers, such as second language learners or people with reading disabilities. In this thesis, we focus on Complex Word Identification for Swedish. First, in addition to exploring existing resources, we collect a new dataset for Swedish CWI. We continue by building several classifiers of Swedish simple and complex words. We then use the findings to analyze the characteristics of lexical complexity in Swedish and English. Our method for collecting training data based on second language learning material has shown positive evaluation scores and resulted in a new dataset for Swedish CWI. Additionally, the complex word classifiers we built have an accuracy at least as good as that of similar systems for English. Finally, the analysis of the selected features confirms the findings of previous studies and reveals some interesting characteristics of lexical complexity.
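A minimal sketch of a feature-based complex word classifier of the kind described above follows: word length and corpus frequency as features, logistic regression as the model. The tiny training set is invented for illustration and is not the thesis's Swedish CWI dataset.

```python
# Sketch of a feature-based CWI classifier. The toy training data is invented;
# wordfreq and scikit-learn are assumptions for the example, not the thesis's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from wordfreq import zipf_frequency

def features(word: str) -> list[float]:
    # Two simple lexical complexity signals: length and Swedish Zipf frequency.
    return [len(word), zipf_frequency(word, "sv")]

# Toy labels: 1 = complex, 0 = simple (illustrative only).
train_words = ["hus", "bok", "springa", "tillgänglighetsanpassning", "implementering"]
train_labels = [0, 0, 0, 1, 1]

clf = LogisticRegression().fit(np.array([features(w) for w in train_words]), train_labels)

for word in ["katt", "myndighetsutövning"]:
    print(word, "complex" if clf.predict([features(word)])[0] else "simple")
```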
16 | Text simplification in Swedish using transformer-based neural networks / Textförenkling på svenska med transformer-baserade neurala nätverk. Söderberg, Samuel, January 2023.
Text simplification involves modifying text to make it easier to read by replacing complex words, altering sentence structure, and/or removing unnecessary information. It can be used to make text more accessible to a larger audience. While research on text simplification exists for Swedish, the use of neural networks in the field is limited. Neural networks require large-scale, high-quality datasets, but such datasets are scarce for text simplification in Swedish. This study investigates the acquisition of datasets through paraphrase mining from web snapshots and through translation of existing English text simplification datasets into Swedish, and aims to assess the performance of neural network models trained on such datasets.
Three datasets with complex-to-simple sequence pairs were created: one by mining paraphrases from web data, another by translating a dataset from English to Swedish, and a third by combining the mined and translated datasets into one. These datasets were then used to fine-tune a BART neural network model pre-trained on large amounts of Swedish data. An evaluation was conducted through manual examination and categorization of output, and through automated assessment using the SARI and LIX metrics. Two different test sets were evaluated, one translated from English and one manually constructed from Swedish texts. The automatic evaluation produced SARI scores close to, but not as high as, those of similar research on text simplification in English. In terms of LIX scores, the models perform on par with or better than existing research on automatic text simplification in Swedish. The manual evaluation revealed that the model trained on the mined paraphrases generally produced short sequences with many alterations compared to the original, while the model trained on the translated dataset often produced unchanged sequences or sequences with few alterations. However, the model trained on the mined dataset produced many more unusable sequences, either with corrupted Swedish or with altered meaning, than the model trained on the translated dataset. The model trained on the combined dataset reached a middle ground in these two regards, producing fewer unusable sequences than the model trained on the mined dataset and fewer unchanged sequences than the model trained on the translated dataset. Many sequences were successfully simplified by the three models, but the manual evaluation revealed that a significant portion of the generated sequences remains unchanged or unusable, highlighting the need for further research, exploration of methods, and tool refinement.
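For reference, the LIX readability measure used in the evaluation above is the average sentence length plus the percentage of long words (more than six letters); a small sketch with a simple regex-based sentence split follows.

```python
# Sketch of the LIX readability measure: words per sentence plus the percentage of
# words longer than six letters. Sentence splitting here is a rough regex approximation.
import re

def lix(text: str) -> float:
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

print(round(lix("Detta är en enkel mening. Tillgängligheten förbättras avsevärt."), 1))
```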
17 | Context-aware Swedish Lexical Simplification: Using pre-trained language models to propose contextually fitting synonyms / Kontextmedveten lexikal förenkling på svenska: Användningen av förtränade språkmodeller för att föreslå kontextuellt passande synonymer. Graichen, Emil, January 2023.
This thesis presents the development and evaluation of context-aware Lexical Simplification (LS) systems for the Swedish language. In total, three LS models, LäsBERT, LäsBERT-baseline, and LäsGPT, were created and evaluated on a newly constructed Swedish LS evaluation dataset. The LS systems demonstrated promising potential in aiding audiences with reading difficulties by providing context-aware word replacements. While there were areas for improvement, particularly in complex word identification, the systems showed agreement with human annotators on word replacements. The effects of fine-tuning a BERT model for substitution generation on easy-to-read texts were explored, indicating no significant difference in the number of replacements between fine-tuned and non-fine-tuned versions. Both versions performed similarly in terms of synonymous and simplifying replacements, although the fine-tuned version exhibited slightly reduced performance compared to the baseline model. An important contribution of this thesis is the creation of an evaluation dataset for Lexical Simplification in Swedish. The dataset was automatically collected and manually annotated. Evaluators assessed the quality, coverage, and complexity of the dataset. Results showed that the dataset had high quality and perceived good coverage. Although the complexity of the complex words was perceived to be low, the dataset provides a valuable resource for evaluating LS systems and advancing research in Swedish Lexical Simplification. Finally, a more transparent and reader-empowering approach to Lexical Simplification is proposed. This new approach embraces the challenges of contextual synonymy and reduces the number of failure points in the conventional LS pipeline, increasing the chances of developing a fully meaning-preserving LS system. Links to different parts of the project: the Lexical Simplification dataset at https://github.com/emilgraichen/SwedishLSdataset and the lexical simplification algorithm at https://github.com/emilgraichen/SwedishLexicalSimplifier
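A minimal sketch of using a pretrained masked language model to propose context-aware replacement candidates, similar in spirit to the approach above, is given below. The KB/bert-base-swedish-cased checkpoint is assumed for illustration and is not necessarily the model used in the thesis.

```python
# Sketch: mask the complex word and let a Swedish BERT suggest contextual candidates.
# The checkpoint is an assumption for the example, not the thesis's trained model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

def candidates_in_context(sentence: str, complex_word: str, top_k: int = 5) -> list[str]:
    """Replace the complex word with the mask token and collect model suggestions."""
    masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)
    suggestions = fill_mask(masked, top_k=top_k)
    return [s["token_str"].strip() for s in suggestions
            if s["token_str"].strip() != complex_word]

print(candidates_in_context("Beslutet var mycket komplicerat att förstå.", "komplicerat"))
```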
18 | Controllable sentence simplification in Swedish: Automatic simplification of sentences using control prefixes and mined Swedish paraphrases. Monsen, Julius, January 2023.
The ability to read and comprehend text is essential in everyday life. Some people, including individuals with dyslexia and cognitive disabilities, may experience difficulties with this. Thus, it is important to make textual information accessible to diverse target audiences. Automatic Text Simplification (ATS) techniques aim to reduce the linguistic complexity in texts to facilitate readability and comprehension. However, existing ATS systems often lack customization to specific user needs, and simplification data for languages other than English is limited. This thesis addressed ATS in a Swedish context, building upon novel methods that provide more control over the simplification generation process, enabling user customization. A dataset of Swedish paraphrases was mined from a large amount of text data. ATS models were then trained on this dataset utilizing prefix-tuning with control prefixes. Two sets of text attributes and their effects on performance were explored for controlling the generation. The first had been used in previous research, and the second was extracted in a data-driven way from existing text complexity measures. The trained ATS models for Swedish and additional models for English were evaluated and compared using the SARI and BLEU metrics. The results for the English models were consistent with results from previous research using controllable generation mechanisms, although slightly lower. The Swedish models provided significant improvements over the baseline (a fine-tuned BART model) and over previous Swedish ATS results. These results highlight the effectiveness of pairing paraphrase data with controllable generation mechanisms for simplification. Furthermore, the two sets of attributes provided very similar results, suggesting that both manage to capture aspects of simplification. The process of mining paraphrases, selecting control attributes and other methodological implications are discussed, leading to suggestions for future research.
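As an aside on the evaluation, SARI can be computed as sketched below, assuming the Hugging Face evaluate package and its sari metric; the example sentences are invented and are not from the thesis data.

```python
# Sketch: computing SARI with the Hugging Face `evaluate` package (an assumption for
# this example, not necessarily the tooling used in the thesis).
import evaluate

sari = evaluate.load("sari")
result = sari.compute(
    sources=["About 95 species are currently accepted."],
    predictions=["About 95 species are currently known."],
    references=[["About 95 species are currently known.",
                 "About 95 species are now accepted."]],
)
print(result)  # e.g. {'sari': ...}
```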
19 | Text Simplification and Keyphrase Extraction for Swedish. Lindqvist, Ellinor, January 2019.
Attempts have been made in Sweden to increase the readability of texts addressed to the public, and ongoing projects are still being conducted by disability associations, private companies and Swedish authorities. In this thesis project, we explore automatic approaches to increasing readability through text simplification and keyphrase extraction, with the goal of facilitating text comprehension and readability for people with reading difficulties. A combination of handwritten rules and monolingual machine translation was used to simplify the syntactic and lexical content of Swedish texts, and noun phrases were extracted to provide the reader with a short summary of the textual content. A user evaluation was conducted to compare the original and the simplified version of the same text. Several texts and their simplified versions were also evaluated using established readability metrics. Although a manual evaluation of the result showed that the implemented rules generally worked as intended on the sentences they targeted, the results from the user evaluation and the readability metrics did not show improvements. We believe that further additions to the rule set, targeting a wider range of linguistic structures, have the potential to improve the results.
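A minimal sketch of extracting noun phrases as keyphrase candidates and ranking them by frequency is given below. It assumes the spaCy Swedish pipeline and its noun-chunk support; it is not the tooling used in the thesis.

```python
# Sketch: noun phrases as keyphrase candidates, ranked by frequency. Assumes the
# spaCy Swedish model and its noun-chunk iterator; illustrative only.
from collections import Counter
import spacy

nlp = spacy.load("sv_core_news_sm")  # requires: python -m spacy download sv_core_news_sm

def keyphrases(text: str, top_k: int = 5) -> list[tuple[str, int]]:
    """Count lemmatized noun chunks and return the most frequent ones."""
    doc = nlp(text)
    counts = Counter(chunk.lemma_.lower() for chunk in doc.noun_chunks)
    return counts.most_common(top_k)

print(keyphrases("Myndigheten publicerar texter. Texterna ska vara lätta att läsa."))
```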
20 | Analyse contrastive des verbes dans des corpus médicaux et création d’une ressource verbale de simplification de textes / Automatic analysis of verbs in texts of medical corpora: theoretical and applied issues. Wandji Tchami, Ornella, 26 February 2018.
With the evolution of Web technology, healthcare documentation is becoming increasingly abundant and accessible to all, especially to patients, who have access to a large amount of health information. Unfortunately, ease of access to medical information does not guarantee its correct understanding by the intended audience, in this case non-experts. Our PhD work aims at creating a resource for the simplification of medical texts, based on a syntactico-semantic analysis of verbs in four French medical corpora that are distinguished according to the level of expertise of their authors and that of the target audiences. The resource created in the present thesis contains 230 syntactico-semantic patterns of verbs (called pss), aligned with their non-specialized equivalents. The semi-automatic method applied for the analysis of verbs is based on four fundamental tasks: the syntactic annotation of the corpora, carried out with the Cordial parser (Laurent et al., 2009); the semantic annotation of verb arguments, based on the semantic categories of the French version of the medical terminology Snomed International (Côté, 1996); the acquisition of syntactico-semantic patterns of verbs; and the contrastive analysis of the verbs' behaviour in the different corpora.
The pss acquired at the end of this process undergo an evaluation (by three teams of medical experts) which leads to the selection of the candidates constituting the nomenclature of our text simplification resource. These pss are then aligned with their non-specialized equivalents; this alignment leads to the creation of the simplification resource, which is the main result of our PhD study. The content of the resource was evaluated by two groups of people: linguists and non-linguists. The results show that the simplification of pss makes it easier for non-experts to understand the meaning of verbs used in a specialized way, especially when certain parameters are met.
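As an illustration only, a syntactico-semantic verb pattern (pss) aligned with a non-specialized equivalent could be represented as sketched below; the field names and the example entry are assumptions, not the actual format of the resource.

```python
# Sketch of a possible representation for a pss aligned with a lay paraphrase.
# Field names and the example entry are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class VerbPattern:
    verb: str                      # specialized verb as used in the medical corpus
    argument_categories: tuple     # semantic categories of its arguments (e.g. from Snomed)
    lay_equivalent: str            # simpler, non-specialized paraphrase

pattern = VerbPattern(
    verb="administrer",
    argument_categories=("personne", "substance pharmaceutique"),
    lay_equivalent="donner un médicament à quelqu'un",
)
print(f"{pattern.verb}({', '.join(pattern.argument_categories)}) -> {pattern.lay_equivalent}")
```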