11

An eye-tracking study on synonym replacement / En ögonrörelsestudie på synonymutbyte

Svensson, Cassandra January 2015 (has links)
As the amount of information increases, the need for automatic text simplification also increases. There are several strategies for this, and this thesis has studied two basic synonym replacement strategies. The first one is called word length and is about always choosing a shorter synonym if possible. The second one is called word frequency and is about always choosing a more frequent synonym if possible. Three different versions of them were tried. The first one was about simply choosing the shortest or most frequent synonym. The second was about only choosing a synonym if it was considerably shorter or more frequent. The last was about only choosing a synonym if it met the requirements for being replaced and was on synonym level 5. Statistical analysis of the data revealed no significant difference, but small trends showed that always choosing a more frequent synonym that is of level 5 seemed to make the text a bit easier.
12

Automatic Text Simplification via Synonym Replacement / Automatiskt textförenkling genom synonymutbyte

Keskisärkkä, Robin January 2012 (has links)
In this study, automatic lexical simplification via synonym replacement in Swedish was investigated using three different strategies for choosing alternative synonyms: based on word frequency, based on word length, and based on level of synonymy. These strategies were evaluated in terms of standardized readability metrics for Swedish, average word length, proportion of long words, and in relation to the ratio of errors (type A) and number of replacements. The effect of replacements on different genres of texts was also examined. The results show that replacement based on word frequency and word length can improve readability in terms of established metrics for Swedish texts for all genres, but that the risk of introducing errors is high. Attempts were made at identifying criteria thresholds that would decrease the ratio of errors, but no general thresholds could be identified. In a final experiment, word frequency and level of synonymy were combined using predefined thresholds. When more than one word passed the thresholds, either word frequency or level of synonymy was prioritized. The strategy was significantly better than word frequency alone when looking at all texts and prioritizing level of synonymy. Both prioritizing frequency and prioritizing level of synonymy were significantly better for the newspaper texts. The results indicate that synonym replacement on a one-to-one word level is very likely to produce errors. Automatic lexical simplification should therefore not be regarded as a trivial task, which is too often the case in the research literature. In order to evaluate the true quality of the texts it would be valuable to take into account the specific reader. A simplified text that contains some errors or fails to appreciate subtle differences in terminology can still be very useful if the original text is too difficult for the unassisted reader to comprehend.
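Both abstracts above describe the same basic one-to-one replacement mechanism. As a rough illustration only, a frequency- or length-based replacer with a simple threshold might look like the following Python sketch; the synonym lexicon, frequency counts, and the ratio parameter are invented placeholders, not the resources or settings used in either thesis.

SYNONYMS = {  # hypothetical Swedish synonym lexicon
    "konsumera": ["äta", "förtära"],
    "automobil": ["bil"],
}
FREQUENCY = {"konsumera": 30, "äta": 9800, "förtära": 15, "automobil": 12, "bil": 4500}

def replace(word, strategy="frequency", ratio=1.0):
    # Return a synonym only if it beats `word` under the chosen strategy and threshold.
    candidates = SYNONYMS.get(word, [])
    if not candidates:
        return word
    if strategy == "frequency":
        best = max(candidates, key=lambda w: FREQUENCY.get(w, 0))
        if FREQUENCY.get(best, 0) >= ratio * FREQUENCY.get(word, 0):
            return best
    elif strategy == "length":
        best = min(candidates, key=len)
        if len(best) * ratio <= len(word):
            return best
    return word

print(" ".join(replace(w) for w in "jag vill konsumera mat".split()))

Raising ratio above 1 corresponds to the stricter "only replace if considerably shorter or more frequent" variants evaluated in these studies.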
13

The Importance of Being Integrative: A Remarkable Case of Synonymy in the Genus Viridiscus (Heterotardigrada: Echiniscidae)

Gąsiorek, Piotr, Vončina, Katarzyna, Nelson, Diane R., Michalczyk, Łukasz 20 November 2021 (has links)
There are two predominant sources of taxonomically useful morphological variability in the diverse tardigrade family Echiniscidae: the internal structure and surface sculpture of the cuticular plates covering the dorsum (sculpturing) and the arrangement and morphology of the trunk appendages (chaetotaxy). However, since the appendages often exhibit intraspecific variation (they can be reduced or can develop asymmetrically), sculpturing has been considered more stable at the species level and descriptions of new echiniscid species based solely on morphology are still being published. Here, we present a case study in which a detailed analysis of the morphology and multiple genetic markers of several species of the genus Viridiscus shows that cuticular sculpture may also exhibit considerable intraspecific variation and lead to false taxonomic conclusions. In a population collected from the eastern Nearctic, in the type locality of the recently described species V. miraviridis, individuals with transitional morphotypes between those reported for V. viridissimus and V. miraviridis were found. Importantly, all morphotypes within the viridissimus-miraviridis spectrum were grouped in a single monospecific clade according to rapidly evolving markers (ITS-1, ITS-2 and COI). Given the morphological and genetic evidence, we establish V. miraviridis as a junior synonym of V. viridissimus. This study explicitly demonstrates that a lack of DNA data associated with morphological descriptions of new taxa jeopardizes the efforts to unclutter tardigrade systematics. Additionally, V. perviridis and V. viridissimus are reported from Lâm Đồng Province in southern Vietnam, which considerably broadens their known geographic ranges.
14

Exploring Automatic Synonym Generation for Lexical Simplification of Swedish Electronic Health Records

Jänich, Anna January 2023 (has links)
Electronic health records (EHRs) are used in Sweden's healthcare systems to store patients' medical information. Patients in Sweden have the right to access and read their health records. Unfortunately, the language used in EHRs is very complex and presents a challenge for readers who lack medical knowledge. Simplifying the language used in EHRs could facilitate the transfer of information between medical staff and patients. This project investigates the possibility of generating Swedish medical synonyms automatically. These synonyms are intended to be used in future systems for lexical simplification that can enhance the readability of Swedish EHRs and simplify medical terminology. Current publicly available Swedish corpora that provide synonyms for medical terminology are insufficient in size to be utilized in a system for lexical simplification. To overcome the obstacle of insufficient corpora, machine learning models are trained to generate synonyms and terms that convey medical concepts in a more understandable way. With the purpose of establishing a foundation for analyzing complex medical terms, a simple mechanism for Complex Word Identification (CWI) is implemented. The mechanism relies on matching strings and substrings from a pre-existing corpus containing hand-curated medical terms in Swedish. To find a suitable strategy for generating medical synonyms automatically, seven different machine learning models are queried for synonym suggestions for 50 complex sample terms. To explore the effect of different input data, we trained our models on different datasets with varying sizes. Three of the seven models are based on BERT and four of them are based on Word2Vec. For each model, results for the 50 complex sample terms are generated and raters with medical knowledge are asked to assess whether the automatically generated suggestions could be considered synonyms. The results vary between the different models and seem to be connected to the amount and quality of the data they have been trained on. Furthermore, the raters involved in judging the synonyms exhibit great disagreement, revealing the complexity and subjectivity of the task to find suitable and widely accepted medical synonyms. The method and models applied in this project do not succeed in creating a stable source of suitable synonyms. The chosen BERT approach based on Masked Language Modelling cannot reliably generate suitable synonyms due to the limitation of generating one term per synonym suggestion only. The Word2Vec models demonstrate some weaknesses due to the lack of context consideration. Despite the fact that the current performance of our models in generating automatic synonym suggestions is not entirely satisfactory, we have observed a promising number of accurate suggestions. This gives us reason to believe that with enhanced training and a larger amount of input data consisting of Swedish medical text, the models could be improved and eventually effectively applied.
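As a loose illustration of the string- and substring-based Complex Word Identification mechanism described above (not the project's actual implementation), such a check could be sketched as follows; the term list is a made-up placeholder for a hand-curated Swedish medical vocabulary.

MEDICAL_TERMS = {"hypertoni", "trombocyter", "anamnes"}  # hypothetical curated entries

def is_complex(token):
    token = token.lower()
    # A token counts as complex if it matches a known medical term exactly
    # or contains one as a substring (e.g. compounds such as "hypertonibehandling").
    return any(term == token or term in token for term in MEDICAL_TERMS)

text = "Patienten har känd hypertoni och normala trombocyter".split()
print([w for w in text if is_complex(w)])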
15

Scribal composition : Malachi as a test-case

Lear, Sheree January 2014 (has links)
The Hebrew Bible is the product of scribes. Whether copying, editing, conflating, adapting, or authoring, these ancient professionals were responsible for the various text designs, constructions and text-types that we have today. This thesis seeks to investigate the many practices employed by ancient scribes in literary production, or, more aptly, scribal composition. An investigation of scribal composition must incorporate inquiry into both synchronic and diachronic aspects of a text; a synchronic viewpoint can clarify diachronic features of the text and a diachronic viewpoint can clarify synchronic features of the text. To understand the text as the product of scribal composition requires recognition that the ancient scribe had a communicative goal when he engaged in the different forms of scribal composition (e.g. authoring, redacting, etc.). This communicative goal was reached through the scribal composer's implementation of various literary techniques. By tracing the reception of a text, it is possible to demonstrate when a scribal composer successfully reached his communicative goal. Using Malachi as a test-case, three autonomous yet complementary chapters will illustrate how investigating the text as the product of scribal composition can yield new and important insights. Chapter 2: Mal 2.10-16 focuses on a particularly difficult portion of Malachi (2.10-16), noting patterns amongst the texts reused in the pericope. These patterns give information about the ancient scribe's view of scripture and about his communicative goal. Chapter 3: Wordplay surveys Malachi for different types of the wordplay. The chapter demonstrates how a poetic feature such as wordplay, generally treated as a synchronic element, can also have diachronic implications. Chapter 4: Phinehas, he is Elijah investigates the reception of Malachi as a finished text. By tracing backwards a tradition found throughout later Jewish literature, it is evident that the literary techniques employed by the composer made his text successfully communicative.
16

Improving RDF data with data mining

Abedjan, Ziawasch January 2014 (has links)
Linked Open Data (LOD) comprises very many and often large public data sets and knowledge bases. Those datasets are mostly presented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications. Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation, a user-based approach where a user selects the entity to be amended with new facts and a data-driven approach where an algorithm discovers entities that have to be amended with missing facts. As knowledge bases constantly grow and evolve, another approach to improve the usage of RDF data is to improve existing ontologies. Here, we present an association rule based approach to reconcile ontology and data. Interlacing different mining configurations, we derive an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation. We provide a wide range of experiments on real world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology. / Linked Open Data (LOD) umfasst viele und oft sehr große öffentlichen Datensätze und Wissensbanken, die hauptsächlich in der RDF Triplestruktur bestehend aus Subjekt, Prädikat und Objekt vorkommen. Dabei repräsentiert jedes Triple einen Fakt. Unglücklicherweise erfordert die Heterogenität der verfügbaren öffentlichen Daten signifikante Integrationsschritte bevor die Daten in Anwendungen genutzt werden können. Meta-Daten wie ontologische Strukturen und Bereichsdefinitionen von Prädikaten sind zwar wünschenswert und idealerweise durch eine Wissensbank verfügbar. Jedoch sind Wissensbanken im Kontext von LOD oft unvollständig oder einfach nicht verfügbar. Deshalb ist es nützlich automatisch Meta-Informationen, wie ontologische Abhängigkeiten, Bereichs- und Domänendefinitionen und thematische Assoziationen von Ressourcen generieren zu können.
Eine neue und vielversprechende Technik um solche Daten zu untersuchen basiert auf dem Entdecken von Assoziationsregeln, welche ursprünglich für Verkaufsanalysen in transaktionalen Datenbanken angewendet wurde. Wir haben eine Adaptierung dieser Technik auf RDF Daten entworfen und stellen das Konzept der Mining Konfigurationen vor, welches uns befähigt in RDF Daten auf unterschiedlichen Weisen Muster zu erkennen. Verschiedene Konfigurationen erlauben uns Schema- und Wertbeziehungen zu erkennen, die für interessante Anwendungen genutzt werden können. In dem Sinne stellen wir assoziationsbasierte Verfahren für ein Prädikatvorschlagsverfahren, Datenvervollständigung, Ontologieverbesserung und Anfrageerleichterung vor. Das Vorschlagen von Prädikaten behandelt das Problem der inkonsistenten Verwendung von Ontologien, indem einem Benutzer, der einen neuen Fakt einem RDF-Datensatz hinzufügen will, eine sortierte Liste von passenden Prädikaten vorgeschlagen wird. Eine Kombinierung von verschiedenen Konfigurationen erweitert dieses Verfahren, sodass automatisch komplett neue Fakten für eine Wissensbank generiert werden. Hierbei stellen wir zwei Verfahren vor: ein nutzergesteuertes Verfahren, bei dem ein Nutzer die Entität aussucht, die erweitert werden soll, und einen datengesteuerten Ansatz, bei dem ein Algorithmus selbst die Entitäten aussucht, die mit fehlenden Fakten erweitert werden. Da Wissensbanken stetig wachsen und sich verändern, ist ein anderer Ansatz, um die Verwendung von RDF Daten zu erleichtern, die Verbesserung von Ontologien. Hierbei präsentieren wir ein Assoziationsregeln-basiertes Verfahren, das Daten und zugrundeliegende Ontologien zusammenführt. Durch die Verflechtung von unterschiedlichen Konfigurationen leiten wir einen neuen Algorithmus her, der gleichbedeutende Prädikate entdeckt. Diese Prädikate können benutzt werden um Ergebnisse einer Anfrage zu erweitern oder einen Nutzer während einer Anfrage zu unterstützen. Für jede unserer vorgestellten Anwendungen präsentieren wir eine große Auswahl an Experimenten auf Realweltdatensätzen. Die Experimente und Evaluierungen zeigen den Mehrwert von Assoziationsregeln-Generierung für die Integration und Nutzbarkeit von RDF Daten und bestätigen die Angemessenheit unserer konfigurationsbasierten Methodologie um solche Regeln herzuleiten.
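To make the idea of a mining configuration concrete, here is a rough, hypothetical sketch (not the thesis implementation): each subject is treated as a transaction and its predicates as items, so that association rules between predicates can be computed, for example to drive predicate auto-completion. The triples and the rule shown are toy examples.

from collections import defaultdict

triples = [  # (subject, predicate, object) -- invented example facts
    ("Berlin", "population", "3.6M"), ("Berlin", "country", "Germany"),
    ("Paris", "population", "2.1M"), ("Paris", "country", "France"),
    ("Paris", "mayor", "Anne Hidalgo"),
]

# One mining configuration: subject = transaction, its predicates = items.
transactions = defaultdict(set)
for s, p, _ in triples:
    transactions[s].add(p)

def confidence(antecedent, consequent):
    # Confidence of the association rule {antecedent} -> {consequent}.
    with_a = [items for items in transactions.values() if antecedent in items]
    return sum(consequent in items for items in with_a) / len(with_a) if with_a else 0.0

# A high-confidence rule suggests "country" as a predicate to auto-complete
# for any subject that already has "population".
print(confidence("population", "country"))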
17

Application and Evaluation of Unified Medical Language System Resources to Facilitate Patient Information Acquisition through Enhanced Vocabulary Coverage

Mills, Eric M. III 26 April 1998 (has links)
Two broad themes of this research are, 1) to develop a generalized framework for studying the process of patient information acquisition and 2) to develop and evaluate automated techniques for identifying domain-specific vocabulary terms contained in, or missing from, a standardized controlled medical vocabulary with emphasis on those terms necessary for representing the canine physical examination. A generalized framework for studying the process of patient information acquisition is addressed by the Patient Information Acquisition Model (PIAM). PIAM illustrates the decision-to-perception chain which links a clinician's decision to collect information, either personally or through another, with the perception of the resulting information. PIAM serves as a framework for a systematic approach to identifying causes of missing or inaccurate information. The vocabulary studies in this research were conducted using free-text with two objectives in mind, 1) develop and evaluate automated techniques for identifying canine physical examination terms contained in the Systematized Nomenclature of Medicine and Veterinary Medicine (SNOMED), version 3.3 and 2) develop and evaluate automated techniques for identifying canine physical examination terms not documented in the 1997 release of the Unified Medical Language System (UMLS). Two lexical matching techniques for identifying SNOMED concepts contained in free-text were evaluated, 1) lexical matching using SNOMED version 3.3 terms alone and 2) Metathesaurus-enhanced lexical matching. Metathesaurus-enhanced lexical matching utilized non-SNOMED terms from the source vocabularies of the Metathesaurus of the Unified Medical Language System to identify SNOMED concepts in free-text using links among synonymous terms contained in the Metathesaurus. Explicit synonym disagreement between the Metathesaurus and its source vocabularies was identified during the Metathesaurus-enhanced lexical matching studies. Explicit synonym disagreement occurs, 1) when terms within a single concept group in a source vocabulary are mapped to multiple Metathesaurus concepts, and 2) when terms from multiple concept groups in a source vocabulary are mapped to a single Metathesaurus concept. Five causes of explicit synonym disagreement between a source vocabulary and the Metathesaurus were identified in this research, 1) errors within a source vocabulary, 2) errors within the Metathesaurus, 3) errors in mapping between the Metathesaurus and a source vocabulary, 4) systematic differences in vocabulary management between the Metathesaurus and a source vocabulary, and 5) differences regarding synonymy among domain experts, based on perspective or context. Three approaches to reconciling differences among domain experts are proposed. First, document which terms are involved. Second, provide a mechanism for selecting either vocabulary-based or Metathesaurus-based synonymy. Third, assign a "basis of synonymy" attribute to each set of synonymous terms in order to identify the perspective or context of synonymy explicitly. The second objective, identifying canine physical examination terms not documented in the 1997 release of the UMLS was accomplished using lexical matching, domain-specific free-text, the Metathesaurus and the SPECIALIST Lexicon. Terms contained in the Metathesaurus and SPECIALIST Lexicon were removed from free-text and the remaining character strings were presented to domain experts along with the original sections of text for manual review. / Ph. D.
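A minimal sketch of the Metathesaurus-enhanced lexical matching idea described above, under the assumption of toy data: the vocabulary entries and the code below are invented placeholders, not actual SNOMED or UMLS content. Free-text phrases are first matched directly against SNOMED terms and otherwise resolved through synonym links contributed by other source vocabularies.

SNOMED = {"cardiac murmur": "F-31120"}               # hypothetical term -> code
SYNONYM_LINKS = {"heart murmur": "cardiac murmur"}   # non-SNOMED term -> SNOMED term

def match(phrase):
    phrase = phrase.lower().strip()
    if phrase in SNOMED:                  # plain lexical match against SNOMED
        return SNOMED[phrase]
    if phrase in SYNONYM_LINKS:           # Metathesaurus-style synonym link
        return SNOMED[SYNONYM_LINKS[phrase]]
    return None                           # candidate for a "missing term" report

print(match("Heart murmur"))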
18

Word2vec2syn : Synonymidentifiering med Word2vec / Word2vec2syn : Synonym Identification using Word2vec

Pettersson, Tove January 2019 (has links)
Inom NLP (eng. natural language processing) är synonymidentifiering en av de språkvetenskapliga utmaningarna som många antar. Fodina Language Technology AB är ett företag som skapat ett verktyg, Termograph, ämnad att samla termer inom företag och hålla den interna språkanvändningen konsekvent. En metodkombination bestående av språkteknologiska strategier utgör synonymidentifieringen och Fodina önskar ett större täckningsområde samt mer dynamik i framtagningsprocessen. Därav syftade detta arbete till att ta fram en ny metod, utöver metodkombinationen, för just synonymidentifiering. En färdigtränad Word2vec-modell användes och den inbyggda funktionen för cosinuslikheten användes för att få fram synonymer och skapa kluster. Modellen validerades, testades och utvärderades i förhållande till metodkombinationen. Valideringen visade att modellen skattade inom ett rimligt mänskligt spann i genomsnitt 60,30 % av gångerna och Spearmans korrelation visade på en signifikant stark korrelation. Testningen visade att 32 % av de bearbetade klustren innehöll matchande synonymförslag. Utvärderingen visade att i de fall som förslagen inte matchade så var modellens synonymförslag korrekta i 5,73 % av fallen jämfört med 3,07 % för metodkombinationen. Den interna reliabiliteten för utvärderarna visade på en befintlig men svag enighet, Fleiss Kappa = 0,19, CI(0,06, 0,33). Trots viss osäkerhet i resultaten påvisas ändå möjligheter för vidare användning av word2vec-modeller inom Fodinas synonymidentifiering. / One of the main challenges in the field of natural language processing (NLP) is synonym identification. Fodina Language Technology AB is the company behind the tool, Termograph, that aims to collect terms and provide a consistent language within companies. A combination of multiple methods from the field of language technology constitutes the synonym identification and Fodina would like to improve the area of coverage and increase the dynamics of the working process. The focus of this thesis was therefore to evaluate a new method for synonym identification beyond the already used combination. Initially, a pre-trained Word2vec model was used, and for the synonym identification the built-in function for cosine similarity was applied in order to create clusters. The model was validated, tested and evaluated relative to the combination. The validation indicated that the model made estimations within a fair human-based range on average 60.30% of the time, and Spearman's correlation indicated a strong significant correlation. The testing showed that 32% of the processed synonym clusters contained matching synonym suggestions. The evaluation showed that the synonym suggestions from the model were correct in 5.73% of all cases, compared to 3.07% for the combination, in the cases where the clusters did not match. The interrater reliability indicated a slight agreement, Fleiss’ Kappa = 0.19, CI(0.06, 0.33). Despite uncertainty in the results, opportunities for further use of Word2vec models within Fodina’s synonym identification are nevertheless demonstrated.
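The core mechanism here, nearest neighbours by cosine similarity in a pre-trained Word2vec space, can be sketched roughly as follows using gensim; the model file, similarity cutoff, and example term are placeholders rather than the thesis's or Fodina's actual configuration.

from gensim.models import KeyedVectors

# Hypothetical pre-trained Swedish vectors; any word2vec-format file would do.
kv = KeyedVectors.load_word2vec_format("swedish_vectors.bin", binary=True)

def synonym_candidates(term, cutoff=0.65, topn=10):
    # Return neighbours whose cosine similarity to `term` exceeds `cutoff`.
    if term not in kv:
        return []
    return [(word, sim) for word, sim in kv.most_similar(term, topn=topn) if sim >= cutoff]

print(synonym_candidates("bil"))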
19

以語料庫為本之近似詞教學成效之研究:以台灣大學生為例 / The Effect of Teaching Near-synonyms to Taiwan EFL University Students: A Corpus-based Approach

陳聖其, Chen, Sheng Chi Unknown Date (has links)
台灣英語教育多以考試取向為主,許多教師進行英語字彙指導時採用填鴨式教學,致使學生無法於新的情境靈活使用字彙。本研究旨在於探究以語料庫為本之教學對於台灣大學生在英語近似詞學習成效的影響,以台北市某一所大學86位英語學習背景及能力相似之大一生為研究對象。研究人數均分成兩班進行教學實驗,一班為實驗組,以資料觀察法進行教學,另一班為對照組,以傳統形式教學為主,每週一次五十分鐘,共進行十週。資料蒐集包含近似詞學習成就測驗前、後測,並且依據研究對象於實驗教學結束後接受語料觀察教學法回饋問卷,蒐集研究對象對於語料觀察法之反應與感知,進行量化分析。最後,透過訪談高分組和低分組學生,蒐集其質性資料進行研究探討哪些因素會影響不同英語能力學生對於資料觀察法的意願與需求。本研究發現如下: 一、近似詞教學有助於提升台灣大學生的英語字彙能力。兩組教學均在後測有進步。但就後測成績來說,實驗組顯著優於控制組。資料觀察法之近似詞教學均較傳統教學法更能有效提升學生的英語字彙能力。 二、在不同程度的學生學習成效上,高、低分組學生均在後測成績有進步。對於高分組而言,實驗組後測成績顯著優於控制組後測。但對於控制組而言,實驗組的與控制組的後測成績未呈顯著差異。 三、大部分的學生對於運用資料觀察法學習單字均給予正面回饋,也肯定資料觀察學習法活動的效益。另外,根據高、低分組學生訪談結果發現,英語程度的高低的確會影響學生對於資料觀察法的喜愛和需求。高分組的學生希望先以資料觀察學習法為開端,再以傳統講解式方式做總結。但對低分組的學生而言,喜歡參與小組討論。由於單字量的不足,低分組學生希望在語料庫為主的教材旁能附上中文解釋,降低學習焦慮。 根據上述研究結果,本研究建議大學英語教師在教學現場能夠融入語料觀察學習法並依照不同程度的學生進行教材設計,以助提升學生學習英語單字。 關鍵字:資料觀察學習法、近似詞、語料庫為本 / Corpus Linguistics has progressively become the center in different domains of language research. With such development of large corpora, the potential applications and possibilities of corpora in second language teaching and learning are extended. A discovery-based authentic learning environment is provided as well as created by such corpus-based language learning. Synonym or near-synonym learning is a difficult aspect of vocabulary learning, but a linguistic phenomenon with ubiquity. Hence, this research aims to investigate the effectiveness of the application of the data-driven learning (DDL) approach in near-synonym instruction and compare the teaching effect on the high and low achievers through the near-synonym instruction. Participants of this study were given instruction throughout the eight-week corpus-based teaching with materials compiled by the teacher. This is a quasi-experimental study consisting of a comparison between two experimental conditions, with a pre-post test and control-experimental group design, followed by a qualitative method of semi-structured interviews and a questionnaire provided to the experimental group of EFL university students in Taiwan. Two intact classes of 86 college students participated in this study. The quantitative analysis of the pre- and posttest scores and questionnaire was conducted through descriptive statistics and frequency analysis in order to explain the learning effects and learners’ perceptions. The results of the study revealed that: (1) participants in the experimental group made significant improvement in the posttest; (2) EFL high proficiency learners with the DDL approach performed better than high achievers who were taught by the traditional method. However, low achievers may not be able to benefit from the DDL approach in the form of concordance teaching materials; (3) the majority of the participants had positive feedback on DDL activities. Also, the types of preferred DDL activities were strongly influenced by students’ proficiency level. Low achievers preferred activities that involve Chinese translation as a supplementary note, while the high achievers were looking for the teacher’s explanation of words’ usages and functions in the end. This study demonstrates the importance of illuminating the dynamic relationship between the DDL approach and second language near-synonym learning, as well as provides English EFL teachers with a better concept to incorporate corpus or concordance lines into vocabulary instruction. Key words: data-driven learning, near-synonym, corpus-based approach
20

Effekten av textaugmenteringsstrategier på träffsäkerhet, F1-värde och viktat F1-värde / The effect of text data augmentation strategies on Accuracy, F1-score, and weighted F1-score

Svedberg, Jonatan, Shmas, George January 2021 (has links)
Att utveckla en sofistikerad chatbotlösning kräver stora mängder textdata för att kunna anpassa lösningen till en specifik domän. Att manuellt skapa en komplett uppsättning textdata, specialanpassat för den givna domänen och innehållandes ett stort antal varierande meningar som en människa kan tänkas yttra, är ett enormt tidskrävande arbete. För att kringgå detta tillämpas dataaugmentering för att generera mer data utifrån en mindre uppsättning redan existerande textdata. Softronic AB vill undersöka alternativa strategier för dataaugmentering med målet att eventuellt ersätta den nuvarande lösningen med en mer vetenskapligt underbyggd sådan. I detta examensarbete har prototypmodeller utvecklats för att jämföra och utvärdera effekten av olika textaugmenteringsstrategier. Resultatet av genomförda experiment med prototypmodellerna visar att augmentering genom synonymutbyten med en domänanpassad synonymordlista presenterade märkbart förbättrade effekter på förmågan hos en NLU-modell att korrekt klassificera data, gentemot övriga utvärderade strategier. Vidare indikerar resultatet att ett samband föreligger mellan den strukturella variationsgraden av det augmenterade datat och de tillämpade språkparens semantiska likhetsgrad under tillbakaöversättningar. / Developing a sophisticated chatbot solution requires large amounts of text data to be able to adapt the solution to a specific domain. Manually creating a complete set of text data, specially adapted for the given domain, and containing a large number of varying sentences that a human conceivably can express, is an exceptionally time-consuming task. To circumvent this, data augmentation is applied to generate more data based on a smaller set of already existing text data. Softronic AB wants to investigate alternative strategies for data augmentation with the aim of possibly replacing the current solution with a more scientifically substantiated one. In this thesis, prototype models have been developed to compare and evaluate the effect of different text augmentation strategies. The results of conducted experiments with the prototype models show that augmentation through synonym swaps with a domain-adapted thesaurus presented noticeably improved effects on the ability of an NLU-model to correctly classify data, compared to other evaluated strategies. Furthermore, the result indicates that there is a relationship between the structural degree of variation of the augmented data and the applied language pair's semantic degree of similarity during back-translations.
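As a rough, hypothetical illustration of the synonym-swap augmentation strategy that performed best here (not the prototype code from the thesis), new training utterances can be generated by probabilistically replacing words with entries from a domain-adapted thesaurus; the thesaurus contents and the sampling parameters below are invented.

import random

THESAURUS = {  # hypothetical domain-adapted synonym lists
    "faktura": ["räkning"],
    "ändra": ["uppdatera", "justera"],
}

def augment(sentence, n_variants=3, p_swap=0.5, seed=0):
    # Generate up to n_variants new sentences by randomly swapping in synonyms.
    rng = random.Random(seed)
    words = sentence.split()
    variants = set()
    for _ in range(n_variants * 5):  # oversample, then deduplicate
        new = [rng.choice(THESAURUS[w]) if w in THESAURUS and rng.random() < p_swap else w
               for w in words]
        variants.add(" ".join(new))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

print(augment("jag vill ändra min faktura"))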
