  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
101

Extracting Structured Data from Free-Text Clinical Notes : The impact of hierarchies in model training / Utvinna strukturerad data från fri-text läkaranteckningar : Påverkan av hierarkier i modelträning

Omer, Mohammad January 2021
Diagnosis code assignment is the task of automatically assigning diagnosis codes to free-text clinical notes. Assigning codes manually requires expertise and time. Automating the task makes it easier to obtain structured data from free-text clinical notes in Electronic Health Records, and it can also serve as decision support: clinicians enter their notes and receive diagnosis codes as a second opinion. This project investigates the effect of using the hierarchies in which diagnosis codes are organised when training diagnosis code assignment models, compared with models trained with a standard loss function, binary cross-entropy. Two diagnosis code systems were used, ICD-9 and SNOMED CT, the latter having the more detailed hierarchy. The results showed that hierarchical training increased recall regardless of which hierarchy was used, and that the more detailed SNOMED CT hierarchy increased recall more than the less detailed ICD-9 hierarchy. However, precision decreased with the SNOMED CT hierarchy, while the differences in precision with the ICD-9 hierarchy were not statistically significant. Measured by F1-score, the harmonic mean of precision and recall, the increase in recall did not make up for the decrease in precision when training with SNOMED CT. The conclusion is that a more detailed hierarchy increases recall more than a less detailed one, but overall performance measured by F1-score decreases because precision falls by more than recall rises. The less detailed hierarchy maintained precision, giving an increase in overall performance.
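The abstract's central trade-off rests on F1 being the harmonic mean of precision and recall. The following is a minimal sketch of that formula only; the precision and recall values are hypothetical, chosen purely for illustration, and are not results from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values (NOT from the thesis): recall rises from 0.40 to 0.50
# while precision drops from 0.60 to 0.40.
baseline = f1_score(precision=0.60, recall=0.40)
higher_recall = f1_score(precision=0.40, recall=0.50)

# Recall improved, yet F1 went down because precision fell further.
print(baseline, higher_recall)
```

With these illustrative numbers, the recall gain is outweighed by the precision loss, which is the pattern the abstract reports for the SNOMED CT hierarchy.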
102

Prerequisites for Extracting Entity Relations from Swedish Texts

Lenas, Erik January 2020
Natural language processing (NLP) is a vibrant area of research with many practical applications today, such as sentiment analysis, text labeling, question answering, machine translation and automatic text summarization. At the moment, research is mainly focused on the English language, although many other languages are trying to catch up. This work focuses on an area within NLP called information extraction, and more specifically on relation extraction, that is, extracting relations between entities in a text. The aim is to use machine learning techniques to build a Swedish language processing pipeline with part-of-speech tagging, dependency parsing, named entity recognition and coreference resolution, to serve as a base for later relation extraction from archival texts. The obvious difficulty lies in the scarcity of annotated Swedish datasets. For example, no sufficiently large Swedish dataset for coreference resolution exists today. An important part of this work, therefore, is to create a Swedish coreference solver using distantly supervised machine learning: an English coreference solver is applied to an unannotated bilingual corpus, a word aligner is used to transfer the machine-annotated English dataset to a Swedish dataset, and a Swedish model is then trained on that dataset. Using AllenNLP's end-to-end coreference resolution model, both for creating the Swedish dataset and for training the Swedish model, this work achieves an F1-score of 0.5. For named entity recognition this work uses the Swedish BERT models released by the Royal Library of Sweden in February 2020 and achieves an overall F1-score of 0.95. spaCy is used as a unifying framework to combine all of these NLP models within a single language processing pipeline.
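The distant-supervision step described above (annotate the English side, then carry the annotations across a word alignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names `project_span` and `project_clusters`, the span representation, and the toy sentence pair with its alignment are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def project_span(span: Span, alignment: Dict[int, int]) -> Optional[Span]:
    """Map a source-language span to the target language via a word alignment.

    `alignment` maps source token index -> target token index. Unaligned
    tokens are skipped; a span with no aligned tokens cannot be projected.
    """
    start, end = span
    targets = [alignment[i] for i in range(start, end + 1) if i in alignment]
    if not targets:
        return None
    return (min(targets), max(targets))


def project_clusters(
    clusters: List[List[Span]], alignment: Dict[int, int]
) -> List[List[Span]]:
    """Project coreference clusters, dropping clusters left with < 2 mentions."""
    projected = []
    for cluster in clusters:
        spans = [p for p in (project_span(s, alignment) for s in cluster) if p is not None]
        if len(spans) >= 2:
            projected.append(spans)
    return projected


# Toy, hand-made example (hypothetical data).
english = ["Anna", "said", "she", "was", "tired"]
swedish = ["Anna", "sa", "att", "hon", "var", "trött"]
alignment = {0: 0, 1: 1, 2: 3, 3: 4, 4: 5}  # "att" has no English counterpart
english_clusters = [[(0, 0), (2, 2)]]       # "Anna" and "she" corefer

swedish_clusters = project_clusters(english_clusters, alignment)
print(swedish_clusters)
```

In a real pipeline the clusters would come from an English coreference model and the alignment from a word aligner; both are replaced by hand-written values here.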
103

Design of a Robust and Flexible Grammar for Speech Control

Ludyga, Tomasz 28 May 2024
Voice interaction is an established automation and accessibility feature. While many satisfactory speech recognition solutions are available today, interpreting the semantics of the recognized text is difficult in some use cases. Two types of semantic extraction models can be distinguished: probabilistic and purely rule-based. Rule-based reasoning can be formalized into grammars and enables fast language validation, transparent decision-making and easy customization. In this thesis we develop a context-free ANTLR semantic grammar to control software by speech in a medical, smart-glasses-related domain. The implementation is preceded by research into the state of the art, requirements consultation and a thorough design of reusable system abstractions. The design includes definitions of a DSL, a meta grammar, a generic system architecture and tool support. Additionally, we investigate trivial and experimental grammar improvement techniques. Because of the multifaceted flexibility and robustness of the designed framework, we indicate its usability in critical and adaptive systems. We determine 75% semantic recognition accuracy in the main medical use case. We compare it against semantic extraction using spaCy and two fine-tuned AI classifiers. The evaluation reveals high accuracy of BERT for sequence classification and great potential for hybrid solutions with AI techniques on top of grammars, especially for the detection of alerts. The accuracy is strongly dependent on input quality, highlighting the importance of speech recognition tailored to a specific vocabulary.

Contents:
1 Introduction: 1.1 Motivation; 1.2 CAIS.ME Project; 1.3 Problem Statement; 1.4 Thesis Overview
2 Related Work
3 Foundational Concepts and Systems: 3.1 Human-Computer Interaction in Speech; 3.2 Speech Recognition (3.2.1 Open-source technologies; 3.2.2 Other technologies); 3.3 Language Recognition (3.3.1 Regular expressions; 3.3.2 Lexical tokenization; 3.3.3 Parsing; 3.3.4 Domain Specific Languages; 3.3.5 Formal grammars; 3.3.6 Natural Language Processing; 3.3.7 Model-Driven Engineering)
4 State-of-the-Art: Grammars: 4.1 Overview; 4.2 Workbenches for Grammar Design (4.2.1 ANTLR; 4.2.2 Xtext; 4.2.3 JetBrains MPS; 4.2.4 Other tools); 4.3 Design Approaches
5 Problem Analysis: 5.1 Methodology; 5.2 Identification of Use-Cases; 5.3 Requirements Analysis (5.3.1 Functional requirements; 5.3.2 Qualitative requirements; 5.3.3 Acceptance criteria)
6 Design: 6.1 Preprocessing; 6.2 Underlying Domain Specific Modelling (6.2.1 Language model definition; 6.2.2 Formalization; 6.2.3 Constraints); 6.3 Generic Grammar Syntax; 6.4 Architecture; 6.5 Integration of AI Techniques; 6.6 Grammar Improvement (6.6.1 Identification of synonyms; 6.6.2 Automatic addition of synonyms; 6.6.3 Addition of same-meaning strings; 6.6.4 Addition and modification of rules); 6.7 Processing of unrecognized input; 6.8 Summary
7 Implementation and Evaluation: 7.1 Development Environment; 7.2 Implementation (7.2.1 Grammar model transformation; 7.2.2 Output construction; 7.2.3 Testing; 7.2.4 Reusability for similar use-cases); 7.3 Limitations and Challenges; 7.4 Comparison to NLP Solutions
8 Conclusion: 8.1 Summary of Findings; 8.2 Future Research and Development
Acronyms; Bibliography; List of Figures; List of Tables; List of Listings
104

Community Recommendation in Social Networks with Sparse Data

Emad Rahmaniazad (9725117) 07 January 2021
Recommender systems are widely used in many domains. In this work, the importance of a recommender system in an online learning platform is discussed. After explaining the concept of adding an intelligent agent to online education systems, some features of the Course Networking (CN) website are demonstrated. Finally, the relation between CN, the intelligent agent (Rumi), and the recommender system is presented, along with a discussion of three different approaches for building a community recommendation system. The results show that Neighboring Collaborative Filtering (NCF) outperforms both the transfer learning method and the continuous bag-of-words approach. The NCF algorithm has a general format with two different implementations that can be used for other recommendations, such as course, skill, major, and book recommendations.
105

Recognising Moral Foundations in Online Extremist Discourse : A Cross-Domain Classification Study

van Luenen, Anne Fleur January 2020
So far, studies seeking to recognise moral foundations in texts have been relatively successful (Araque et al., 2019; Lin et al., 2018; Mooijman et al., 2017; Rezapour et al., 2019). There are, however, two issues with these studies. Firstly, gathering and annotating sufficient material for training is an extensive process. Secondly, models are only trained and tested within the same domain. It is as yet unexplored how these models for moral foundation prediction perform when tested in other domains, but from their experience with annotation, Hoover et al. (2017) describe how moral sentiments on one topic (e.g. black lives matter) might be completely different from moral sentiments on another (e.g. presidential elections). This study explores to what extent models generalise to other domains. More specifically, we focus on training on Twitter data from non-extremist sources and testing on data from an extremist (white nationalist) forum. We conducted two experiments. In our first experiment we test whether cross-domain classification of moral foundations is possible. Additionally, we compare the performance of a model using the Word2Vec embeddings used in previous studies with a model using the newer BERT embeddings. We find that although performance drops significantly on the extremist out-domain test sets, out-domain classification is not impossible. Furthermore, we find that the BERT model generalises marginally better to the out-domain test set than the Word2Vec model. In our second experiment we attempt to improve generalisation to extremist test data by providing contextual knowledge. Although this does not improve the model, it does show the model’s robustness against noise. Finally, we suggest an alternative approach to accounting for contextual knowledge.
106

Klasifikace vztahů mezi pojmenovanými entitami v textu / Classification of Relations between Named Entities in Text

Ondřej, Karel January 2020
This master's thesis deals with the extraction of relationships between named entities in text. The theoretical part of the thesis discusses the representation of natural language for machine processing. Subsequently, two subtasks of relationship extraction are defined, namely named entity recognition and classification of the relationships between entities, including a summary of state-of-the-art solutions. In the practical part of the thesis, a system for automatic extraction of relationships between named entities from downloaded pages is designed. The classification of relationships between entities is based on pre-trained transformers. Four pre-trained transformers are compared, namely BERT, XLNet, RoBERTa and ALBERT.
107

Parafrasidentifiering med maskinklassificerad data : utvärdering av olika metoder / Paraphrase identification with computer classified paraphrases : An evaluation of different methods

Johansson, Oskar January 2020
This work examines how well the language model BERT and a MaLSTM architecture identify paraphrases from the 'Microsoft Paraphrase Research Corpus' (MPRC) when trained on automatically identified paraphrases from the 'Paraphrase Database' (PPDB). The methods are compared to determine which performs best, and the approach of training on machine-classified data for use on human-classified data is evaluated against other classification of the same dataset. The sentence pairs used to train the models are taken from the highest-ranked paraphrases in PPDB, together with non-paraphrases created from the same dataset by a generation method. The results show that BERT is able to identify some paraphrases in MPRC, whereas the MaLSTM architecture was not, despite being able to distinguish paraphrases from non-paraphrases during training. Both BERT and MaLSTM performed worse at identifying paraphrases in MPRC than models such as StructBERT that were trained and evaluated on that same dataset. Reasons for MaLSTM's failure are discussed, chiefly that the sentences in the non-paraphrase training pairs differ from each other too much compared with how such pairs look in MPRC. Finally, the importance of further research into how machine-generated paraphrases can be used in paraphrase-related research is discussed.
108

Sentiment Analysis of Financial News with Supervised Learning

Syeda, Farha Shazmeen January 2020
Financial data in banks are unstructured and complicated. It is challenging to analyze these texts manually, and little labeled training data exists for financial text. Moreover, financial text uses the language of the economic domain, for which a general-purpose model is not efficient. In this thesis, data were collected from MFN (Modular Finance) financial news; the data were scraped and persisted in a database, and price indices were collected from a Bloomberg terminal. A comprehensive study and tests are conducted to find state-of-the-art results for classifying sentiment using traditional classifiers like Naive Bayes and transfer learning models like BERT and FinBERT. FinBERT outperforms the Naive Bayes and BERT classifiers. Time-series indices for sentiment are built, and their correlations with the price indices are calculated using Pearson correlation. The Augmented Dickey-Fuller (ADF) test is used to check whether both time series are stationary. Finally, the Granger causality test, a statistical hypothesis test, determines whether the sentiment time series helps predict price. The results show that there is a significant correlation and causal relation between sentiment and price.
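The correlation step named in the abstract can be illustrated with a plain-Python Pearson coefficient. This is a minimal sketch under stated assumptions: the `pearson` helper is written for illustration, the two short series are made-up numbers rather than thesis data, and the ADF and Granger causality tests are not reproduced here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily sentiment index and price index (illustration only).
sentiment = [0.1, 0.3, 0.2, 0.5, 0.4]
price = [100.0, 102.0, 101.0, 105.0, 104.0]

r = pearson(sentiment, price)
print(round(r, 3))  # close to 1 for these made-up, co-moving series
```

A high coefficient on its own says nothing about direction of influence, which is why the abstract follows the correlation with stationarity and Granger causality tests.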
109

Predicting Political Party Affiliation in the Swedish Parliament using Natural Language Processing

Zetterberg, Johannes January 2022
Text classification is a fundamental part of natural language processing. In this thesis, methods for text classification are used in an attempt to predict the political party affiliation of members of parliament (MPs). The objective is to evaluate the performance of Support Vector Machines (SVM), naive Bayes, and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model in predicting MPs' political party affiliation based on speeches given in the Chamber of the Swedish Parliament. This study shows that BERT outperforms SVM and naive Bayes in correctly classifying MPs, and that SVM makes better predictions than naive Bayes and performs reasonably well compared to BERT. The results show that all models predict MPs representing the Sweden Democrats correctly to the highest degree. Both BERT and SVM classify roughly every other speech correctly, which is much better than random prediction. These results indicate the potential of methods for automatically classifying political speeches.
110

Object Classification using Language Models

From, Gustav January 2022
In today’s digital world, more and more emails and messages must be sent, processed and handled. Categorizing and classifying these texts can take an incredibly long time and costs companies both time and money. If the classification could be done automatically by a computer, based on the content of the text or message, it would be a major gain for Easit AB and its customers. To facilitate text classification, Easit needs a two-part solution consisting of a language model and a classifier model. The language model converts raw text to a vector that represents the text, and the classifier determines which predefined labels fit the vector. The end goal is not to create the best solution but to build a general understanding of different language and classifier models and of how to design a system that is both fast and accurate. BERT was the primary language model during evaluation, but doc2Vec and One-Hot encoding were also tested. The classifier consisted of boundary-condition models or dense neural networks, all trained without knowledge of which language model the text vectors came from. The validation accuracy on the IMDB comment dataset with BERT was between 75% and 94%, depending mostly on the language model and not on the classifier. Neural networks suit best as classifiers, mainly because of their scalability to multiple labels. The knowledge from the work resulted in a recommendation to Easit for an alternative-based system solution.
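The two-part design described above (an encoder that turns text into a vector, and a classifier that sees only vectors) can be sketched as follows. This is a minimal illustration under stated assumptions: the one-hot style encoder stands in for the language model, a nearest-centroid classifier is swapped in for the thesis's boundary-condition models and dense networks, and all names and the toy support-ticket data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_vocab(texts: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct lower-cased token an index."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot_encoder(vocab: Dict[str, int]) -> Callable[[str], Vector]:
    """Return an encoder mapping text to a 0/1 vector over the vocabulary."""
    def encode(text: str) -> Vector:
        vec = [0.0] * len(vocab)
        for token in text.lower().split():
            idx = vocab.get(token)
            if idx is not None:  # unknown tokens are ignored
                vec[idx] = 1.0
        return vec
    return encode


class CentroidClassifier:
    """Nearest-centroid classifier; it only ever sees vectors, not text."""

    def fit(self, vectors: Sequence[Vector], labels: Sequence[str]) -> "CentroidClassifier":
        sums: Dict[str, Vector] = {}
        counts: Dict[str, int] = {}
        for vec, label in zip(vectors, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [value / counts[label] for value in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, vec: Vector) -> str:
        def sq_dist(centroid: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


# Toy training data (hypothetical).
train_texts = ["reset my password", "password login fails",
               "invoice is wrong", "wrong amount on invoice"]
train_labels = ["account", "account", "billing", "billing"]

encode = one_hot_encoder(build_vocab(train_texts))
clf = CentroidClassifier().fit([encode(t) for t in train_texts], train_labels)
prediction = clf.predict(encode("cannot login with my password"))
print(prediction)
```

Because the classifier depends only on the vector interface, the encoder could be replaced by doc2Vec or BERT embeddings without changing the classifier code, which mirrors the separation the abstract describes.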
