321 |
Using Natural Language Processing and Machine Learning for Analyzing Clinical Notes in Sickle Cell Disease Patients. Khizra, Shufa, January 2018
No description available.
|
322 |
Annotating Job Titles in Job Ads using Swedish Language Models. Ridhagen, Markus, January 2023
This thesis investigates automated annotation approaches to assist public authorities in Sweden in optimizing resource allocation and gaining valuable insights to enhance the preservation of high-quality welfare. The study uses pre-trained Swedish language models for the named entity recognition (NER) task of finding job titles in job advertisements from The Swedish Public Employment Service, Arbetsförmedlingen. Specifically, it evaluates the performance of the Swedish Bidirectional Encoder Representations from Transformers (BERT) model developed by the National Library of Sweden (KB), referred to as KB-BERT. The thesis explores the impact of training data size on the model's performance and examines whether active learning can enhance efficiency and accuracy compared to random sampling. The findings reveal that even with a small training dataset of 220 job advertisements, KB-BERT achieves a commendable F1-score of 0.770 in predicting job titles. The model's performance improves further when the training data is augmented with an additional 500 annotated job advertisements, yielding an F1-score of 0.834. Notably, the highest F1-score of 0.856 is achieved by applying the active learning strategy of uncertainty sampling, with mean entropy as the uncertainty measure. The test data provided by Arbetsförmedlingen was re-annotated to evaluate the complexity of the task; the human annotator achieved an F1-score of 0.883. Based on these findings, it can be inferred that KB-BERT performs satisfactorily in classifying job titles from job ads.
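A minimal sketch of the uncertainty-sampling step with mean entropy, assuming a hypothetical predict_probs function that returns the NER model's per-token label probabilities for a job ad; the function names and pool size k are illustrative, not taken from the thesis:

    import numpy as np

    def mean_entropy(token_probs):
        # token_probs: (num_tokens, num_labels) label probabilities from the NER model
        eps = 1e-12
        token_entropies = -np.sum(token_probs * np.log(token_probs + eps), axis=1)
        return float(np.mean(token_entropies))

    def select_most_uncertain(unlabeled_ads, predict_probs, k=500):
        # Score every unlabeled ad and hand the k most uncertain ones to the annotator.
        ranked = sorted(unlabeled_ads, key=lambda ad: mean_entropy(predict_probs(ad)), reverse=True)
        return ranked[:k]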
|
323 |
Embodied Metarepresentations. Hinrich, Nicolás; Foradi, Maryam; Yousef, Tariq; Hartmann, Elisa; Triesch, Susanne; Kaßel, Jan; Pein, Johannes, 06 June 2023
Meaning has been established pervasively as a central concept throughout the disciplines involved in the cognitive revolution. Its metaphoric usage comes to be, first and foremost, through the interpreter's constraint: representational relationships and contents are considered to be in the "eye" or mind of the observer, and shared properties among observers themselves are knowable through interlinguistic phenomena such as translation. Despite the instability of meaning in relation to its underdetermination by reference, it can be a tertium comparationis or "third comparator" for extended human cognition if gauged through invariants that exist in transfer processes such as translation, as all languages and cultures are rooted in pan-human experience and thus share and express a species-specific ontology. Meaning, seen as a cognitive competence, does not stop at the body but extends to, depends on, and partners with other agents and the environment. A novel approach for exploring the transfer properties of some constituent items of the original natural semantic metalanguage in English, that is, semantic primitives, is presented: FrameNet's semantic frames, evoked by the primes SEE and FEEL, were extracted from EuroParl, a parallel corpus that allows for the automatic word alignment of items with their synonyms; Large Ontology Multilingual Extraction was used for this step. Afterward, following the Semantic Mirrors Method, a procedure that consists of back-translating into the source language, a translatological examination of translated and original versions of the items was performed. A fully automated pipeline was designed and tested with the purpose of exploring the associated frame shifts and thus beginning a research agenda on their alleged universality as linguistic features of translation, which will be complemented with and contrasted against further massive feedback through a citizen science approach, as well as cognitive and neurophysiological examinations. Additionally, an embodied account of frame semantics is proposed.
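A minimal sketch of the back-translation step of the Semantic Mirrors Method, assuming word-alignment dictionaries already extracted from EuroParl; the toy alignments below are invented for illustration:

    def mirror_sets(source_word, src2tgt, tgt2src):
        # src2tgt / tgt2src map a word to the set of words it aligns with
        # in the parallel corpus (its t-image).
        translations = src2tgt.get(source_word, set())
        # Back-translating every target word into the source language yields
        # the inverse t-image, whose structure the method examines.
        back_translations = set()
        for t in translations:
            back_translations |= tgt2src.get(t, set())
        return translations, back_translations

    src2tgt = {"see": {"sehen", "ansehen"}}
    tgt2src = {"sehen": {"see", "view"}, "ansehen": {"see", "regard"}}
    t_image, inverse_t_image = mirror_sets("see", src2tgt, tgt2src)
    print(t_image, inverse_t_image)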
|
324 |
BERTie Bott's Every Flavor Labels : A Tasty Guide to Developing a Semantic Role Labeling Model for Galician. Bruton, Micaella, January 2023
For the vast majority of languages, Natural Language Processing (NLP) tools are either absent entirely or leave much to be desired in their final performance. Despite having nearly 4 million speakers, Galician is one such low-resource language. In an effort to expand the available NLP resources, this project constructed a dataset for Semantic Role Labeling (SRL) and produced a baseline for future research to use in comparisons. SRL is a task which has shown success in improving the final output of various NLP systems, including Machine Translation and other interactive language models. The project succeeded in that aim and produced 24 SRL models and two SRL datasets, one Galician and one Spanish. mBERT and XLM-R were chosen as the baseline architectures; additional models were first pre-trained on the SRL task in a language other than the target to measure the effects of transfer learning. Scores are reported on a scale of 0.0-1.0. The best-performing Galician SRL model achieved an F1 score of 0.74, introducing a baseline for future Galician SRL systems. The best-performing Spanish SRL model achieved an F1 score of 0.83, outperforming the baseline set by the 2009 CoNLL Shared Task by 0.025. A pre-processing method, verbal indexing, was also introduced, which increased performance in the SRL parsing of highly complex sentences; the effect was amplified when the model was both pre-trained and fine-tuned on datasets using the method, but still visible when it was only used during fine-tuning.
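The abstract describes verbal indexing only at a high level; a plausible minimal sketch, under the assumption that it marks the target predicate with sentinel tokens so the model knows which verb's arguments to label (the marker token is hypothetical):

    def verbal_index(tokens, verb_position, marker="[V]"):
        # Surround the target predicate with sentinel tokens so the encoder
        # can tell which verb's arguments are being labeled.
        indexed = list(tokens)
        indexed.insert(verb_position + 1, marker)
        indexed.insert(verb_position, marker)
        return indexed

    tokens = ["The", "cat", "that", "chased", "the", "mouse", "slept"]
    print(verbal_index(tokens, 3))
    # ['The', 'cat', 'that', '[V]', 'chased', '[V]', 'the', 'mouse', 'slept']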
|
325 |
Streamline searches in a database / Effektivisera sökningar i en databas. Ellerblad Valtonen, David; Franzén, André, January 2023
The objective of this thesis is to explore technologies and solutions to make a logistical flow more efficient. The logistical flow consists of a database containing materiel for purchase or repair. As of now, searches may either return too many results, several of which are irrelevant, or no results at all. A search needs to be very specific to retrieve the exact item, which requires extensive knowledge of the database and its contents. Areas that will be explored include Natural Language Processing and Machine Learning techniques. To solve this, a literature study will be conducted to gain insights into existing work and possible solutions, and Exploratory Data Analysis will be used to understand the patterns and limitations of the data.
|
326 |
Classification of invoices using a 2D NLP approach : A comparison between methods for invoice information extraction for the purpose of classification / Klassificering av fakturor med 2-dimensionell naturligtspråkbehandling : En jämförelse av metoder för extrahering av nyckelinformation från fakturor i klassificeringssyfte. Fredriksson, Linnéa, January 2023
Many companies handle a large number of invoices every year, and manually categorizing them takes a lot of time and resources. For a model to automatically categorize invoices, the documents need to be properly read and processed by the model. While traditional Natural Language Processing may be suitable for processing structured documents, unstructured documents such as invoices often need the layout to be considered in order for the document to be read correctly. Techniques that take the visual information into account when processing a document are referred to as 2D NLP. One such model that is state-of-the-art today is LayoutLMv3. This project provides a comparison of invoice-information extraction using LayoutLMv3 and plain Optical Character Recognition (OCR) for the purpose of invoice classification. LayoutLMv3 was fine-tuned for key-field extraction on 180 annotated invoices. The extracted key-fields were then used to form three different configurations of structured text strings for each document. The structured texts were used to train a classification model with three categories, A: physical product, B: service, and C: unknown. The results were compared with a baseline classification model trained on unstructured text obtained through OCR. The results show that all of the models achieved equal performance on the classification task. However, several inconsistencies in the annotations of the dataset were found. The project concluded that the raw OCR text proved useful for classification despite being unstructured, and that similar classification results could be obtained by considering only a few key-information fields. Obtaining a structured input through LayoutLMv3 proved especially useful for controlling the input to the classification model, such as omitting undesirable information. A drawback, however, is that some important information may in some cases be excluded.
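A hedged sketch of the key-field fine-tuning step using the Hugging Face LayoutLMv3 classes; the label set, file name, words, boxes (in LayoutLM's 0-1000 normalized coordinates), and word labels are illustrative placeholders rather than the thesis's actual fields:

    from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
    from PIL import Image

    labels = ["O", "B-SUPPLIER", "B-AMOUNT", "B-DATE"]  # placeholder key-field labels
    processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        "microsoft/layoutlmv3-base", num_labels=len(labels))

    image = Image.open("invoice.png").convert("RGB")
    words = ["Invoice", "Acme", "AB", "1200.00"]
    boxes = [[70, 40, 180, 60], [70, 90, 140, 110], [150, 90, 190, 110], [400, 300, 480, 320]]
    word_labels = [0, 1, 1, 2]  # one label id per word

    encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
    outputs = model(**encoding)
    outputs.loss.backward()  # one step of a standard fine-tuning loop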
|
327 |
Förbehandling och Hantering av Användarmärkningar på E-handelsartiklar / Preprocessing and Treatment of User Tags on E-commerce Articles. Johansson, Viktor, January 2023
Plick is an online platform with the intention of being a marketplace where users may buy and sell second-hand fashion. The platform caters to younger users, and as such borrows many ideas from well-known social network platforms, such as putting more focus on user profiles and expression rather than just the products themselves. One of these ideas is to allow users free rein over tagging their items, rather than having them select from some constrained, pre-approved group of categories, styles, sizes, et cetera. A problem with letting users tag products however they see fit is that a subset of users will inevitably try to 'game' the system by knowingly tagging their products with incorrect labels, resulting in inaccurate search results for many of these incorrect tags. The aim of this project is firstly to develop a pre-processing algorithm to normalize the user-generated tagging data, to handle situations such as a tag having multiple different (albeit possibly all correct) spellings, capitalizations, typos, languages, etc. The processed data will then be used to develop two different approaches to the problem of incorrect tagging. The first approach involves using the normalized data to create a graph representation of the tags and their relations to each other. Each node in the graph will represent an individual tag, and each edge between nodes will describe how closely related those two tags are. An algorithm will then be developed that, utilizing the tag-relation graph, describes the relatedness of an arbitrary group of tags; the algorithm should also be able to identify any tags that are outliers within the group. The second approach entails the development of a Gaussian Naive Bayes classifier, with the goal of identifying whether an article is anomalous or not, given the group of tags it has been assigned.
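A minimal sketch of the graph approach, assuming edge weights are co-occurrence counts across articles; the threshold and toy data are invented for illustration:

    from collections import Counter
    from itertools import combinations

    def build_tag_graph(tagged_articles):
        # Edge weight = number of articles in which two tags co-occur.
        edges = Counter()
        for tags in tagged_articles:
            for a, b in combinations(sorted(set(tags)), 2):
                edges[(a, b)] += 1
        return edges

    def relatedness(tag_a, tag_b, edges):
        return edges.get(tuple(sorted((tag_a, tag_b))), 0)

    def find_outliers(tags, edges, threshold=1):
        # A tag is an outlier if it is weakly related to every other tag in the group.
        return [t for t in tags
                if all(relatedness(t, other, edges) < threshold for other in tags if other != t)]

    articles = [["jeans", "denim", "vintage"], ["jeans", "denim"], ["vintage", "jacket"]]
    edges = build_tag_graph(articles)
    print(find_outliers(["jeans", "denim", "iphone"], edges))  # ['iphone']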
|
328 |
Characterizing, classifying and transforming language model distributions. Kniele, Annika, January 2023
Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined: the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then assigned to distribution classes based on how they differ from those of a differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices makes them more dissimilar.
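A minimal sketch of the three features, computed from two aligned next-token distributions p_small and p_large over the same vocabulary; the equal-width slicing over rank-sorted probabilities is an assumption, since the abstract does not spell out the slicing scheme:

    import numpy as np

    def entropy(p, eps=1e-12):
        return float(-np.sum(p * np.log(p + eps)))

    def top_p_token_count(p, top_p=0.9):
        # Number of tokens needed to cover the top-p probability mass.
        cumulative = np.cumsum(np.sort(p)[::-1])
        return int(np.searchsorted(cumulative, top_p) + 1)

    def distribution_features(p_small, p_large, top_p=0.9, num_slices=4):
        sorted_small, sorted_large = np.sort(p_small)[::-1], np.sort(p_large)[::-1]
        slices = np.array_split(np.arange(len(p_small)), num_slices)
        return {
            "entropy_diff": entropy(p_small) - entropy(p_large),
            "slice_mass_diff": [float(sorted_small[s].sum() - sorted_large[s].sum()) for s in slices],
            "top_p_count_diff": top_p_token_count(p_small, top_p) - top_p_token_count(p_large, top_p),
        }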
|
329 |
Clustering and Summarization of Chat Dialogues : To understand a company's customer base / Klustring och Summering av Chatt-Dialoger. Hidén, Oskar; Björelind, David, January 2021
The Customer Success department at Visma handles about 200 000 customer chats each year; the chat dialogues are stored and contain both questions and answers. In order to get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data based on clustering and summarization, aiming to decrease the time spent on the analysis and increase its quality. Models for clustering (K-means, DBSCAN and HDBSCAN) and extractive summarization (K-means, LSA and TextRank) are compared. Each algorithm is combined with three different text representations (TFIDF, S-BERT and FastText) to create models for evaluation. These models are evaluated against a test set created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models. The ROUGE measure, together with a qualitative evaluation, is used to evaluate the extractive summarization models. In addition, the best clustering model is further evaluated to understand how different data sizes impact performance. TFIDF unigrams together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models to the textual domain of customer chat dialogues, something that, to our knowledge, has not previously been done in the literature.
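A minimal sketch of the best-performing clustering pipeline (unigram TFIDF with K-means, evaluated with Silhouette Index) in scikit-learn; the example chats and cluster count are invented for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    chats = ["how do I export my invoices", "invoice export fails with an error",
             "please reset my password", "cannot log in after password change"]

    X = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(chats)  # unigram TFIDF
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels, silhouette_score(X, labels))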
|
330 |
Homograph Disambiguation and Diacritization for Arabic Text-to-Speech Using Neural Networks / Homografdisambiguering och diakritisering för arabiska text-till-talsystem med hjälp av neurala nätverk. Lameris, Harm, January 2021
Pre-processing Arabic text for Text-to-Speech (TTS) systems poses major challenges, as Arabic omits short vowels in writing. This omission leads to a large number of homographs, and means that Arabic text needs to be diacritized to disambiguate these homographs, in order to be matched up with the intended pronunciation. Diacritizing Arabic has generally been achieved by using rule-based, statistical, or hybrid methods that combine rule-based and statistical methods. Recently, diacritization methods involving deep learning have shown promise in reducing error rates. These deep-learning methods are not yet commonly used in TTS engines, however. To examine neural diacritization methods for use in TTS engines, we normalized and pre-processed a version of the Tashkeela corpus, a large diacritized corpus containing largely Classical Arabic texts, for TTS purposes. We then trained and tested three state-of-the-art Recurrent-Neural-Network-based models on this data set. Additionally we tested these models on the Wiki News corpus, a test set that contains Modern Standard Arabic (MSA) news articles and thus more closely resembles most TTS queries. The models were evaluated by comparing the Diacritic Error Rate (DER) and Word Error Rate (WER) achieved for each data set to one another and to the DER and WER reported in the original papers. Moreover, the per-diacritic accuracy was examined, and a manual evaluation was performed. For the Tashkeela corpus, all models achieved a lower DER and WER than reported in the original papers. This was largely the result of using more training data in addition to the TTS pre-processing steps that were performed on the data. For the Wiki News corpus, the error rates were higher, largely due to the domain gap between the data sets. We found that for both data sets the models overfit on common patterns and the most common diacritic. For the Wiki News corpus the models struggled with Named Entities and loanwords. Purely neural models generally outperformed the model that combined deep learning with rule-based and statistical corrections. These findings highlight the usability of deep learning methods for Arabic diacritization in TTS engines as well as the need for diacritized corpora that are more representative of Modern Standard Arabic.
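A minimal sketch of how the Diacritic Error Rate can be computed, assuming predicted and reference diacritic labels aligned per character; conventions (e.g. whether case endings count) vary between papers:

    def diacritic_error_rate(predicted, reference):
        # predicted / reference: per-character diacritic labels
        # (e.g. fatha, kasra, damma, sukun, or none), aligned by position.
        assert len(predicted) == len(reference)
        errors = sum(p != r for p, r in zip(predicted, reference))
        return errors / len(reference)

    ref  = ["fatha", "none", "damma", "kasra"]
    pred = ["fatha", "none", "fatha", "kasra"]
    print(diacritic_error_rate(pred, ref))  # 0.25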
|