• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 83
  • 22
  • 20
  • 11
  • 5
  • 3
  • 2
  • 2
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 176
  • 124
  • 107
  • 63
  • 54
  • 50
  • 49
  • 48
  • 38
  • 32
  • 30
  • 28
  • 27
  • 20
  • 19
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Klasifikace vztahů mezi pojmenovanými entitami v textu / Classification of Relations between Named Entities in Text

Ondřej, Karel January 2020 (has links)
This master thesis deals with the extraction of relationships between named entities in the text. In the theoretical part of the thesis, the issue of natural language representation for machine processing is discussed. Subsequently, two partial tasks of relationship extraction are defined, namely named entities recognition and classification of relationships between them, including a summary of state-of-the-art solutions. In the practical part of the thesis, system for automatic extraction of relationships between named entities from downloaded pages is designed. The classification of relationships between entities is based on the pre-trained transformers. In this thesis, four pre-trained transformers are compared, namely BERT, XLNet, RoBERTa and ALBERT.
92

Rozpoznávání pojmenovaných entit / Named Entity Recognition

Rylko, Vojtěch January 2014 (has links)
In this master thesis are described the history and theoretical background of named-entity recognition and implementation of the system in C++ for named entity recognition and disambiguation. The system uses local disambiguation method and statistics generated from the  Wikilinks web dataset. With implemented system and with alternative implementations are performed various experiments and tests. These experiments show that the system is sufficiently successful and fast. System participates in the Entity Recognition and Disambiguation Challenge 2014.
93

Conversational Engine for Transportation Systems

Sidås, Albin, Sandberg, Simon January 2021 (has links)
Today's communication between operators and professional drivers takes place through direct conversations between the parties. This thesis project explores the possibility to support the operators in classifying the topic of incoming communications and which entities are affected through the use of named entity recognition and topic classifications. By developing a synthetic training dataset, a NER model and a topic classification model was developed and evaluated to achieve F1-scores of 71.4 and 61.8 respectively. These results were explained by a low variance in the synthetic dataset in comparison to a transcribed dataset from the real world which included anomalies not represented in the synthetic dataset. The aforementioned models were integrated into the dialogue framework Emora to seamlessly handle the back and forth communication and generating responses.
94

Transforming Legal Entity Recognition

Andersson-Säll, Tim January 2021 (has links)
Transformer-based architectures have in recent years advanced state-of-the-art performance in Natural Language Processing. Researchers have successfully adapted such models to downstream tasks within NLP in a domain-specific setting. This thesis examines the application of these models to the legal domain by doing Named Entity Recognition (NER) in a setting of scarce training data. Three different pre-trained BERT models are fine-tuned on a set of 101 court case documents, whereof one model is pre-trained on legal corpora and the other two on general corpora. Experiments are run to evaluate the models’ predictive performance given smaller or larger quantities of data to fine-tune on. Results show that BERT models work reasonably well for NER with legal data. Unlike many other domain-specific BERT models, the BERT model trained on legal corpora does not outperform the base models. Modest amounts of annotated data seem sufficient for reasonably good performance.
95

[pt] DOS TERMOS ÀS ENTIDADES NO DOMÍNIO DE PETRÓLEO / [en] FROM TERMS TO ENTITIES IN THE OIL AND GAS AREA

WOGRAINE EVELYN FARIA DIAS 09 September 2021 (has links)
[pt] Este trabalho tem como objetivo identificar uma terminologia e expressões relevantes do domínio de óleo e gás (OeG) e estruturá-la como uma taxonomia, tendo em vista o levantamento de itens para a anotação de entidades dentro do domínio. Para tanto, foi construída uma lista de termos relevantes da área, com base em diversas fontes, e, em seguida, a lista foi estruturada hierarquicamente por meio de regras. O processo de elaboração da taxonomia seguiu aspectos teóricometodológicos utilizados por diversos trabalhos semelhantes dentro da área. O trabalho procura evidenciar que a identificação de uma terminologia de um domínio técnico e a sua estruturação como taxonomia podem servir como a primeira etapa do levantamento de entidades de um domínio. Por conta disso, o trabalho também se propõe a discutir estratégias para identificação de entidade mencionada (EM) e possibilitar um diálogo entre duas áreas: Processamento de Linguagem Natural (PLN) e Linguística. De maneira geral, espera-se que a taxonomia ajudar a suprir, mesmo que de forma modesta, a escassez de recursos linguísticos para as técnicas do Processamento de Linguagem Natural (PLN) e da Extração de Informação (EI), dentro da área de óleo e gás. / [en] This work aims to identify a terminology and relevant expressions of the oil and gas domain and structure it as a taxonomy. To this end, a list of relevant terms in the area was built, based on various sources, and then the list was structured hierarchically by rules. The taxonomy elaboration process followed theoretical and methodological aspects used by several similar works within the area. The work tries to show that the identification of a technical domain terminology and its structuring as a taxonomy can serve as the first stage of the identification of entities in a domain. Because of this, the work also proposes to discuss strategies for identifying named entity and to enable a dialogue between two areas: Natural Language Processing (NLP) and Linguistics. In general, the taxonomy presented is expected to supply, at least in a modest way, the lack of linguistic resources for techniques of Natural Language Processing (NLP) and Information Extraction (EI), within the area of oil and gas.
96

Automatic Voice Trading Surveillance : Achieving Speech and Named Entity Recognition in Voice Trade Calls Using Language Model Interpolation and Named Entity Abstraction

Sundberg, Martin, Ohlsson, Mikael January 2023 (has links)
This master thesis explores the effectiveness of interpolating a larger generic speech recognition model with smaller domain-specific models to enable transcription of domain-specific conversations. The study uses a corpus within the financial domain collected from the web and processed by abstracting named entities such as financial instruments, numbers, as well as names of people and companies. By substituting each named entity with a tag representing the entity type in the domain-specific corpus, each named entity can be replaced during the hypothesis search by words added to the systems pronunciation dictionary. Thus making instruments and other domain-specific terms a matter of extension by configuration.  A proof-of-concept automatic speech recognition system with the ability to transcribe and extract named entities within the constantly changing domain of voice trading was created. The system achieved a 25.08 Word Error Rate and 0.9091 F1-score using stochastic and neural net based language models. The best configuration proved to be a combination of both stochastic and neural net based domain-specific models interpolated with a generic model. This shows that even though the models were trained using the same corpus, different models learned different aspects of the material. The study was deemed successful by the authors as the Word Error Rate was improved by model interpolation and all but one named entities were found in the test recordings by all configurations. By adjusting the amount of influence the domain-specific models had against the generic model, the results improved the transcription accuracy at the cost of named entity recognition, and vice versa. Ultimately, the choice of configuration depends on the business case and the importance of named entity recognition versus accurate transcriptions.
97

Named Entity Recognition for Search Queries in the Music Domain / Identifiering av namngivna enheter för sökfrågor inom musikdomänen

Liljeqvist, Sandra January 2016 (has links)
This thesis addresses the problem of named entity recognition (NER) in music-related search queries. NER is the task of identifying keywords in text and classifying them into predefined categories. Previous work in the field has mainly focused on longer documents of editorial texts. However, in recent years, the application of NER for queries has attracted increased attention. This task is, however, acknowledged to be challenging due to queries being short, ungrammatical and containing minimal linguistic context. The usage of NER for queries is especially useful for the implementation of natural language queries in domain-specific search applications. These applications are often backed by a database, where the query format otherwise is restricted to keyword search or the usage of a formal query language. In this thesis, two techniques for NER for music-related queries are evaluated; a conditional random field based solution and a probabilistic solution based on context words. As a baseline, the most elementary implementation of NER, commonly applied on editorial text, is used. Both of the evaluated approaches outperform the baseline and demonstrate an overall F1 score of 79.2% and 63.4% respectively. The experimental results show a high precision for the probabilistic approach and the conditional random field based solution demonstrates an F1 score comparable to previous studies from other domains. / Denna avhandling redogör för identifiering av namngivna enheter i musikrelaterade sökfrågor. Identifiering av namngivna enheter innebär att extrahera nyckelord från text och att klassificera dessa till någon av ett antal förbestämda kategorier. Tidigare forskning kring ämnet har framför allt fokuserat på längre redaktionella dokument. Däremot har intresset för tillämpningar på sökfrågor ökat de senaste åren. Detta anses vara ett svårt problem då sökfrågor i allmänhet är korta, grammatiskt inkorrekta och innehåller minimal språklig kontext. Identifiering av namngivna enheter är framför allt användbart för domänspecifika sökapplikationer där målet är att kunna tolka sökfrågor skrivna med naturligt språk. Dessa applikationer baseras ofta på en databas där formatet på sökfrågorna annars är begränsat till att enbart använda nyckelord eller användande av ett formellt frågespråk. I denna avhandling har två tekniker för identifiering av namngivna enheter för musikrelaterade sökfrågor undersökts; en metod baserad på villkorliga slumpfält (eng. conditional random field) och en probabilistisk metod baserad på kontextord. Som baslinje har den mest grundläggande implementationen, som vanligtvis används för redaktionella texter, valts. De båda utvärderade metoderna presterar bättre än baslinjen och ges ett F1-värde på 79,2% respektive 63,4%. De experimentella resultaten visar en hög precision för den probabilistiska implementationen och metoden ba- serad på villkorliga slumpfält visar på resultat på en nivå jämförbar med tidigare studier inom andra domäner.
98

Entity Information Extraction using Structured and Semi-structured resources

Sil, Avirup January 2014 (has links)
Among all the tasks that exist in Information Extraction, Entity Linking, also referred to as entity disambiguation or entity resolution, is a new and important problem which has recently caught the attention of a lot of researchers in the Natural Language Processing (NLP) community. The task involves linking/matching a textual mention of a named-entity (like a person or a movie-name) to an appropriate entry in a database (e.g. Wikipedia or IMDB). If the database does not contain the entity it should return NIL (out-of-database) value. Existing techniques for linking named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy. Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL, provides a 60% reduction in error over the next best NER system and a 68% reduction in error over the next-best EL system. We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope for relations between named-entities; for example, the relation president-Of (John F. Kennedy, USA) is true only in the time-frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF which is trained on distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. This proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric. / Computer and Information Science
99

Uncertainty Estimation on Natural Language Processing

He, Jianfeng 15 May 2024 (has links)
Text plays a pivotal role in our daily lives, encompassing various forms such as social media posts, news articles, books, reports, and more. Consequently, Natural Language Processing (NLP) has garnered widespread attention. This technology empowers us to undertake tasks like text classification, entity recognition, and even crafting responses within a dialogue context. However, despite the expansive utility of NLP, it frequently necessitates a critical decision: whether to place trust in a model's predictions. To illustrate, consider a state-of-the-art (SOTA) model entrusted with diagnosing a disease or assessing the veracity of a rumor. An incorrect prediction in such scenarios can have dire consequences, impacting individuals' health or tarnishing their reputation. Consequently, it becomes imperative to establish a reliable method for evaluating the reliability of an NLP model's predictions, which is our focus-uncertainty estimation on NLP. Though many works have researched uncertainty estimation or NLP, the combination of these two domains is rare. This is because most NLP research emphasizes model prediction performance but tends to overlook the reliability of NLP model predictions. Additionally, current uncertainty estimation models may not be suitable for NLP due to the unique characteristics of NLP tasks, such as the need for more fine-grained information in named entity recognition. Therefore, this dissertation proposes novel uncertainty estimation methods for different NLP tasks by considering the NLP task's distinct characteristics. The NLP tasks are categorized into natural language understanding (NLU) and natural language generation (NLG, such as text summarization). Among the NLU tasks, the understanding could be on two views, global-view (e.g. text classification at document level) and local-view (e.g. natural language inference at sentence level and named entity recognition at token level). As a result, we research uncertainty estimation on three tasks: text classification, named entity recognition, and text summarization. Besides, because few-shot text classification has captured much attention recently, we also research the uncertainty estimation on few-shot text classification. For the first topic, uncertainty estimation on text classification, few uncertainty models focus on improving the performance of text classification where human resources are involved. In response to this gap, our research focuses on enhancing the accuracy of uncertainty scores by bolstering the confidence associated with winning scores. we introduce MSD, a novel model comprising three distinct components: 'mix-up,' 'self-ensembling,' and 'distinctiveness score.' The primary objective of MSD is to refine the accuracy of uncertainty scores by mitigating the issue of overconfidence in winning scores while simultaneously considering various categories of uncertainty. seamlessly integrate with different Deep Neural Networks. Extensive experiments with ablation settings are conducted on four real-world datasets, resulting in consistently competitive improvements. Our second topic focuses on uncertainty estimation on few-shot text classification (UEFTC), which has few or even only one available support sample for each class. UEFTC represents an underexplored research domain where, due to limited data samples, a UEFTC model predicts an uncertainty score to assess the likelihood of classification errors. However, traditional uncertainty estimation models in text classification are ill-suited for UEFTC since they demand extensive training data, while UEFTC operates in a few-shot scenario, typically providing just a few support samples, or even just one, per class. To tackle this challenge, we introduce Contrastive Learning from Uncertainty Relations (CLUR) as a solution tailored for UEFTC. CLUR exhibits the unique capability to be effectively trained with only one support sample per class, aided by pseudo uncertainty scores. A distinguishing feature of CLUR is its autonomous learning of these pseudo uncertainty scores, in contrast to previous approaches that relied on manual specification. Our investigation of CLUR encompasses four model structures, allowing us to evaluate the performance of three commonly employed contrastive learning components in the context of UEFTC. Our findings highlight the effectiveness of two of these components. Our third topic focuses on uncertainty estimation on sequential labeling. Sequential labeling involves the task of assigning labels to individual tokens in a sequence, exemplified by Named Entity Recognition (NER). Despite significant advancements in enhancing NER performance in prior research, the realm of uncertainty estimation for NER (UE-NER) remains relatively uncharted but is of paramount importance. This topic focuses on UE-NER, seeking to gauge uncertainty scores for NER predictions. Previous models for uncertainty estimation often overlook two distinctive attributes of NER: the interrelation among entities (where the learning of one entity's embedding depends on others) and the challenges posed by incorrect span predictions in entity extraction. To address these issues, we introduce the Sequential Labeling Posterior Network (SLPN), designed to estimate uncertainty scores for the extracted entities while considering uncertainty propagation from other tokens. Additionally, we have devised an evaluation methodology tailored to the specific nuances of wrong-span cases. Our fourth topic focuses on an overlooked question that persists regarding the evaluation reliability of uncertainty estimation in text summarization (UE-TS). Text summarization, a key task in natural language generation (NLG), holds significant importance, particularly in domains where inaccuracies can have serious consequences, such as healthcare. UE-TS has garnered attention due to the potential risks associated with erroneous summaries. However, the reliability of evaluating UE-TS methods raises concerns, stemming from the interdependence between uncertainty model metrics and the wide array of NLG metrics. To address these concerns, we introduce a comprehensive UE-TS benchmark incorporating twenty-six NLG metrics across four dimensions. This benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model across two datasets. Additionally, it assesses the effectiveness of fourteen common uncertainty estimation methods. Our study underscores the necessity of utilizing diverse, uncorrelated NLG metrics and uncertainty estimation techniques for a robust evaluation of UE-TS methods. / Doctor of Philosophy / Text is integral to our daily activities, appearing in various forms such as social media posts, news articles, books, and reports. We rely on text for communication, information dissemination, and decision-making. Given its ubiquity, the ability to process and understand text through Natural Language Processing (NLP) has become increasingly important. NLP technology enables us to perform tasks like text classification, which involves categorizing text into predefined labels, and named entity recognition (NER), which identifies specific entities such as names, dates, and locations within text. Additionally, NLP facilitates generating coherent and contextually appropriate responses in conversational agents, enhancing human-computer interaction. However, the reliability of NLP models is crucial, especially in sensitive applications like medical diagnoses, where errors can have severe consequences. This dissertation focuses on uncertainty estimation in NLP, a less explored but essential area. Uncertainty estimation helps evaluate the confidence of NLP model predictions. We propose new methods tailored to various NLP tasks, acknowledging their unique needs. NLP tasks are divided into natural language understanding (NLU) and natural language generation (NLG). Within NLU, we look at tasks from two perspectives: a global view (e.g., document-level text classification) and a local view (e.g., sentence-level inference and token-level entity recognition). Our research spans text classification, named entity recognition (NER), and text summarization, with a special focus on few-shot text classification due to its recent prominence. For text classification, we introduce the MSD model, which includes three components to enhance uncertainty score accuracy and address overconfidence issues. This model integrates seamlessly with different neural networks and shows consistent improvements in experiments. For few-shot text classification, we develop Contrastive Learning from Uncertainty Relations (CLUR), designed to work effectively with minimal support samples per class. CLUR autonomously learns pseudo uncertainty scores, demonstrating effectiveness with various contrastive learning components. In NER, we address the unique challenges of entity interrelation and span prediction errors. We propose the Sequential Labeling Posterior Network (SLPN) to estimate uncertainty scores while considering uncertainty propagation from other tokens. For text summarization, we create a benchmark with tens of metrics to evaluate uncertainty estimation methods across two datasets. This benchmark helps assess the reliability of these methods, highlighting the need for diverse, uncorrelated metrics. Overall, our work advances the understanding and implementation of uncertainty estimation in NLP, providing more reliable and accurate predictions across different tasks.
100

Identifying Sensitive Data using Named Entity Recognition with Large Language Models : A comparison of transformer models fine-tuned for Named Entity Recognition

Ström Boman, Alfred January 2024 (has links)
Utvecklingen av artificiell intelligens och språkmodeller har ökat drastiskt under de senaste åren vilket medfört både möjligheter såväl som risker. Med en större användning av AI-relaterade produkter och människolika chattbotar har det medfört ett intresse av att kontrollera vilken sorts data som delas med dessa verktyg. Under särskilda omständigheter kan det förekomma data som till exempel information relaterat till personer, som inte får delas. Detta projekt har av denna anledning kretsat kring att använda och jämföra olika system för automatisk namnigenkänning, med målet att förhindra sådan data från att bli delad. I projektet jämfördes tre stycken olika alternativ för att implementera system för namnigenkänning, innan det mest lämpliga alternativet valdes för implementationen. Fortsättningsvis användes de tre förtränade transformer-modellerna GPT-SW3, TinyLlama och Mistral för implementationen där dessa tre blev finjusterade på två olika dataset. Implementationsfasen involverade applicering av tekniker för att öka datastorleken, databearbetning samt modellkvantisering innan de finjusterades för namnigenkänning. En uppsättning av utvärderingsmått bestående av bland annat F1-mått användes därefter för att mäta de tränade modellernas prestanda. De tre modellerna utvärderades och jämfördes med varandra utifrån resultatet från mätningen och träningen. Modellerna uppvisade varierande resultat och prestanda där både över- och underanpassning förekom. Avslutningsvis drogs slutsatsen om att TinyLlama var den bäst presterande modellen utifrån resultatet och övriga kringliggande aspekter. / The development of artificial intelligence and large language models has increased rapidly in recent years, bringing both opportunities and risks. With a broader use of AI related products such as human-like chatbots there has been an increase in interest in controlling the data that is being shared with them. In some scenarios there is data, such as personal or proprietary information, which should not be shared. This project has therefore revolved around utilizing and comparing different Named Entity Recognition systems to prevent such data from being shared. Three different approaches to implement Named Entity Recognition systems were compared before selecting the most appropriate one to further use for the actual implementation. Furthermore, three pre-trained transformer models, GPT-SW3, TinyLlama and Mistral, were used for the implementation where these were fine-tuned on two different datasets. The implementation phase included applying data augmentation techniques, data processing and model quantization before fine-tuning the models on Named Entity Recognition. A set of metrics including precision, recall and F1-score was further used to measure the performances of the trained models. The three models were compared and evaluated against each other based on the results obtained from the measurements and the training. The models showed varying results and performances where both overfitting and underfitting occured. Finally, the TinyLlama model was concluded to be the best model based on the obtained results and other considered aspects.

Page generated in 0.028 seconds