81 |
Data Fusion and Text Mining for Supporting Journalistic Work
Vermes, Zsombor January 2022 (has links)
During the past several decades, journalists have been struggling with the ever-growing amount of data on the internet. Investigating the validity of sources or finding similar articles for a story can consume a lot of time and effort. These problems are further amplified by the shrinking staff of news agencies. The solution is to empower the remaining professional journalists with digital tools created by computer scientists. This thesis project is inspired by the idea of supporting journalistic work with interactive visual interfaces and artificial intelligence. More specifically, within the scope of this thesis project, we created a backend module that supports several text mining methods, such as keyword extraction, named entity recognition, sentiment analysis, and fake news classification, as well as data collection from various data sources, to help professionals in the field of journalism. To implement our system, we first gathered requirements from several researchers and practitioners in journalism, media studies, and computer science, then reviewed the literature on current approaches. Results are evaluated both with quantitative methods, such as individual component benchmarks, and with qualitative methods, by analyzing the outcomes of semi-structured interviews with collaborating and external domain experts. Our results show that the domain experts' perceived value of the components aligns with their performance in the individual evaluations. This indicates that there is potential in this research area and that future work would be welcomed by the journalistic community.
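The abstract names the backend's text mining components without implementation detail; a minimal sketch of how such a module might wire together off-the-shelf parts follows. The pipeline defaults and the frequency-based keyword extractor are illustrative assumptions, not the thesis's actual choices.

```python
# Sketch of a journalistic text-mining backend built from off-the-shelf
# Hugging Face pipelines; model choices are illustrative assumptions.
from collections import Counter

from transformers import pipeline

# Named entity recognition and sentiment analysis via pre-trained pipelines.
ner = pipeline("ner", aggregation_strategy="simple")
sentiment = pipeline("sentiment-analysis")

def keywords(text: str, k: int = 5) -> list[str]:
    """Naive frequency-based keyword extraction (stand-in for a real method)."""
    tokens = [t.lower().strip(".,!?") for t in text.split() if len(t) > 4]
    return [w for w, _ in Counter(tokens).most_common(k)]

def analyze(article: str) -> dict:
    """Run all text-mining components on one article."""
    return {
        "entities": ner(article),
        "sentiment": sentiment(article[:512])[0],  # crude character truncation
        "keywords": keywords(article),
    }

print(analyze("The European Commission met in Brussels to discuss energy prices."))
```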
|
82 |
Investigating Few-Shot Transfer Learning for Address Parsing : Fine-Tuning Multilingual Pre-Trained Language Models for Low-Resource Address Segmentation / En Undersökning av Överföringsinlärning för Adressavkodning med Få Exempel : Finjustering av För-Tränade Språkmodeller för Låg-Resurs Adress Segmentering
Heimisdóttir, Hrafndís January 2022 (has links)
Address parsing is the process of splitting an address string into its different address components, such as street name, street number, et cetera. Address parsing has been researched quite extensively, and some state-of-the-art address parsing solutions exist, mostly monolingual. In recent years, research focusing on multinational address parsing has emerged, and deep-architecture address parsers have been used to achieve state-of-the-art performance on multinational address data. However, training these deep architectures for address parsing requires a rather large amount of address data, which is not always accessible. Within Natural Language Processing (NLP), data is generally difficult to come by, and most of the available NLP data covers only about 20 of the approximately 7000 languages spoken around the world, the so-called high-resource languages. This also applies to address data, which can be difficult to come by for some of the world's so-called low-resource languages, for which little or no NLP data exists. To address the lack of address data for some of the less spoken languages of the world, the current project investigates the potential of Few-Shot Learning (FSL) for multinational address parsing. To investigate this, two few-shot transfer learning models are implemented, both consisting of a fine-tuned pre-trained language model (PTLM). The difference between the two models is the PTLM used: the multilingual language models mBERT and XLM-R, respectively. The two PTLMs are fine-tuned with a linear classifier layer and then used as multinational address parsers. The two models are trained and their results are compared with a state-of-the-art multinational address parser, Deepparse, as well as with each other. Results show that the two models do not outperform Deepparse, but they do show promising results, not far from what Deepparse achieves on holdout and zero-shot datasets. On a mix of low- and high-resource language address data, both models perform well and achieve an overall F1-score above 96%. Of the two models, XLM-R achieves significantly better results than mBERT and can therefore be considered the more appropriate PTLM for multinational FSL address parsing. Based on these results, the conclusion is that there is great potential for FSL within the field of multinational address parsing and that general FSL methods can be used and perform well on multinational address parsing tasks.
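The core setup, a multilingual encoder with a linear classifier layer fine-tuned for token-level address segmentation, could be sketched as below. The label set, checkpoint, and hyperparameters are illustrative assumptions, and the dummy labels stand in for gold tags aligned to subword tokens.

```python
# Sketch: XLM-R with a linear token-classification head for address parsing.
# Labels and hyperparameters are illustrative assumptions, not the thesis's.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["StreetName", "StreetNumber", "PostalCode", "City", "Country"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)  # adds a randomly initialized linear classifier layer on top of the encoder

enc = tokenizer("Baker Street 221b London", return_tensors="pt")
# Dummy per-subword labels for illustration; real code aligns gold labels to
# subword tokens and masks special tokens with -100.
labels = torch.zeros_like(enc["input_ids"])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=labels).loss  # cross-entropy over address tags
loss.backward()
optimizer.step()
```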
|
83 |
Fine-tuning a BERT-based NER Model for Positive Energy Districts
Ortega, Karen, Sun, Fei January 2023 (has links)
This research presents an approach to extracting information about Positive Energy Districts (PEDs), urban areas that generate surplus energy. PEDs are integral to the European Commission's SET Plan, tackling housing challenges arising from population growth. The study fine-tunes BERT to categorize PED-related entities, producing a NER model and an integrated pipeline of diverse NER tools and data sources. In pipeline evaluations, the model achieves an accuracy of 0.81 and an F1-score of 0.55, with notably high confidence scores, supporting its practical applicability. While the F1-score falls short of expectations, this early exploration of PED information extraction sets the stage for future refinements and studies. This research advances NER processes for Positive Energy Districts, supporting their development and implementation.
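The reported figures suggest standard token- and entity-level scoring; a small sketch of how such scores are typically computed over BIO-tagged sequences follows. The PED tag set and the example sequences are invented for illustration, not taken from the study.

```python
# Sketch: entity-level evaluation of an NER model with seqeval.
# The gold/predicted tag sequences are invented for illustration.
from seqeval.metrics import accuracy_score, f1_score

gold = [["B-PED", "I-PED", "O", "B-ORG", "O"]]
pred = [["B-PED", "I-PED", "O", "O", "O"]]

print("accuracy:", accuracy_score(gold, pred))  # token-level accuracy
print("F1:", f1_score(gold, pred))              # entity-level F1
```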
|
84 |
Active Learning for Named Entity Recognition with Swedish Language Models / Aktiv Inlärning för Namnigenkänning med Svenska Språkmodeller
Öhman, Joey January 2021 (has links)
Recent advances in Natural Language Processing have cleared the path for many new applications. This is primarily a consequence of the transformer model and the transfer-learning capabilities provided by models like BERT. However, task-specific labeled data is required to fine-tune these models. To alleviate the expensive process of labeling data, Active Learning (AL) aims to maximize the information gained from each label. By including a model in the annotation process, the informativeness of each unlabeled sample can be estimated, allowing human annotators to focus on vital samples and avoid redundancy. This thesis investigates to what extent AL can accelerate model training with respect to the number of labels required. In particular, the focus is on pre-trained Swedish language models in the context of Named Entity Recognition. The data annotation process is simulated using existing labeled datasets to evaluate multiple AL strategies. Experiments are evaluated by analyzing the F1 score achieved by models trained on the data selected by each strategy. The results show that AL can significantly accelerate model training and hence reduce the manual annotation effort. The state-of-the-art strategy for sentence classification, ALPS, shows no sign of accelerating the model training. However, uncertainty-based strategies consistently outperform random selection. Under certain conditions, these strategies can reduce the number of labels required by more than a factor of two.
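Uncertainty-based selection, the family of strategies that performed well here, can be sketched with least-confidence sampling as below; the model interface and batching are assumptions, not the thesis's code.

```python
# Sketch: least-confidence sampling for active learning, assuming a
# token-classification model that returns logits. Names are illustrative.
import torch

def least_confidence_selection(model, pool, k):
    """Pick the k unlabeled samples the model is least confident about."""
    scores = []
    model.eval()
    with torch.no_grad():
        for enc in pool:  # pool: pre-tokenized, batched inputs
            probs = torch.softmax(model(**enc).logits, dim=-1)
            # Confidence = probability of the most likely label, averaged
            # over tokens for a sequence-labeling task such as NER.
            scores.append(probs.max(dim=-1).values.mean().item())
    ranked = sorted(range(len(pool)), key=lambda i: scores[i])
    return ranked[:k]  # indices of the most informative samples to annotate
```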
|
85 |
Bootstrapping Annotated Job Ads using Named Entity Recognition and Swedish Language Models / Identifiering av namngivna enheter i jobbannonser genom användning av semi-övervakade tekniker och svenska språkmodeller
Nyqvist, Anna January 2021 (has links)
Named entity recognition (NER) is a task that concerns detecting and categorising certain information in text. A promising approach for NER that has recently emerged is fine-tuning Transformer-based language models for this specific task. However, these models may require a relatively large quantity of labelled data to perform well. This can limit NER models' applicability in real-world applications, as manual annotation is often costly and time-consuming. In this thesis, we investigate the learning curve of human annotation and of a NER model during a semi-supervised bootstrapping process. Special emphasis is given to the influence of the number of classes and the amount of training data used in the process. We first annotate a set of collected job advertisements, then apply bootstrapping using both annotated and unannotated data while continuously fine-tuning a pre-trained Swedish BERT model. The initial class system is simplified during the bootstrapping process according to model performance and inter-annotator agreement. Model performance increased as the training set grew larger, with a final micro F1-score of 54%. This result provides a good baseline, and we point out several improvements that can be made to further enhance performance. We further identify classes handled differently by the annotators and discuss potential reasons why. Suggestions for future work include adjusting the current class system further by removing classes identified as low-performing in this thesis.
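The bootstrapping procedure described (fine-tune, pseudo-label, keep confident predictions, repeat) might look like the loop below. The helper functions fine_tune and predict and the confidence threshold are hypothetical placeholders for the thesis's actual steps.

```python
# Sketch of the bootstrapping loop: fine-tune on labeled data, pseudo-label
# unlabeled job ads, keep confident predictions, and repeat. fine_tune and
# predict are hypothetical placeholders, not the thesis's implementation.
def bootstrap(model, labeled, unlabeled, rounds=3, threshold=0.9):
    for _ in range(rounds):
        model = fine_tune(model, labeled)          # e.g. a Swedish BERT model
        confident, rest = [], []
        for ad in unlabeled:
            tags, confidence = predict(model, ad)  # model's own annotations
            (confident if confidence >= threshold else rest).append((ad, tags))
        labeled = labeled + confident              # grow the training set
        unlabeled = [ad for ad, _ in rest]
    return model, labeled
```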
|
86 |
Adaptive Semantic Annotation of Entity and Concept Mentions in Text
Mendes, Pablo N. 05 June 2014
No description available.
|
87 |
Arabic named entity recognition
Benajiba, Yassine 24 May 2010 (has links)
This doctoral thesis describes research carried out to determine the best techniques for building a Named Entity Recognizer for Arabic. Such a system would be able to identify and classify the named entities found in open-domain Arabic text.
The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering, Machine Translation, etc.) achieve better results thanks to the enrichment it adds to the text. The literature contains various works investigating the NER task for a specific language or from a language-independent perspective. However, very few studies of this task for Arabic have been published so far.
Arabic has a special orthography and a complex morphology, and these aspects pose new challenges for NER research. A complete investigation of NER for Arabic would not only provide the techniques needed to achieve high performance, but also an error analysis and a discussion of the results that benefit the NER research community. The main goal of this thesis is to meet that need. To this end, we have:
1. Produced a study of the different aspects of Arabic related to this task;
2. Analyzed the state of the art in NER;
3. Carried out a comparison of the results obtained by different machine learning techniques;
4. Developed a method based on combining different classifiers, where each classifier handles a single class of named entities and uses the feature set and machine learning technique best suited to that class (see the sketch below).
Our experiments were evaluated on nine test sets. / Benajiba, Y. (2009). Arabic named entity recognition [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318
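A minimal sketch of the per-class combination idea from point 4, assuming toy data, toy features, and arbitrary learner choices; the thesis's actual feature sets and classifiers differ.

```python
# Sketch of the combination scheme: one binary classifier per named-entity
# class, each free to use its own feature set and learning algorithm, with
# predictions merged afterwards. Data, features, and learners are toy choices.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def features(tokens):
    # Toy features; the thesis selects a feature set per NE class.
    return [{"tok": t, "title": t.istitle(), "len": len(t)} for t in tokens]

train_tokens = ["Yassine", "visited", "Valencia", "yesterday"]
per_class_gold = {
    "PERSON": [1, 0, 0, 0],
    "LOCATION": [0, 0, 1, 0],
}
learners = {"PERSON": LogisticRegression(), "LOCATION": LinearSVC()}

models = {}
for ne_class, gold in per_class_gold.items():
    clf = make_pipeline(DictVectorizer(), learners[ne_class])
    models[ne_class] = clf.fit(features(train_tokens), gold)

def tag(tokens):
    merged = ["O"] * len(tokens)
    for ne_class, clf in models.items():
        for i, y in enumerate(clf.predict(features(tokens))):
            if y == 1 and merged[i] == "O":  # first-wins merge of class outputs
                merged[i] = ne_class
    return merged

print(tag(["Omar", "flew", "to", "Madrid"]))
```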
|
88 |
[en] AN END-TO-END MODEL FOR JOINT ENTITY AND RELATION EXTRACTION IN PORTUGUESE / [pt] MODELO END-TO-END PARA EXTRAÇÃO DE ENTIDADES E RELAÇÕES DE FORMA CONJUNTA EM PORTUGUÊS
LUCAS AGUIAR PAVANELLI 24 October 2022 (has links)
[en] Natural language processing (NLP) techniques have recently become popular. The range of applications that benefit from NLP is extensive, from building machine translation systems to helping market a product. Within NLP, the Information Extraction (IE) field is widespread; it focuses on processing texts to retrieve specific information about a particular entity or concept. Still, the research community mainly focuses on building models for English data. This thesis addresses three tasks in the IE domain: Named Entity Recognition, Relation Extraction, and Joint Entity and Relation Extraction. First, we created a novel Portuguese dataset in the biomedical domain, described the annotation process, and measured its properties. Also, we developed a novel model for the Joint Entity and Relation Extraction task, verifying that it is competitive compared to other models. Finally, we carefully evaluated the proposed models on non-English language datasets and confirmed the dominance of neural-based models.
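The abstract does not detail the joint architecture; a common shape for such end-to-end models, a shared encoder with an entity-tagging head and a relation head over span pairs, is sketched below as an assumption, not as the thesis's actual design.

```python
# Sketch of a joint entity-and-relation model: a shared encoder, one head
# that tags entity spans, and one head that classifies the relation between
# pairs of span representations. Sizes and label counts are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

class JointModel(nn.Module):
    def __init__(self, n_entity_tags=5, n_relations=3,
                 name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        h = self.encoder.config.hidden_size
        self.entity_head = nn.Linear(h, n_entity_tags)      # BIO tagging
        self.relation_head = nn.Linear(2 * h, n_relations)  # pairs of spans

    def forward(self, input_ids, attention_mask, span_a, span_b):
        hidden = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        entity_logits = self.entity_head(hidden)
        # Represent each span by its first subword vector, classify the pair.
        pair = torch.cat([hidden[:, span_a], hidden[:, span_b]], dim=-1)
        relation_logits = self.relation_head(pair)
        return entity_logits, relation_logits
```

A real system would enumerate candidate spans and train both heads jointly with a combined loss; the two heads here only illustrate the shared-encoder design.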
|
89 |
Extracting and Aggregating Temporal Events from Texts
Döhling, Lars 11 October 2017 (has links)
Finding reliable information about given events from large and dynamic text collections, such as the web, is a topic of great interest. For instance, rescue teams and insurance companies are interested in concise facts about damages after disasters, which can be found today in web blogs, online newspaper articles, social media, etc. Knowing these facts helps to determine the required scale of relief operations and supports their coordination. However, finding, extracting, and condensing specific facts is a highly complex undertaking: It requires identifying appropriate textual sources and their temporal alignment, recognizing relevant facts within these texts, and aggregating extracted facts into a condensed answer despite inconsistencies, uncertainty, and changes over time. In this thesis, we present and evaluate techniques and solutions for each of these problems, embedded in a four-step framework. Applied methods are pattern matching, natural language processing, and machine learning. We also report the results for two case studies applying our entire framework: gathering data on earthquakes and floods from web documents. Our results show that it is, under certain circumstances, possible to automatically obtain reliable and timely data from the web.
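Pattern matching, the first of the methods listed, might look like the following for disaster facts; the regular expressions and the report sentence are simplified illustrations, not the patterns used in the thesis.

```python
# Sketch of pattern-based fact extraction from disaster reports: regular
# expressions pull out dates and casualty counts, which can then be
# aggregated across documents. Patterns are simplified illustrations.
import re

DATE = re.compile(r"\b(\d{1,2} (?:January|February|March|April|May|June|July|"
                  r"August|September|October|November|December) \d{4})\b")
DEATHS = re.compile(r"\b(\d[\d,]*)\s+(?:people\s+)?(?:dead|killed|deaths)\b",
                    re.IGNORECASE)

def extract(text):
    return {
        "dates": DATE.findall(text),
        "death_tolls": [int(m.replace(",", "")) for m in DEATHS.findall(text)],
    }

report = "On 12 January 2010 an earthquake struck; officials said 230,000 people dead."
print(extract(report))  # {'dates': ['12 January 2010'], 'death_tolls': [230000]}
```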
|
90 |
Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision
Täckström, Oscar January 2013 (has links)
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language.
|