Global ETD Search

71	Annotating Job Titles in Job Ads using Swedish Language Models Ridhagen, Markus January 2023 (has links) This thesis investigates automated annotation approaches to assist public authorities in Sweden in optimizing resource allocation and gaining valuable insights to enhance the preservation of high-quality welfare. The study uses pre-trained Swedish language models for the named entity recognition (NER) task of finding job titles in job advertisements from The Swedish Public Employment Service, Arbetsförmedlingen. Specifically, it evaluates the performance of the Swedish Bidirectional Encoder Representations from Transformers (BERT), developed by the National Library of Sweden (KB), referred to as KB-BERT. The thesis explores the impact of training data size on the models’ performance and examines whether active learning can enhance efficiency and accuracy compared to random sampling. The findings reveal that even with a small training dataset of 220 job advertisements, KB-BERT achieves a commendable F1-score of 0.770 in predicting job titles. The model’s performance improves further by augmenting the training data with an additional 500 annotated job advertisements, yielding an F1-score of 0.834. Notably, the highest F1-score of 0.856 is achieved by applying the active learning strategy of uncertainty sampling and the measure of mean entropy. The test data provided by Arbetsförmedlingen was re-annotated to evaluate the complexity of the task. The human annotator achieved an F1-score of 0.883. Based on these findings, it can be inferred that KB-BERT performs satisfactorily in classifying job titles from job ads. Natural language processing NLP Named-entity recognition NER BERT Active Learning Probability Theory and Statistics Sannolikhetsteori och statistik
72	Exploring Construction of a Company Domain-Specific Knowledge Graph from Financial Texts Using Hybrid Information Extraction Jen, Chun-Heng January 2021 (has links) Companies do not exist in isolation. They are embedded in structural relationships with each other. Mapping a given company’s relationships with other companies in terms of competitors, subsidiaries, suppliers, and customers are key to understanding a company’s major risk factors and opportunities. Conventionally, obtaining and staying up to date with this key knowledge was achieved by reading financial news and reports by highly skilled manual labor like a financial analyst. However, with the development of Natural Language Processing (NLP) and graph databases, it is now possible to systematically extract and store structured information from unstructured data sources. The current go-to method to effectively extract information uses supervised machine learning models, which require a large amount of labeled training data. The data labeling process is usually time-consuming and hard to get in a domain-specific area. This project explores an approach to construct a company domain-specific Knowledge Graph (KG) that contains company-related entities and relationships from the U.S. Securities and Exchange Commission (SEC) 10-K filings by combining a pre-trained general NLP with rule-based patterns in Named Entity Recognition (NER) and Relation Extraction (RE). This approach eliminates the time-consuming data-labeling task in the statistical approach, and by evaluating ten 10-k filings, the model has the overall Recall of 53.6%, Precision of 75.7%, and the F1-score of 62.8%. The result shows it is possible to extract company information using the hybrid methods, which does not require a large amount of labeled training data. However, the project requires the time-consuming process of finding lexical patterns from sentences to extract company-related entities and relationships. / Företag existerar inte som isolerade organisationer. De är inbäddade i strukturella relationer med varandra. Att kartlägga ett visst företags relationer med andra företag när det gäller konkurrenter, dotterbolag, leverantörer och kunder är nyckeln till att förstå företagets huvudsakliga riskfaktorer och möjligheter. Det konventionella sättet att hålla sig uppdaterad med denna viktiga kunskap var genom att läsa ekonomiska nyheter och rapporter från högkvalificerad manuell arbetskraft som till exempel en finansanalytiker. Men med utvecklingen av ”Natural Language Processing” (NLP) och grafdatabaser är det nu möjligt att systematiskt extrahera och lagra strukturerad information från ostrukturerade datakällor. Den nuvarande metoden för att effektivt extrahera information använder övervakade maskininlärningsmodeller som kräver en stor mängd märkta träningsdata. Datamärkningsprocessen är vanligtvis tidskrävande och svår att få i ett domänspecifikt område. Detta projekt utforskar ett tillvägagångssätt för att konstruera en företagsdomänspecifikt ”Knowledge Graph” (KG) som innehåller företagsrelaterade enheter och relationer från SEC 10-K-arkivering genom att kombinera en i förväg tränad allmän NLP med regelbaserade mönster i ”Named Entity Recognition” (NER) och ”Relation Extraction” (RE). Detta tillvägagångssätt eliminerar den tidskrävande datamärkningsuppgiften i det statistiska tillvägagångssättet och genom att utvärdera tio SEC 10-K arkiv har modellen den totala återkallelsen på 53,6 %, precision på 75,7 % och F1-poängen på 62,8 %. Resultatet visar att det är möjligt att extrahera företagsinformation med hybridmetoderna, vilket inte kräver en stor mängd märkta träningsdata. Projektet kräver dock en tidskrävande process för att hitta lexikala mönster från meningar för att extrahera företagsrelaterade enheter och relationer. Natural Language Processing Information Extraction Named Entity Recognition Relation Extraction Knowledge Graph Naturlig språkbehandling Informationsextraktion Namngiven Entitetsigenkänning Relationsextraktion Kunskapsgraf Computer and Information Sciences Data- och informationsvetenskap
73	Data Fusion and Text Mining for Supporting Journalistic Work Zsombor, Vermes January 2022 (has links) During the past several decades, journalists have been struggling with the ever growing amount of data on the internet. Investigating the validity of the sources or finding similar articles for a story can consume a lot of time and effort. These issues are even amplified by the declining size of the staff of news agencies. The solution is to empower the remaining professional journalists with digital tools created by computer scientists. This thesis project is inspired by an idea to provide software support for journalistic work with interactive visual interfaces and artificial intelligence. More specifically, within the scope of this thesis project, we created a backend module that supports several text mining methods such as keyword extraction, named entity recognition, sentiment analysis, fake news classification and also data collection from various data sources to help professionals in the field of journalism. To implement our system, first we gathered the requirements from several researchers and practitioners in journalism, media studies, and computer science, then acquired knowledge by reviewing literature on current approaches. Results are evaluated both with quantitative methods such as individual component benchmarks and also with qualitative methods by analyzing the outcomes of the semi-structured interviews with collaborating and external domain experts. Our results show that there is similarity between the domain experts' perceived value and the performance of the components on the individual evaluations. This shows us that there is potential in this research area and future work would be welcomed by the journalistic community. text mining data fusion algorithmic journalism computational journalism keyword extraction named entity recognition sentiment analysis fake news classification Computer and Information Sciences Data- och informationsvetenskap
74	[en] AN END-TO-END MODEL FOR JOINT ENTITY AND RELATION EXTRACTION IN PORTUGUESE / [pt] MODELO END-TO-END PARA EXTRAÇÃO DE ENTIDADES E RELAÇÕES DE FORMA CONJUNTA EM PORTUGUÊS LUCAS AGUIAR PAVANELLI 24 October 2022 (has links) [pt] As técnicas de processamento de linguagem natural (NLP) estão se tornando populares recentemente. A gama de aplicativos que se beneficiam de NLP é extensa, desde criar sistemas de tradução automática até ajudar no marketing de um produto. Dentro de NLP, o campo de Extração de Informações (IE) é difundido; concentra-se no processamento de textos para recuperar informações específicas sobre uma determinada entidade ou conceito. Ainda assim, a comunidade de pesquisa se concentra principalmente na construção de modelos para dados na língua inglesa. Esta tese aborda três tarefas no domínio do IE: Reconhecimento de Entidade Nomeada, Extração de Relações Semânticas e Extração Conjunta de Entidade e Relação. Primeiro, criamos um novo conjunto de dados em português no domínio biomédico, descrevemos o processo de anotação e medimos suas propriedades. Além disso, desenvolvemos um novo modelo para a tarefa de Extração Conjunta de Entidade e Relação, verificando que o mesmo é competitivo em comparação com outros modelos. Finalmente, avaliamos cuidadosamente os modelos propostos em textos de idiomas diferentes do inglês e confirmamos a dominância de modelos baseados em redes neurais. / [en] Natural language processing (NLP) techniques are becoming popular recently. The range of applications that benefit from NLP is extensive, from building machine translation systems to helping market a product. Within NLP, the Information Extraction (IE) field is widespread; it focuses on processing texts to retrieve specific information about a particular entity or concept. Still, the research community mainly focuses on building models for English data. This thesis addresses three tasks in the IE domain: Named Entity Recognition, Relation Extraction, and Joint Entity and Relation Extraction. First, we created a novel Portuguese dataset in the biomedical domain, described the annotation process, and measured its properties. Also, we developed a novel model for the Joint Entity and Relation Extraction task, verifying that it is competitive compared to other models. Finally, we carefully evaluated proposed models on non-English language datasets and confirmed the dominance of neural-based models. [pt] PROCESSAMENTO DE LINGUAGEM NATURAL [pt] EXTRACAO DE RELACOES SEMANTICAS [pt] APRENDIZAGEM PROFUNDA [en] NATURAL LANGUAGE PROCESSING [en] RELATION EXTRACTION [en] NAMED ENTITY RECOGNITION [en] DEEP LEARNING
75	Investigating Few-Shot Transfer Learning for Address Parsing : Fine-Tuning Multilingual Pre-Trained Language Models for Low-Resource Address Segmentation / En Undersökning av Överföringsinlärning för Adressavkodning med Få Exempel : Finjustering av För-Tränade Språkmodeller för Låg-Resurs Adress Segmentering Heimisdóttir, Hrafndís January 2022 (has links) Address parsing is the process of splitting an address string into its different address components, such as street name, street number, et cetera. Address parsing has been quite extensively researched and there exist some state-ofthe-art address parsing solutions, mostly unilingual. In more recent years research has emerged which focuses on multinational address parsing and deep architecture address parsers have been used to achieve state-of-the-art performance on multinational address data. However, training these deep architectures for address parsing requires a rather large amount of address data which is not always accessible. Generally within Natural Language Processing (NLP) data is difficult to come by and most of the NLP data available consists of data from about only 20 of the approximately 7000 languages spoken around the world, so-called high-resource languages. This also applies to address data, which can be difficult to come by for some of the so-called low-resource languages of the world for which little or no NLP data exists. To attempt to deal with the lack of address data availability for some of the less spoken languages of the world, the current project investigates the potential of FewShot Learning (FSL) for multinational address parsing. To investigate this, two few-shot transfer learning models are implemented, both implementations consist of a fine-tuned pre-trained language model (PTLM). The difference between the two models is the PTLM used, which were the multilingual language models mBERT and XLM-R, respectively. The two PTLMs are finetuned using a linear classifier layer to then be used as multinational address parsers. The two models are trained and their results are compared with a state-of-the-art multinational address parser, Deepparse, as well as with each other. Results show that the two models do not outperform Deepparse, but they do show promising results, not too far from what Deepparse achieves on holdout and zero-shot datasets. On a mix of low- and high-resource language address data, both models perform well and achieve over 96% on the overall F1-score. Out of the two models used for implementation, XLM-R achieves significantly better results than mBERT and can therefore be considered the more appropriate PTLM to use for multinational FSL address parsing. Based on these results the conclusion is that there is great potential for FSL within the field of multinational address parsing and that general FSL methods can be used and perform well on multinational address parsing tasks. / Adressavkodning är processen att dela upp en adresssträng i dess olika adresskomponenter såsom gatunamn, gatunummer, et cetera. Adressavkodning har undersökts ganska omfattande och det finns några toppmoderna adressavkodningslösningar, mestadels enspråkiga. Senaste åren har forskning fokuserad på multinationell adressavkodning börjat dyka upp och djupa arkitekturer för adressavkodning har använts för att uppnå toppmodern prestation på multinationell adressdata. Att träna dessa arkitekturer kräver dock en ganska stor mängd adressdata, vilket inte alltid är tillgängligt. Det är generellt svårt att få tag på data inom naturlig språkbehandling och majoriteten av den data som är tillgänglig består av data från endast 20 av de cirka 7000 språk som används runt om i världen, så kallade högresursspråk. Detta gäller även för adressdata, vilket kan vara svårt att få tag på för vissa av världens så kallade resurssnåla språk för vilka det finns lite eller ingen data för naturlig språkbehandling. För att försöka behandla denna brist på adressdata för några av världens mindre talade språk undersöker detta projekt om det finns någon potential för inlärning med få exempel för multinationell adressavkodning. För detta implementeras två modeller för överföringsinlärning med få exempel genom finjustering av förtränade språkmodeller. Skillnaden mellan de två modellerna är den förtränade språkmodellen som används, mBERT respektive XLM-R. Båda modellerna finjusteras med hjälp av ett linjärt klassificeringsskikt för att sedan användas som multinationella addressavkodare. De två modellerna tränas och deras resultat jämförs med en toppmodern multinationell adressavkodare, Deepparse. Resultaten visar att de två modellerna presterar båda sämre än Deepparse modellen, men de visar ändå lovande resultat, inte långt ifrån vad Deepparse uppnår för både holdout och zero-shot dataset. Vidare, så presterar båda modeller bra på en blandning av adressdata från låg- och högresursspråk och båda modeller uppnår över 96% övergripande F1-score. Av de två modellerna uppnår XLM-R betydligt bättre resultat än mBERT och kan därför anses vara en mer lämplig förtränad språkmodell att använda för multinationell inlärning med få exempel för addressavkodning. Utifrån dessa resultat dras slutsatsen att det finns stor potential för inlärning med få exempel inom området multinationall adressavkodning, samt att generella metoder för inlärning med få exempel kan användas och preseterar bra på multinationella adressavkodningsuppgifter. Address Parsing Address Segmentation Few-Shot Learning Transfer Learning Named Entity Recognition Adressavkodning Adress Segmentering Inlärning med Få Exempel Överföringsinlärning Computer and Information Sciences Data- och informationsvetenskap
76	Fine-tuning a BERT-based NER Model for Positive Energy Districts Ortega, Karen, Sun, Fei January 2023 (has links) This research presents an innovative approach to extracting information from Positive Energy Districts (PEDs), urban areas generating surplus energy. PEDs are integral to the European Commission's SET Plan, tackling housing challenges arising from population growth. The study refines BERT to categorize PED-related entities, producing a cutting-edge NER model and an integrated pipeline of diverse NER tools and data sources. The model achieves an accuracy of 0.81 and an F1 Score of 0.55 with notably high confidence scores through pipeline evaluations, confirming its practical applicability. While the F1 score falls short of expectations, this pioneering exploration in PED information extraction sets the stage for future refinements and studies, promising enhanced methodologies and impactful outcomes in this dynamic field. This research advances NER processes for Positive Energy Districts, supporting their development and implementation. Positive Energy District (PED) Named Entity Recognition (NER) Pipeline Fine-tune
77	Active Learning for Named Entity Recognition with Swedish Language Models / Aktiv Inlärning för Namnigenkänning med Svenska Språkmodeller Öhman, Joey January 2021 (has links) The recent advancements of Natural Language Processing have cleared the path for many new applications. This is primarily a consequence of the transformer model and the transfer-learning capabilities provided by models like BERT. However, task-specific labeled data is required to fine-tune these models. To alleviate the expensive process of labeling data, Active Learning (AL) aims to maximize the information gained from each label. By including a model in the annotation process, the informativeness of each unlabeled sample can be estimated and hence allow human annotators to focus on vital samples and avoid redundancy. This thesis investigates to what extent AL can accelerate model training with respect to the number of labels required. In particular, the focus is on pre- trained Swedish language models in the context of Named Entity Recognition. The data annotation process is simulated using existing labeled datasets to evaluate multiple AL strategies. Experiments are evaluated by analyzing the F1 score achieved by models trained on the data selected by each strategy. The results show that AL can significantly accelerate the model training and hence reduce the manual annotation effort. The state-of-the-art strategy for sentence classification, ALPS, shows no sign of accelerating the model training. However, uncertainty-based strategies consistently outperform random selection. Under certain conditions, these strategies can reduce the number of labels required by more than a factor of two. / Framstegen som nyligen har gjorts inom naturlig språkbehandling har möjliggjort många nya applikationer. Det är mestadels till följd av transformer-modellerna och lärandeöverföringsmöjligheterna som kommer med modeller som BERT. Däremot behövs det fortfarande uppgiftsspecifik annoterad data för att finjustera dessa modeller. För att lindra den dyra processen att annotera data, strävar aktiv inlärning efter att maximera informationen som utvinns i varje annotering. Genom att inkludera modellen i annoteringsprocessen, kan man estimera hur informationsrikt varje träningsexempel är, och på så sätt låta mänskilga annoterare fokusera på viktiga datapunkter. Detta examensarbete utforskar hur väl aktiv inlärning kan accelerera modellträningen med avseende på hur många annoterade träningsexempel som behövs. Fokus ligger på förtränade svenska språkmodeller och uppgiften namnigenkänning. Dataannoteringsprocessen simuleras med färdigannoterade dataset för att evaluera flera olika strategier för aktiv inlärning. Experimenten evalueras genom att analysera den uppnådda F1-poängen av modeller som är tränade på datapunkterna som varje strategi har valt. Resultaten visar att aktiv inlärning har en signifikant förmåga att accelerera modellträningen och reducera de manuella annoteringskostnaderna. Den toppmoderna strategin för meningsklassificering, ALPS, visar inget tecken på att kunna accelerera modellträningen. Däremot är osäkerhetsbaserade strategier är konsekvent bättre än att slumpmässigt välja datapunkter. I vissa förhållanden kan dessa strategier reducera antalet annoteringar med mer än en faktor 2. Active learning Named entity recognition Language models Natural language processing Bert Swedish Aktiv inlärning Namnigenkänning Språkmodeller Naturlig språkbehandling Bert Svenska Computer and Information Sciences Data- och informationsvetenskap
78	Bootstrapping Annotated Job Ads using Named Entity Recognition and Swedish Language Models / Identifiering av namngivna enheter i jobbannonser genom användning av semi-övervakade tekniker och svenska språkmodeller Nyqvist, Anna January 2021 (has links) Named entity recognition (NER) is a task that concerns detecting and categorising certain information in text. A promising approach for NER that recently has emerged is fine-tuning Transformer-based language models for this specific task. However, these models may require a relatively large quantity of labelled data to perform well. This can limit NER models applicability in real-world applications as manual annotation often is costly and time-consuming. In this thesis, we investigate the learning curve of human annotation and of a NER model during a semi-supervised bootstrapping process. Special emphasis is given the dependence of the number of classes and the amount of training data used in the process. We first annotate a set of collected job advertisements and then apply bootstrapping using both annotated and unannotated data and continuously fine-tune a pre-trained Swedish BERT model. The initial class system is simplified during the bootstrapping process according to model performance and inter-annotator agreement. The model performance increased as the training set grew larger with a final micro F1-score of 54%. This result provides a good baseline, and we point out several improvements that can be made to further enhance performance. We further identify classes handled differently by the annotators and potential factors as to why this is. Suggestions for future work include adjusting the current class system further by removing classes that were identified as low-performing in this thesis. / Namngiven entitetsigenkänning (eng. named entity recognition) innebär att identifiera och kategorisera nyckelord i text. En ny lovande teknik för identifiering av namngivna enheter är att finjustera Transformerbaserade språkmodeller för denna specifika uppgift. Dessa modeller kräver dock stora mängder märkt data för att prestera väl. Detta kan begränsa antal områden i vilka de kan användas då manuell märkning av data ofta är kostsamt och tidskrävande. I denna avhandling undersöker vi inlärningskurvan för manuell annotering och för en språkmodell under en halvövervakad bootstrapping process. Särskild vikt läggs på hur modellens och annoterarnas inlärning påverkas av antal klasser och mängden träningsdata som används i processen. Vi annoterar först en samling jobbannonser och tillämpar sedan en bootstrapping process med både märkt och omärkt data i vilken en förtränad svensk BERT-modell kontinuerligt finjusteras. Det första klasssystemet förenklas under processens gång beroende på modellprestation och interannoterar-överenskommelse. Modellen presterade bättre med mer träningsdata och uppnådde en slutlig micro F1-score på 54%. Detta resultat ger en bra baslinje, och vi föreslår flera förbättringar som kan göras för att ytterligare förbättra modellprestationen. Vidare identifierar vi även klasser som hanteras olika av annoterare och potentiella faktorer till vad detta beror på. Förslag för framtida arbete inkluderar att justera det nuvarande klasssystemet ytterligare genom att ta bort klasser som identifierades som lågpresterande i denna avhandling. labour market named entity recognition natural language processing BERT data annotation inter-annotator agreement arbetsmarknad namngiven entitetsigenkänning naturlig språkbehandling BERT dataannotering interannoterar-överenskommelse Computer and Information Sciences Data- och informationsvetenskap
79	Adaptive Semantic Annotation of Entity and Concept Mentions in Text Mendes, Pablo N. 05 June 2014 (has links) No description available. Computer Science semantic annotation semantic tagging named entity recognition name resolution entity disambiguation entity linking keyphrase extraction word sense disambiguation entity classification entity extraction adaptive flexible
80	Improving a Few-shot Named Entity Recognition Model Using Data Augmentation / Förbättring av en existerande försöksmodell för namnidentifiering med få exempel genom databerikande åtgärder Mellin, David January 2022 (has links) To label words of interest into a predefined set of named entities have traditionally required a large amount of labeled in-domain data. Recently, the availability of pre-trained transformer-based language models have enabled multiple natural language processing problems to utilize transfer learning techniques to construct machine learning models with less task-specific labeled data. In this thesis, the impact of data augmentation when training a pre-trained transformer-based model to adapt to a named entity recognition task with few labeled sentences is explored. The experimental results indicate that data augmentation increases performance of the trained models, however the data augmentation is shown to have less impact when more labeled data is available. In conclusion, data augmentation has been shown to improve performance of pre-trained named entity recognition models when few labeled sentences are available for training. / Att kategorisera ord som tillhör någon av en mängd förangivna entiteter har traditionellt krävt stora mängder förkategoriserad områdesspecifik data. På senare år har det tillgängliggjorts förtränade språkmodeller som möjliggjort för språkprocesseringsproblem att lösas med en mindre mängd områdesspecifik kategoriserad data. I den här uppsatsen utforskas datautöknings påverkan på en maskininlärningsmodell för identifiering av namngivna entiteter. De experimentella resultaten indikerar att datautökning förbättrar modellerna, men att inverkan blir mindre när mer kategoriserad data är tillgänglig. Sammanfattningsvis så kan datautökning förbättra modeller för identifiering av namngivna entiteter när få förkategoriserade meningar finns tillgängliga för träning. Named Entity Recognition Data Augmentation Self-training BERT Few-shot Learning Identifiering av namngivna entiteter Datautökning Självträning BERT Fåförsöksinlärning Computer Sciences Datavetenskap (datalogi)

Search results