1

Découverte et réconciliation de données numériques relatives aux personnes pour la gestion des ressources humaines / Digital Identity Discovery and Reconciliation for Human Resources Management

Ghufran, Mohammad 27 November 2017 (has links)
Finding the appropriate individual to hire is a crucial part of any organization. With the number of applications increasing due to the introduction of online job portals, it is desirable to automatically match applicants with job offers. Existing approaches that match applicants with job offers take resumes as they are and do not attempt to complete the information on a resume by looking for more information on the Internet, notably on the social Web. The objective of this thesis is to fill this gap by discovering online resources pertinent to an applicant. To this end, a novel method for extracting key information from resumes is proposed. This is a challenging task, since resumes can be multilingual, have diverse structures and formats, and contain ambiguous entities. Identifying Web results using the key information and reconciling them is another challenge. We propose an algorithm to generate queries and rank the results in order to obtain the online resources most pertinent to a job seeker. In addition, we specifically tackle the reconciliation of social network profiles through a method that is able to identify profiles of the same individual across different networks, notably by exploiting the geographic location information in the profiles. Moreover, we propose an algorithm to disambiguate the toponyms used in profiles to indicate a location; this algorithm can also be used to infer an individual's location when it is not provided. Experiments on real data sets are conducted for all the algorithms proposed in this thesis and show good results.
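As a rough illustration of the query-generation-and-ranking step described above (not the thesis's actual algorithm), the Python sketch below builds search queries from key CV fields and ranks candidate web results by their overlap with that key information; the field names, scoring, and data are invented.

```python
# Illustrative sketch (not the thesis's actual algorithm): build search queries from
# key CV fields and rank candidate web results by overlap with that key information.
from dataclasses import dataclass

@dataclass
class WebResult:
    url: str
    title: str
    snippet: str

def generate_queries(cv_info: dict) -> list[str]:
    """Combine the applicant's name with discriminating CV fields."""
    name = cv_info["name"]
    queries = [name]
    for field in ("employer", "university", "location"):
        if cv_info.get(field):
            queries.append(f'"{name}" {cv_info[field]}')
    return queries

def score(result: WebResult, cv_info: dict) -> float:
    """Fraction of CV key tokens that appear in the result's title or snippet."""
    text = f"{result.title} {result.snippet}".lower()
    tokens = {t.lower() for v in cv_info.values() for t in str(v).split()}
    return sum(t in text for t in tokens) / max(len(tokens), 1)

def rank(results: list[WebResult], cv_info: dict) -> list[WebResult]:
    return sorted(results, key=lambda r: score(r, cv_info), reverse=True)

if __name__ == "__main__":
    cv = {"name": "Jane Doe", "employer": "Acme Corp", "university": "Université Paris-Saclay"}
    hits = [WebResult("https://example.org/a", "Jane Doe - Acme Corp", "Profile of Jane Doe, engineer at Acme Corp"),
            WebResult("https://example.org/b", "Jane Doe", "Unrelated homonym")]
    for q in generate_queries(cv):
        print("query:", q)
    print([r.url for r in rank(hits, cv)])
```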
2

Rozpoznávání a propojování pojmenovaných entit / Named Entity Recognition and Linking

Taufer, Pavel January 2017 (has links)
The goal of this master thesis is to design and implement a named entity recognition and linking algorithm. A part of this goal is to propose and create a knowledge base that will be used in the algorithm. Because of the limited amount of data for languages other than English, we want to be able to train our method on one language and then transfer the learned parameters to other languages that do not have enough training data. The thesis consists of a description of available knowledge bases and existing methods, and the design and implementation of our own knowledge base and entity linking method. Our method achieves state-of-the-art results on a few variants of the AIDA CoNLL-YAGO dataset. The method also obtains comparable results on a sample of Czech annotated data from the PDT dataset using the parameters trained on the English CoNLL dataset.
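As background for the kind of knowledge base such a linker relies on, here is a minimal sketch (assumed data, not the thesis's system) of an alias table mapping surface forms to candidate entities with prior probabilities, the usual first step before a disambiguation model ranks the candidates.

```python
# Minimal alias-table sketch: surface forms -> candidate entities with link priors.
from collections import defaultdict

class AliasTable:
    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def add(self, mention: str, entity_id: str, count: int = 1):
        self._counts[mention.lower()][entity_id] += count

    def candidates(self, mention: str):
        """Return (entity_id, prior) pairs sorted by how often the alias links to them."""
        counts = self._counts.get(mention.lower(), {})
        total = sum(counts.values()) or 1
        return sorted(((e, c / total) for e, c in counts.items()),
                      key=lambda x: x[1], reverse=True)

table = AliasTable()
table.add("Prague", "Q1085", 950)      # Wikidata-style IDs, counts for illustration only
table.add("Prague", "Q1842506", 50)    # a less frequent homonym
print(table.candidates("prague"))      # [('Q1085', 0.95), ('Q1842506', 0.05)]
```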
3

Advanced Methods for Entity Linking in the Life Sciences

Christen, Victor 25 January 2021 (has links)
The amount of knowledge increases rapidly due to the growing number of available data sources. However, the autonomy of data sources and the resulting heterogeneity prevent comprehensive data analysis and applications. Data integration aims to overcome this heterogeneity by unifying different data sources and enriching unstructured data. The enrichment of data consists of different subtasks, among them the annotation process, which links document phrases to terms of a standardized vocabulary. Annotated documents enable effective retrieval methods, comparability of different documents, and comprehensive data analysis, such as finding adverse drug effects based on patient data. A vocabulary enables comparability through standardized terms. An ontology can also serve as a vocabulary, with concepts, relationships, and logical constraints defining it further. The annotation process is applicable in different domains. Nevertheless, generic and specialized domains differ with respect to the annotation process. This thesis emphasizes these differences and addresses the identified challenges. The majority of annotation approaches focus on evaluation in general domains, such as Wikipedia. This thesis evaluates the developed annotation approaches on case report forms, the medical documents used to examine clinical trials. Natural language poses various challenges, such as expressing similar meanings with different phrases. The proposed annotation method, AnnoMap, accounts for this fuzziness of natural language. A further challenge is the reuse of verified annotations: existing annotations represent knowledge that can be reused in further annotation processes. AnnoMap includes a reuse strategy that utilizes verified annotations to link new documents to appropriate concepts. Due to the broad spectrum of areas in the biomedical domain, different annotation tools exist, and they perform differently depending on the particular domain. This thesis proposes a combination approach to unify results from different tools. The method utilizes existing tool results to build a classification model that classifies new annotations as correct or incorrect. The results show that the reuse strategy and the machine learning-based combination improve annotation quality compared to existing approaches focusing on the biomedical domain. A further part of data integration is entity resolution, which builds unified knowledge bases from different data sources. A data source consists of a set of records characterized by attributes, and the goal of entity resolution is to identify records representing the same real-world entity. Many methods focus on linking data sources whose records are characterized by attributes. Nevertheless, only a few methods can handle graph-structured knowledge bases or consider temporal aspects. Temporal aspects are essential for identifying the same entities over different time intervals, since they are subject to certain conditions. Moreover, records can be related to other records, so that a small graph structure exists for each record; these small graphs can be linked to each other if they represent the same entity. This thesis proposes an entity resolution approach for census data consisting of person records from different time intervals. The approach also considers the graph structure of persons given by family relationships.
To achieve high-quality results, current methods apply machine-learning techniques to classify record pairs as referring to the same entity or not. The classification relies on a model generated from training data; in this case, the training data is a set of record pairs labeled as duplicates or not. Nevertheless, generating training data is time-consuming, so active learning techniques are relevant for reducing the number of training examples. The entity resolution method for temporal graph-structured data shows an improvement compared to previous collective entity resolution approaches. The developed active learning approach achieves results comparable to supervised learning methods and outperforms other limited-budget active learning methods. Besides the entity resolution approach, the thesis introduces the concept of evolution operators for communities. These operators express the dynamics of communities and individuals; for instance, they can express that two communities merged or split over time, and they allow observing the history of individuals. Overall, the presented annotation approaches generate high-quality annotations for medical forms. The annotations enable comprehensive analysis across different data sources as well as accurate queries. The proposed entity resolution approaches improve on existing ones and thereby contribute to the generation of high-quality knowledge graphs and to data analysis tasks.
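The annotation step described above links free-text phrases to vocabulary concepts despite spelling variation. As a rough illustration of that idea only (not AnnoMap's actual matching or reuse strategy), the following sketch scores phrase-term similarity with difflib and keeps matches above a threshold; the concept IDs are UMLS-style placeholders.

```python
# Sketch of fuzzy phrase-to-vocabulary matching; scoring is plain difflib similarity.
from difflib import SequenceMatcher

vocabulary = {                      # illustrative UMLS-style concept IDs and preferred terms
    "C0020538": "hypertension",
    "C0011849": "diabetes mellitus",
    "C0004096": "asthma",
}

def annotate(phrase: str, threshold: float = 0.75):
    """Link a document phrase to every vocabulary concept whose term is similar enough."""
    phrase = phrase.lower()
    matches = []
    for concept_id, term in vocabulary.items():
        sim = SequenceMatcher(None, phrase, term).ratio()
        if sim >= threshold:
            matches.append((concept_id, term, round(sim, 2)))
    return sorted(matches, key=lambda m: m[2], reverse=True)

print(annotate("diabetes melitus"))     # misspelled form still links to C0011849
print(annotate("high blood pressure"))  # no lexical match -> empty list; synonyms or annotation reuse needed
```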
4

Exploring Emerging Entities and Named Entity Disambiguation in News Articles / Utforskande av Framväxande Entiteter och Disambiguering av Entiteter i Nyhetsartiklar

Ellgren, Robin January 2020 (has links)
Publicly editable knowledge bases such as Wikipedia and Wikidata have grown tremendously in size over the years. Despite this quick growth, they can never be fully complete due to the continuous stream of events happening in the world. The task of Entity Linking attempts to link mentions of objects in a document to their corresponding entries in a knowledge base. However, due to the incompleteness of knowledge bases, new or emerging entities cannot be linked. Attempts to solve this issue have created the field referred to as Emerging Entities. Recent state-of-the-art work has addressed the issue with promising results in English. In this thesis, that previous work is examined by evaluating its method in the context of a much smaller language: Swedish. The results reveal an expected drop in overall performance, although the method remains relatively competitive. This indicates that it is a feasible approach to the problem of Emerging Entities even for much less widely used languages. Due to limitations in the scope of the related work, this thesis also suggests a method for evaluating how accurately Emerging Entities are modeled in a knowledge base. The study also provides a comprehensive look into the landscape of Emerging Entities and suggests further improvements.
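As a simplified illustration of how emerging entities can be modeled (not the evaluated system), the sketch below links a mention to its best knowledge-base candidate and, when the score falls below a threshold, registers a new entity so that later mentions can link to it; the scoring and identifiers are invented.

```python
# Simplified emerging-entity handling: low-confidence mentions become new KB entries.
class EmergingEntityLinker:
    def __init__(self, kb: dict[str, str], threshold: float = 0.7):
        self.kb = dict(kb)            # entity_id -> canonical name
        self.threshold = threshold
        self._next_new = 1

    def _score(self, mention: str, name: str) -> float:
        m, n = set(mention.lower().split()), set(name.lower().split())
        return len(m & n) / len(m | n) if m | n else 0.0

    def link(self, mention: str) -> str:
        best_id, best = None, 0.0
        for entity_id, name in self.kb.items():
            s = self._score(mention, name)
            if s > best:
                best_id, best = entity_id, s
        if best >= self.threshold:
            return best_id
        new_id = f"EMERGING:{self._next_new}"     # model the unknown entity explicitly
        self._next_new += 1
        self.kb[new_id] = mention
        return new_id

linker = EmergingEntityLinker({"Q2": "Stockholm", "Q3": "Göteborg"})
print(linker.link("Stockholm"))       # existing entity -> Q2
print(linker.link("Nya Karolinska"))  # unseen -> EMERGING:1, added to the KB
print(linker.link("Nya Karolinska"))  # second mention now links to EMERGING:1
```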
5

Extracting Salient Named Entities from Financial News Articles / Extrahering av centrala entiteter från finansiella nyhetsartiklar

Grönberg, David January 2021 (has links)
This thesis explores approaches for extracting company mentions that carry a central role in financial news articles. It introduces the task of salient named entity extraction (SNEE): extract all salient named entity mentions in a text document. A neural sequence labeling approach is explored to address the SNEE task in an end-to-end fashion, using both a single-task and a multi-task learning setup. In order to train the models, a new procedure for automatically creating SNEE annotations for an existing news article corpus is explored. The neural sequence labeling approaches are compared against a two-stage approach utilizing NLP parsers, a knowledge base and a salience classifier. Textual features inspired by related work in salient entity detection are evaluated to determine which combination of features yields the highest performance on the SNEE task when used by a salience classifier. The experiments show that the difference in performance between the two-stage approach and the best-performing sequence labeling approach is marginal, demonstrating the potential of the end-to-end sequence labeling approach on the SNEE task.
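As a minimal illustration of the two-stage baseline (not the thesis's models or features), the sketch below computes a few hand-crafted salience features per candidate entity and trains a binary classifier with scikit-learn; the features and toy data are invented.

```python
# Toy salience classifier: hand-crafted features per candidate entity, binary label.
from sklearn.linear_model import LogisticRegression

def salience_features(doc: str, entity: str) -> list[float]:
    text = doc.lower()
    ent = entity.lower()
    first_pos = text.find(ent)
    return [
        float(text.count(ent)),                                           # mention frequency
        1.0 - first_pos / max(len(text), 1) if first_pos >= 0 else 0.0,   # earliness of first mention
        float(ent in text.split("\n", 1)[0]),                             # appears in the headline
    ]

docs = [
    ("Acme Corp posts record profit\nAcme Corp said on Monday that Acme Corp profit rose.", "Acme Corp", 1),
    ("Acme Corp posts record profit\nAnalysts at Beta Bank commented briefly.", "Beta Bank", 0),
]
X = [salience_features(d, e) for d, e, _ in docs]
y = [label for _, _, label in docs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([salience_features(docs[0][0], "Acme Corp")]))  # expected: [1]
```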
6

Concept-based and relation-based corpus navigation : applications of natural language processing in digital humanities / Navigation en corpus fondée sur les concepts et les relations : applications du traitement automatique des langues aux humanités numériques

Ruiz Fabo, Pablo 23 June 2017 (has links)
Social sciences and humanities research is often based on large textual corpora that would be unfeasible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. Such information can provide an overview of the corpus useful for domain experts and help identify corpus areas relevant to a given research question. To automatically annotate corpora relevant for Digital Humanities (DH), the NLP technologies we applied are, first, Entity Linking, to identify corpus actors and concepts, and second, an NLP pipeline providing semantic role labeling and syntactic dependencies, among other analyses, to determine the relations between those actors and concepts. Part I outlines the state of the art, paying attention to how these technologies have been applied in DH. Generic NLP tools were used. As the efficacy of NLP methods depends on the corpus, some technological development was undertaken, described in Part II, in order to better adapt the methods to the corpora in our case studies. Part II also presents an intrinsic evaluation of the technology developed, with satisfactory results. The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, an 18th-19th century corpus in political philosophy. Second, the PoliInformatics corpus, with heterogeneous materials about the American financial crisis of 2007-2008. Finally, the Earth Negotiations Bulletin (ENB), which covers international climate summits since 1995, where treaties like the Kyoto Protocol or the Paris Agreements were negotiated. For each corpus, navigation interfaces were developed. These user interfaces (UIs) combine networks, full-text search, and structured search based on NLP annotations. As an example, in the ENB corpus interface, which covers climate policy negotiations, searches can be performed based on relational information identified in the corpus: the negotiation actors having discussed a given issue using verbs indicating support or opposition can be searched, as well as all statements where a given actor has expressed support or opposition. Relational information is thus exploited beyond simple co-occurrence between corpus terms. The UIs were evaluated qualitatively with domain experts to assess their potential usefulness for research in the experts' domains. First, we paid attention to whether the corpus representations we created correspond to the experts' knowledge of the corpus, as an indication of the sanity of the outputs we produced. Second, we tried to determine whether experts could gain new insight on the corpus by using the applications, e.g. whether they found evidence previously unknown to them or new research ideas. Examples of insight gain were attested with the ENB interface; this constitutes a good validation of the work carried out in the thesis. Overall, the applications' strengths and weaknesses were pointed out, outlining possible improvements as future work.
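To make the relation-based structured search concrete, here is a toy sketch: statements are stored as (actor, stance, issue) tuples produced upstream by an NLP pipeline, and the interface filters on them rather than on term co-occurrence; the data and field names are invented.

```python
# Toy structured search over extracted statements instead of raw term co-occurrence.
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    actor: str
    stance: str      # "support" or "oppose"
    issue: str
    sentence: str

STATEMENTS = [
    Statement("EU", "support", "emissions trading", "The EU supported extending emissions trading."),
    Statement("Saudi Arabia", "oppose", "emissions trading", "Saudi Arabia opposed the mechanism."),
    Statement("EU", "oppose", "fossil fuel subsidies", "The EU opposed continued subsidies."),
]

def search(actor=None, stance=None, issue=None):
    return [s for s in STATEMENTS
            if (actor is None or s.actor == actor)
            and (stance is None or s.stance == stance)
            and (issue is None or issue in s.issue)]

for s in search(issue="emissions trading", stance="support"):
    print(f"{s.actor} ({s.stance}): {s.sentence}")
```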
7

La gestion des données d'autorité archivistiques dans le cadre du Web de données

Chardonnens, Anne 15 December 2020 (has links) (PDF)
The subject of this thesis is the management of authority records for persons. The research was conducted in an archival context in transition, marked by the evolution of international standards of archival description and a shift towards the application of knowledge graphs. The aim of this thesis is to explore how the archival sector can benefit from developments concerning Linked Data in order to ensure the sustainable management of authority records. Attention is devoted not only to the creation of the records and how they are made available, but also to their maintenance and their interlinking with other resources. The first part of this thesis addresses the state of the art of the developments concerning international standards of archival description as well as those regarding the Wikibase ecosystem. The second part presents an analysis of the possibilities and limits associated with an approach in which the free software Wikibase is used. The analysis is based on an empirical study carried out with data of the Study and Documentation Centre War and Contemporary Society (CegeSoma). It explores the options that are available to institutions that have limited resources and that have not yet implemented Linked Data.
Datasets containing information on people linked to the Second World War were used to examine the different stages involved in the publication of data as Linked Open Data. The experiment carried out in the second part of the thesis shows how a knowledge base driven by software such as Wikibase streamlines the creation of multilingual structured authority data. Examples illustrate how these entities can then be reused and enriched using external data in interfaces aimed at the general public. This thesis highlights the possibilities of Wikibase, particularly in the context of data maintenance, without ignoring the limitations associated with its use. Due to its empirical nature and the recommendations it formulates, this thesis contributes to the efforts and reflections carried out within the framework of the transition of archival metadata. / Doctorate in Information and Communication / info:eu-repo/semantics/nonPublished
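As an illustration of the kind of enrichment from external Linked Data sources mentioned above, the sketch below queries Wikidata's public SPARQL endpoint for a person by VIAF identifier; the properties, query, and VIAF ID are illustrative, and the thesis's own Wikibase instance would be configured differently.

```python
# Sketch: enrich a local authority record with external data from the Wikidata SPARQL endpoint.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def enrich_from_viaf(viaf_id: str) -> list[dict]:
    query = f"""
    SELECT ?person ?personLabel ?birth WHERE {{
      ?person wdt:P214 "{viaf_id}" .             # P214 = VIAF identifier on Wikidata
      OPTIONAL {{ ?person wdt:P569 ?birth . }}   # P569 = date of birth
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,fr,nl". }}
    }}
    """
    resp = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                        headers={"User-Agent": "authority-enrichment-sketch/0.1"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

for row in enrich_from_viaf("102333412"):   # example VIAF ID, purely illustrative
    print(row.get("personLabel", {}).get("value"), row.get("birth", {}).get("value"))
```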
8

Entity Information Extraction using Structured and Semi-structured resources

Sil, Avirup January 2014 (has links)
Among all the tasks that exist in Information Extraction, Entity Linking, also referred to as entity disambiguation or entity resolution, is a new and important problem which has recently caught the attention of many researchers in the Natural Language Processing (NLP) community. The task involves linking/matching a textual mention of a named entity (like a person or a movie name) to an appropriate entry in a database (e.g. Wikipedia or IMDB). If the database does not contain the entity, the system should return a NIL (out-of-database) value. Existing techniques for linking named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy. Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions. In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL provides a 60% reduction in error over the next-best NER system and a 68% reduction in error over the next-best EL system. We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope of relations between named entities; for example, the relation president-Of(John F. Kennedy, USA) is true only in the time frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF, which is trained with distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. The proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next-best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric. / Computer and Information Science
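As a toy illustration of the open-database setting described above (not the dissertation's Open-DB EL system), the sketch below resolves a mention against an arbitrary SQLite table by string similarity and returns NIL when no row is similar enough; the table, data, and threshold are invented.

```python
# Toy open-database entity linking: match a mention to a row in an arbitrary table, or NIL.
import sqlite3
from difflib import SequenceMatcher

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO restaurants (name, city) VALUES (?, ?)",
                 [("Joe's Pizza", "New York"), ("The French Laundry", "Yountville")])

def link(mention: str, threshold: float = 0.8):
    best_id, best_score = "NIL", 0.0
    for row_id, name, _ in conn.execute("SELECT id, name, city FROM restaurants"):
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = row_id, score
    return best_id if best_score >= threshold else "NIL"

print(link("Joes Pizza"))        # links to the first row despite the missing apostrophe
print(link("Burger Palace"))     # -> NIL (out-of-database)
```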
9

Pojmenované entity a ontologie metodami hlubokého učení / Named Entities and Ontologies Using Deep Learning Methods

Rafaj, Filip January 2021 (has links)
In this master thesis we describe a method for linking named entities in a given text to a knowledge base: Named Entity Linking. Using a deep neural architecture together with BERT contextualized word embeddings, we created a semi-supervised model that jointly performs Named Entity Recognition and Named Entity Disambiguation. The model outputs a Wikipedia ID for each entity detected in an input text. To compute contextualized word embeddings we used pre-trained BERT without making any changes to it (no fine-tuning). We experimented with components of our model and various versions of BERT embeddings, and tested several different ways of using the contextual embeddings. Our model is evaluated using standard metrics and surpasses the scores of models that defined the state of the art before the rise of pre-trained contextualized models. The scores of our model are comparable to those of current state-of-the-art models.
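To make the frozen-BERT setup concrete, here is a small sketch of extracting a contextual embedding for a mention span with the Hugging Face transformers library; the model name, mean pooling, and span lookup are illustrative choices, not the thesis's exact configuration.

```python
# Sketch: frozen BERT contextual embedding for a mention span (no fine-tuning).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def mention_embedding(text: str, mention: str) -> torch.Tensor:
    """Mean-pool the hidden states of the word-piece tokens covering the mention."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]          # the model itself does not take offsets
    with torch.no_grad():                           # frozen BERT, no fine-tuning
        hidden = model(**enc).last_hidden_state[0]  # (num_tokens, hidden_size)
    start = text.index(mention)
    end = start + len(mention)
    idx = [i for i, (s, e) in enumerate(offsets.tolist()) if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0)

vec = mention_embedding("Prague is the capital of the Czech Republic.", "Prague")
print(vec.shape)   # torch.Size([768])
```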
10

Adaptive Semantic Annotation of Entity and Concept Mentions in Text

Mendes, Pablo N. 05 June 2014 (has links)
No description available.
