1 |
N-ary Cross-sentence Relation Extraction: From Supervised to Unsupervised Learning
Yuan, Chenhan 19 May 2021
Relation extraction is the problem of extracting relations between entities described in the text. Relations identify a common "fact" described by distinct entities. Conventional relation extraction approaches focus on supervised binary intra-sentence relations, where the assumption is that relations only exist between two entities within the same sentence. These approaches have three key limitations. First, binary intra-sentence relation extraction methods cannot extract a relation in a fact that is described by more than two entities. Second, these methods cannot extract relations that span more than one sentence, which commonly occurs as the number of entities increases. Third, these methods assume a supervised setting and are therefore not able to extract relations in the absence of sufficient labeled data for training. This work aims to overcome these limitations by developing n-ary cross-sentence relation extraction methods for both supervised and unsupervised settings. Our work has three main goals: (1) two unsupervised binary intra-sentence relation extraction methods, (2) a supervised n-ary cross-sentence relation extraction method, and (3) an unsupervised n-ary cross-sentence relation extraction method. To achieve these goals, our work includes the following contributions: (1) an automatic labeling method for n-ary cross-sentence data, which is essential for model training, (2) a reinforcement learning-based sentence distribution estimator to minimize the impact of noise on model training, (3) a generative clustering-based technique for intra-sentence unsupervised relation extraction, (4) a variational autoencoder-based technique for unsupervised n-ary cross-sentence relation extraction, and (5) a sentence group selector that identifies groups of sentences that form relations. / Master of Science / In this work, we designed multiple models to automatically extract relations from text. 
These relations represent the semantic connection between two or more proper nouns. Previous work includes models that can only extract relations between two proper nouns in a single sentence, while the methods proposed in this thesis can extract relations between two or more proper nouns in multiple sentences. We propose three models. The first model can automatically remove erroneous annotations in training data, thereby making the models more credible. We also propose a more effective model that can automatically extract relations between two proper nouns in a single sentence without the need for data annotation. We later extend this model so that it can extract relations between two or more proper nouns in multiple sentences.
|
2 |
Towards generic relation extraction
Hachey, Benjamin January 2009
A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database that can be more effectively used for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA) – a probabilistic generative approach – successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD on all experimental settings. Finally, I introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. 
Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces data sets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
|
3 |
Attribution : a computational approach
Pareti, Silvia January 2015
Our society is overwhelmed with an ever-growing amount of information. Effective management of this information requires novel ways to filter and select the most relevant pieces of information. Some of this information can be associated with the source or sources expressing it. Sources and their relation to what they express affect information and whether we perceive it as relevant, biased or truthful. In news texts in particular, it is common practice to report third-party statements and opinions. Recognizing relations of attribution is therefore a necessary step toward detecting statements and opinions of specific sources and selecting and evaluating information on the basis of its source. The automatic identification of Attribution Relations has applications in numerous research areas. Quotation and opinion extraction, discourse and factuality have all partly addressed the annotation and identification of Attribution Relations. However, disjoint efforts have provided a partial and partly inaccurate picture of attribution. Moreover, these research efforts have generated small or incomplete resources, thus limiting the applicability of machine learning approaches. Existing approaches to extract Attribution Relations have focused on rule-based models, which are limited both in coverage and precision. This thesis presents a computational approach to attribution that recasts attribution extraction as the identification of the attributed text, its source and the lexical cue linking them in a relation. Drawing on preliminary data-driven investigation, I present a comprehensive lexicalised approach to attribution and further refine and test a previously defined annotation scheme. The scheme has been used to create a corpus annotated with Attribution Relations, with the goal of contributing a large and complete resource that can lay the foundations for future attribution studies. 
Based on this resource, I developed a system for the automatic extraction of attribution relations that surpasses traditional syntactic pattern-based approaches. The system is a pipeline of classification and sequence labelling models that identify and link each of the components of an attribution relation. The results show concrete opportunities for attribution-based applications.
|
4 |
Boosting Supervised Neural Relation Extraction with Distant Supervision
Dhyani, Dushyanta, Dhyani 24 August 2018
No description available.
|
5 |
Populating the Semantic Web : combining text and relational databases as RDF graphs
Byrne, Kate January 2009
The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built. Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives. The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible. Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. 
Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates. These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.
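The path-length idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the triples, node names and the function `shortest_path_length` are hypothetical, and edges are treated as undirected for the purpose of measuring relatedness, which is an assumption made here.

```python
from collections import deque


def shortest_path_length(triples, start, goal):
    """Return the number of edges on the shortest path between two nodes.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Predicates are ignored, so the graph can be explored even when the
    edge labels are unknown. Returns None if no path exists.
    """
    # Build an undirected adjacency map (assumption: direction is ignored).
    adjacency = {}
    for subject, _predicate, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)

    if start not in adjacency or goal not in adjacency:
        return None

    # Breadth-first search visits nodes in order of increasing distance.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


# Hypothetical cultural-heritage triples, invented for illustration only.
TRIPLES = [
    ("site1", "hasType", "castle"),
    ("site2", "hasType", "castle"),
    ("site2", "locatedIn", "parishA"),
]
```

A shorter path suggests closer relatedness: here `site1` and `site2` are two edges apart because they share a type, while `site1` reaches `parishA` only through `site2`.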
|
6 |
Extracting Temporally-Anchored Spatial Knowledge
Vempala, Alakananda 05 1900
In my dissertation, I elaborate on the work that I have done to extract temporally-anchored spatial knowledge from text, including both intra- and inter-sentential knowledge. I also detail multiple approaches to infer the spatial timeline of a person from biographies and social media. I present and analyze two strategies to annotate information regarding whether a given entity is or is not located at some location, and for how long with respect to an event. Specifically, I leverage semantic roles or syntactic dependencies to generate potential spatial knowledge and then crowdsource annotations to validate the potential knowledge. The resulting annotations indicate how long entities are or are not located somewhere, and temporally anchor this spatial information. I present an in-depth corpus analysis and experiments comparing the spatial knowledge generated by manipulating roles or dependencies. In my work, I also explore research methodologies that go beyond single sentences and extract spatio-temporal information from text. Spatial timelines refer to a chronological order of locations where a target person is or is not located. I present a corpus and experiments to extract spatial timelines from Wikipedia biographies. I present my work on determining locations and the order in which they are actually visited by a person from their travel experiences. Specifically, I extract spatio-temporal graphs that capture the order (edges) of locations (nodes) visited by a person. Further, I detail my experiments that leverage both text and images to extract the spatial timeline of a person from Twitter.
|
7 |
Unsupervised relation extraction for e-learning applications
Afzal, Naveed January 2011
In this modern era many educational institutes and business organisations are adopting the e-Learning approach as it provides an effective method for educating and testing their students and staff. The continuous development in the area of information technology and increasing use of the internet has resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs. Information Extraction (IE) is an NLP field used to recognise the most important entities present in a text, and the relations between those concepts, regardless of their surface realisations. In IE, text is processed at a semantic level that allows the partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them in order to ensure broad coverage, unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities while in the dependency-based approach we explored how dependency relations based on dependency trees can be helpful in extracting relations between named entities. 
Our findings indicate that the presented approaches are capable of achieving high precision rates. Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better. Our research enabled us to automatically generate questions regarding the important concepts present in a domain by relying on unsupervised relation extraction approaches, as extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are automatically generated from sentences matched by the extracted surface-based semantic pattern which relies on a certain set of rules. Conversely, in the dependency-based approach, questions are automatically generated by traversing the dependency tree of each extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of question and distractor readability, usefulness of semantic relations, relevance, acceptability of questions and distractors, and overall MCQ usability. The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users.
|
8 |
Approches supervisées et faiblement supervisées pour l’extraction d’événements et le peuplement de bases de connaissances / Supervised and weakly-supervised approaches for complex-event extraction and knowledge base population
Jean-Louis, Ludovic 15 December 2011
The major part of the information available on the web is provided in textual form, i.e. in unstructured form. In a context such as technology watch, it is useful to present the information extracted from a text in a structured form, reporting only the pieces of information that are relevant to the considered field of interest. Such processing cannot be performed manually at large scale, given the large amount of data available. The automated processing of this task falls within the Information Extraction (IE) domain. The purpose of IE is to identify, within documents, pieces of information related to facts (or events) in order to store this information in predefined data structures. These structures, called templates, aggregate fact properties, often represented by named entities, concerning an event or an area of interest. In this context, the research performed in this thesis addresses two problems: (1) identifying information related to a specific event, when the information is scattered across a text and several events of the same type are mentioned in the text; (2) reducing the dependence on annotated corpora for the implementation of an Information Extraction system. Concerning the first problem, we propose an original approach that relies on two steps. The first step performs an event-based text segmentation that relies on temporal information to identify, within a document, the text segments on which the IE process shall focus to look for the entities associated with a given event. The second step focuses on template filling and aims at selecting, within the segments identified as relevant by the event-based segmentation, the entities that should be used as fillers, using a graph-based method. This method is based on a local extraction of relations between entities, which are merged into a relation graph. A disambiguation step is then performed on the graph to identify the best candidates to fill the information template. The second problem is treated in the context of knowledge base (KB) population, using a large collection of texts (several million documents) from which the information is extracted. This extraction also concerns a large number of relation types (more than 40), which makes the manual annotation of the collection too expensive. We propose, in this context, a distant supervision approach in order to use learning techniques for this extraction, without the need for a fully annotated corpus. 
This distant supervision approach uses a set of relations from an existing KB to perform an unsupervised annotation of a collection, from which we learn a model for relation extraction. This approach has been evaluated at a large scale on the data from the TAC-KBP 2010 evaluation campaign.
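The distant supervision idea described here can be sketched as follows. This is a minimal illustration and not the system evaluated in the thesis: the knowledge base, the sentences, the entity names and the function `distant_label` are all invented, and entity matching is reduced to case-insensitive substring search, which a real system would replace with named entity recognition and linking.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One knowledge-base fact: head entity, relation type, tail entity."""
    head: str
    relation: str
    tail: str


def distant_label(sentences: List[str], kb: List[Triple]) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both arguments of a KB triple.

    Returns (sentence, head, tail, relation) tuples. The labels are noisy:
    a sentence may mention both entities without expressing the relation.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for triple in kb:
            if triple.head.lower() in lowered and triple.tail.lower() in lowered:
                labeled.append((sentence, triple.head, triple.tail, triple.relation))
    return labeled


# Hypothetical knowledge base and corpus, invented for illustration only.
KB = [Triple("Alice Example", "works_for", "Acme Corp")]
SENTENCES = [
    "Alice Example joined Acme Corp as an engineer.",
    "Alice Example criticised Acme Corp in an interview.",  # noisy match
    "Bob Sample founded Initech.",
]
LABELED = distant_label(SENTENCES, KB)
```

The second sentence receives the `works_for` label even though it does not express that relation, which shows why automatically annotated corpora of this kind are noisy training data.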
|
9 |
Genealogy Extraction and Tree Generation from Free Form Text
Chu, Timothy Sui-Tim 01 December 2017
Genealogical records play a crucial role in helping people to discover their lineage and to understand where they come from. They provide a way for people to celebrate their heritage and to possibly reconnect with family they had never considered. However, genealogical records are hard to come by for ordinary people since their information is not always well established in known databases. There often is free form text that describes a person’s life, but this must be manually read in order to extract the relevant genealogical information. In addition, multiple texts may have to be read in order to create an extensive tree. This thesis proposes a novel three part system which can automatically interpret free form text to extract relationships and produce a family tree compliant with GEDCOM formatting. The first subsystem builds an extendable database of genealogical records that are systematically extracted from free form text. This corpus provides the tagged data for the second subsystem, which trains a Naïve Bayes classifier to predict relationships from free form text by examining the types of relationships for pairs of entities and their associated feature vectors. The last subsystem accumulates extracted relationships into family trees. When a multiclass Naïve Bayes classifier is used, the proposed system achieves an accuracy of 54%. When binary Naïve Bayes classifiers are used, the proposed system achieves accuracies of 69% for the child to parent relationship classifier, 75% for the spousal relationship classifier, and 73% for the sibling relationship classifier.
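As a rough illustration of the classification step, the sketch below trains a multiclass Naïve Bayes classifier with add-one smoothing on the words found between two person mentions. The feature choice, the training examples, the label names and the functions `train_naive_bayes` and `predict` are assumptions made for this example; the abstract does not specify the thesis's actual feature vectors.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label tokens from (tokens, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-posterior for the tokens."""
    label_counts, token_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior plus add-one smoothed log likelihood of each known token.
        score = math.log(count / total_examples)
        denominator = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:
                score += math.log((token_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: words between two person mentions.
TRAIN = [
    (["son", "of"], "child_parent"),
    (["daughter", "of"], "child_parent"),
    (["married", "to"], "spouse"),
    (["wife", "of"], "spouse"),
    (["brother", "of"], "sibling"),
    (["sister", "of"], "sibling"),
]
MODEL = train_naive_bayes(TRAIN)
```

A binary variant of the kind the abstract reports would train one such model per relationship type, with the labels collapsed to that type versus everything else.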
|
10 |
Exploring Data Extraction and Relation Identification Using Machine Learning : Utilizing Machine-Learning Techniques to Extract Relevant Information from Climate Reports
Berger, William, Fooladi, Alex, Lindgren, Markus, Messo, Michel, Rosengren, Jonas, Rådmann, Lukas January 2023
Ensuring the accessibility of data from Swedish municipal climate reports is necessary for examining climate work in Sweden. Manual data extraction is time-consuming and prone to errors, necessitating automation of the process. This project presents machine-learning techniques that can be used to extract data and information from Swedish municipal climate plans, to improve the accessibility of climate data. The proposed solution involves recognizing entities in plain text and extracting predefined relations between these using Named Entity Recognition and Relation Extraction, respectively. The result of the project is a functioning prototype in the medical domain due to the lack of annotated climate datasets in Swedish. Nevertheless, the problem remained the same: how to effectively perform data extraction from reports using machine learning techniques. The presented prototype demonstrates the potential of automating data extraction from reports. These findings imply that the system could be adapted to handle climate reports when a sufficient dataset becomes available.
|