1 |
Unsupervised Information Extraction From Text - Extraction and Clustering of Relations between EntitiesWang, Wei 16 May 2013 (has links) (PDF)
Unsupervised information extraction in open domain gains more and more importance recently by loosening the constraints on the strict definition of the extracted information and allowing to design more open information extraction systems. In this new domain of unsupervised information extraction, this thesis focuses on the tasks of extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover unknown relations from texts. A relation prototype is first defined, with which candidates of relation instances are initially extracted with a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedures is applied: the first step with filtering heuristics to remove efficiently large amount of false relations and the second step with statistical models to refine the relation candidate selection. The objective of relation clustering is to organize extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end-users. A multi-level clustering procedure is design, which allows to take into account the massive data and diverse linguistic phenomena at the same time. First, the basic clustering groups similar relation instances by their linguistic expressions using only simple similarity measures on a bag-of-word representation for relation instances to form high-homogeneous basic clusters. Second, the semantic clustering aims at grouping basic clusters whose relation instances share the same semantic meaning, dealing with more particularly phenomena such as synonymy or more complex paraphrase. Different similarities measures, either based on resources such as WordNet or distributional thesaurus, at the level of words, relation instances and basic clusters are analyzed. Moreover, a topic-based relation clustering is proposed to consider thematic information in relation clustering so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building reference of relation clusters proposed. The application of this method on a newspaper corpus results in a large reference, based on which different clustering methods are evaluated.
|
2 |
Unsupervised Information Extraction From Text - Extraction and Clustering of Relations between EntitiesWang, Wei 16 May 2013 (has links) (PDF)
Unsupervised information extraction in open domain gains more and more importance recently by loosening the constraints on the strict definition of the extracted information and allowing to design more open information extraction systems. In this new domain of unsupervised information extraction, this thesis focuses on the tasks of extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover unknown relations from texts. A relation prototype is first defined, with which candidates of relation instances are initially extracted with a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedures is applied: the first step with filtering heuristics to remove efficiently large amount of false relations and the second step with statistical models to refine the relation candidate selection. The objective of relation clustering is to organize extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end-users. A multi-level clustering procedure is design, which allows to take into account the massive data and diverse linguistic phenomena at the same time. First, the basic clustering groups similar relation instances by their linguistic expressions using only simple similarity measures on a bag-of-word representation for relation instances to form high-homogeneous basic clusters. Second, the semantic clustering aims at grouping basic clusters whose relation instances share the same semantic meaning, dealing with more particularly phenomena such as synonymy or more complex paraphrase. Different similarities measures, either based on resources such as WordNet or distributional thesaurus, at the level of words, relation instances and basic clusters are analyzed. Moreover, a topic-based relation clustering is proposed to consider thematic information in relation clustering so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building reference of relation clusters proposed. The application of this method on a newspaper corpus results in a large reference, based on which different clustering methods are evaluated.
|
3 |
Unsupervised Information Extraction From Text – Extraction and Clustering of Relations between Entities / Extraction d'Information Non Supervisée à Partir de Textes – Extraction et Regroupement de Relations entre EntitésWang, Wei 16 May 2013 (has links)
L'extraction d'information non supervisée en domaine ouvert est une évolution récente de l'extraction d'information adaptée à des contextes dans lesquels le besoin informationnel est faiblement spécifié. Dans ce cadre, la thèse se concentre plus particulièrement sur l'extraction et le regroupement de relations entre entités en se donnant la possibilité de traiter des volumes importants de données.L'extraction de relations se fixe plus précisément pour objectif de faire émerger des relations de type non prédéfini à partir de textes. Ces relations sont de nature semi-structurée : elles associent des éléments faisant référence à des structures de connaissance définies a priori, dans le cas présent les entités qu’elles relient, et des éléments donnés uniquement sous la forme d’une caractérisation linguistique, en l’occurrence leur type. Leur extraction est réalisée en deux temps : des relations candidates sont d'abord extraites sur la base de critères simples mais efficaces pour être ensuite filtrées selon des critères plus avancés. Ce filtrage associe lui-même deux étapes : une première étape utilise des heuristiques pour éliminer rapidement les fausses relations en conservant un bon rappel tandis qu'une seconde étape se fonde sur des modèles statistiques pour raffiner la sélection des relations candidates.Le regroupement de relations a quant à lui un double objectif : d’une part, organiser les relations extraites pour en caractériser le type au travers du regroupement des relations sémantiquement équivalentes et d’autre part, en offrir une vue synthétique. Il est réalisé dans le cas présent selon une stratégie multiniveau permettant de prendre en compte à la fois un volume important de relations et des critères de regroupement élaborés. Un premier niveau de regroupement, dit de base, réunit des relations proches par leur expression linguistique grâce à une mesure de similarité vectorielle appliquée à une représentation de type « sac-de-mots » pour former des clusters fortement homogènes. Un second niveau de regroupement est ensuite appliqué pour traiter des phénomènes plus sémantiques tels que la synonymie et la paraphrase et fusionner des clusters de base recouvrant des relations équivalentes sur le plan sémantique. Ce second niveau s'appuie sur la définition de mesures de similarité au niveau des mots, des relations et des clusters de relations en exploitant soit des ressources de type WordNet, soit des thésaurus distributionnels. Enfin, le travail illustre l’intérêt de la mise en œuvre d’un clustering des relations opéré selon une dimension thématique, en complément de la dimension sémantique des regroupements évoqués précédemment. Ce clustering est réalisé de façon indirecte au travers du regroupement des contextes thématiques textuels des relations. Il offre à la fois un axe supplémentaire de structuration des relations facilitant leur appréhension globale mais également le moyen d’invalider certains regroupements sémantiques fondés sur des termes polysémiques utilisés avec des sens différents. La thèse aborde également le problème de l'évaluation de l'extraction d'information non supervisée par l'entremise de mesures internes et externes. Pour les mesures externes, une méthode interactive est proposée pour construire manuellement un large ensemble de clusters de référence. Son application sur un corpus journalistique de grande taille a donné lieu à la construction d'une référence vis-à-vis de laquelle les différentes méthodes de regroupement proposées dans la thèse ont été évaluées. / Unsupervised information extraction in open domain gains more and more importance recently by loosening the constraints on the strict definition of the extracted information and allowing to design more open information extraction systems. In this new domain of unsupervised information extraction, this thesis focuses on the tasks of extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover unknown relations from texts. A relation prototype is first defined, with which candidates of relation instances are initially extracted with a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedures is applied: the first step with filtering heuristics to remove efficiently large amount of false relations and the second step with statistical models to refine the relation candidate selection. The objective of relation clustering is to organize extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end-users. A multi-level clustering procedure is design, which allows to take into account the massive data and diverse linguistic phenomena at the same time. First, the basic clustering groups similar relation instances by their linguistic expressions using only simple similarity measures on a bag-of-word representation for relation instances to form high-homogeneous basic clusters. Second, the semantic clustering aims at grouping basic clusters whose relation instances share the same semantic meaning, dealing with more particularly phenomena such as synonymy or more complex paraphrase. Different similarities measures, either based on resources such as WordNet or distributional thesaurus, at the level of words, relation instances and basic clusters are analyzed. Moreover, a topic-based relation clustering is proposed to consider thematic information in relation clustering so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building reference of relation clusters proposed. The application of this method on a newspaper corpus results in a large reference, based on which different clustering methods are evaluated.
|
4 |
Towards deep content extraction from specialized discourse : the case of verbal relations in patent claimsFerraro, Gabriela 20 July 2012 (has links)
This thesis addresses the problem of the development of Natural Language
Processing techniques for the extraction and generalization of compositional
and functional relations from specialized written texts and, in particular, from
patent claims. One of the most demanding tasks tackled in the thesis is,
according to the state of the art, the semantic generalization of linguistic
denominations of relations between object components and processes
described in the texts. These denominations are usually verbal expressions or
nominalizations that are too concrete to be used as standard labels in
knowledge representation forms -as, for example, “A leads to B”, and “C
provokes D”, where “leads to” and “provokes” both express, in abstract
terms, a cause, such that in both cases “A CAUSE B” and “C CAUSE D”
would be more appropriate. A semantic generalization of the relations allows
us to achieve a higher degree of abstraction of the relationships between
objects and processes described in the claims and reduces their number to a
limited set that is oriented towards relations as commonly used in the generic
field of knowledge representation. / Esta tesis se centra en el del desarrollo de tecnologías del Procesamiento del
Lenguage Natural para la extracción y generalización de relaciones
encontradas en textos especializados; concretamente en las reivindicaciones
de patentes. Una de las tareas más demandadas de nuestro trabajo, desde el
punto vista del estado de la cuestión, es la generalización de las
denominaciones lingüísticas de las relaciones. Estas denominaciones,
usualmente verbos, son demasiado concretas para ser usadas como etiquetas
de relaciones en el contexto de la representación del conocimiento; por
ejemplo, “A lleva a B”, “B es el resultado de A” están mejor representadas
por “A causa B”. La generalización de relaciones permite reducir el n\'umero
de relaciones a un conjunto limitado, orientado al tipo de relaciones utilizadas
en el campo de la representación del conocimiento.
|
Page generated in 0.1381 seconds