Global ETD Search

21	Extracting Causal Relations between News Topics from Distributed Sources
Miranda Ackerman, Eduardo Jacobo 08 November 2013
The overwhelming amount of online news presents a challenge called news information overload. To mitigate this challenge we propose a system that generates a causal network of news topics. To extract this information from distributed news sources, a system called Forest was developed. Forest retrieves documents that potentially contain causal information regarding a news topic. The documents are processed at the sentence level to extract causal relations and news topic references, which are the phrases used to refer to a news topic, such as “The World Cup” or “The Financial Meltdown”. Forest uses a machine learning approach to classify causal sentences, and then identifies the potential cause and effect in each sentence. The potential cause and effect are then classified as news topic references. Both classifiers use an algorithm developed within our working group; the algorithm performs better than several well-known classification algorithms on these tasks. In our evaluations we found that participants consider causal information useful for understanding the news, and that, while we cannot extract causal information for all news topics, it is highly likely that we can extract causal relations for the most popular news topics. To evaluate the accuracy of the extractions made by Forest, we conducted a user survey. We found that by providing only the top-ranked results, we obtained high accuracy in extracting causal relations between news topics.
Classification: info:eu-repo/classification/ddc/004; ddc:004
22	Mobility Knowledge Graph and its Application in Public Transport
Zhang, Qi January 2023
Efficient public transport planning, operations, and control rely on a deep understanding of human mobility in urban areas. The availability of extensive and diverse mobility data sources, such as smart card data and GPS data, provides opportunities to quantitatively study individual behavior and collective mobility patterns. However, analyzing and organizing these vast amounts of data is a challenging task. The Knowledge Graph (KG) is a graph-based method for knowledge representation and organization that has been successfully applied in various domains, yet applications of KG in urban mobility are still limited. To further utilize mobility data and explore human mobility patterns, the included papers construct the Mobility Knowledge Graph (MKG), a general learning framework, and demonstrate its potential applications in public transport. Paper I introduces the concept of the MKG and proposes a learning framework to construct it from smart card data in public transport networks. The framework captures the spatiotemporal travel pattern correlations between stations using both rule-based linear decomposition and neural network-based nonlinear decomposition methods. The paper validates the MKG construction framework and explores the value of the MKG in predicting individual trip destinations using only tap-in records. Paper II proposes an application of user-station attention estimation to understand human mobility in urban areas, which facilitates downstream applications such as individual mobility prediction and location recommendation. To estimate the 'real' user-station attention from station visit count data, the paper proposes a matrix decomposition method that captures both user similarity and station-station relations using the MKG. A neural network-based nonlinear decomposition approach is used to extract MKG relations capturing the latent spatiotemporal travel dependencies. The proposed framework is validated using synthetic and real-world data, demonstrating its significant value for user-station attention inference.
Keywords: Knowledge graph; Smart card data; Public transport; User-station attention; Relation extraction; Transport Systems and Logistics
23	Mining relations from the biomedical literature
Hakenberg, Jörg 05 February 2010
Text mining deals with the automated annotation of texts and the extraction of facts from textual data for subsequent analysis. Such texts range from short articles and abstracts to large documents, for instance web pages and scientific articles, but also include textual descriptions in otherwise structured databases. This thesis focuses on two key problems in biomedical text mining: relationship extraction from biomedical abstracts, in particular protein-protein interactions, and a prerequisite step, named entity recognition, again focusing on proteins. The thesis presents goals, challenges, and typical approaches for each of the main building blocks in biomedical text mining. We present our own approaches for named entity recognition of proteins and relationship extraction of protein-protein interactions. For the first, we describe two methods, one set up as a classification task, the other based on dictionary matching. For relationship extraction, we develop a methodology to automatically annotate large amounts of unlabeled data for relations, and make use of such annotations in a pattern matching strategy. This strategy first computes similarities between sentences that describe the same type of relation, storing them as consensus patterns, which can then be applied to new text. We develop a sentence alignment approach that introduces multi-layer alignment, making use of multiple annotations per word. For the task of extracting protein-protein interactions, empirical results on known benchmarks show that our fully automatic methodology performs comparably to existing approaches that require a large amount of human intervention, either for annotation of data or creation of models.
Keywords: Pattern recognition; Text mining; Bioinformatics; Relation extraction
Classification: 004 Computer science; 28 Computer science, data processing; ddc:004
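The dictionary-matching approach to protein name recognition mentioned in entry 23 can be illustrated with a minimal sketch. It is not the thesis's implementation: the dictionary contents, the tokenizer, and the greedy longest-match strategy are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary (an assumption); a real system would load a
# large curated lexicon of protein names and synonyms.
PROTEIN_DICTIONARY = {"p53", "mdm2", "brca1", "tumor necrosis factor"}
MAX_TERM_TOKENS = max(len(term.split()) for term in PROTEIN_DICTIONARY)


def tokenize(text):
    """Split text into alphanumeric tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def find_protein_mentions(text):
    """Greedy longest-match lookup of dictionary terms in the text."""
    tokens = tokenize(text)
    mentions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for length in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            span = tokens[i:i + length]
            candidate = " ".join(tok.lower() for tok, _, _ in span)
            if candidate in PROTEIN_DICTIONARY:
                start, end = span[0][1], span[-1][2]
                mentions.append((text[start:end], start, end))
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return mentions


if __name__ == "__main__":
    print(find_protein_mentions("MDM2 binds p53."))
```

Each mention is returned with its character offsets so that a later relation extraction step could locate the entity pair inside the sentence.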
24	Formalizing biomedical concepts from textual definitions
Petrova, Alina; Ma, Yue; Tsatsaronis, George; Kissa, Maria; Distel, Felix; Baader, Franz; Schroeder, Michael 07 January 2016
BACKGROUND: Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. SNOMED CT is a large medical ontology that is formally defined so that it ensures global consistency and support of complex reasoning tasks. Most biomedical ontologies and taxonomies, on the other hand, define concepts only textually, without the use of logic. Here, we investigate how to automatically generate formal concept definitions from textual ones. We develop a method that uses machine learning in combination with several types of lexical and semantic features and outputs formal definitions that follow the structure of SNOMED CT concept definitions. RESULTS: We evaluate our method on three benchmarks and test both the underlying relation extraction component and the overall quality of the output concept definitions. In addition, we provide an analysis of the following aspects: (1) How do definitions mined from the Web and literature differ from the ones mined from manually created definitions, e.g., MeSH? (2) How do different feature representations, e.g., the restrictions of relations' domain and range, affect the quality of the generated definitions? (3) How do different machine learning algorithms compare to each other for the task of formal definition generation? (4) What is the influence of the learning data size on the task? We discuss all of these settings in detail and show that the suggested approach can achieve success rates of over 90%. In addition, the results show that the choice of corpora, lexical features, learning algorithm, and data size does not affect performance as strongly as semantic types do. Semantic types limit the domain and range of a predicted relation, and as long as relations' domain and range pairs do not overlap, this information is most valuable in formalizing textual definitions. CONCLUSIONS: The analysis presented in this manuscript implies that automated methods can provide a valuable contribution to the formalization of biomedical knowledge, thus paving the way for future applications that go beyond retrieval and into complex reasoning. The method is implemented and accessible to the public from: https://github.com/alifahsyamsiyah/learningDL.
Keywords: Biomedical ontologies; Formal definitions; MeSH; Relation extraction; SNOMED CT; Technical University Dresden Publication funds
Classification: ddc:610; rvk:XA 10000
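The observation in entry 24 that semantic types limit the domain and range of a predicted relation can be sketched as a filter over classifier outputs. This is an illustrative sketch only: the relation names are modeled on SNOMED CT attributes, but the type labels, the signature table, and the scores are simplified assumptions, not the paper's data or code.

```python
# Hypothetical relation signatures (an assumption):
# relation -> (allowed domain types, allowed range types).
RELATION_SIGNATURES = {
    "finding_site": ({"Disorder"}, {"BodyStructure"}),
    "causative_agent": ({"Disorder"}, {"Organism", "Substance"}),
    "associated_morphology": ({"Disorder"}, {"MorphologicAbnormality"}),
}


def admissible_relations(subject_type, object_type):
    """Relations whose domain and range accept the given semantic types."""
    return sorted(
        relation
        for relation, (domain, range_) in RELATION_SIGNATURES.items()
        if subject_type in domain and object_type in range_
    )


def filter_predictions(scored_relations, subject_type, object_type):
    """Drop classifier outputs that violate the type constraints, best first."""
    allowed = set(admissible_relations(subject_type, object_type))
    kept = [(rel, score) for rel, score in scored_relations if rel in allowed]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical classifier scores for one (concept, filler) pair.
    scores = [("causative_agent", 0.6), ("finding_site", 0.3)]
    print(filter_predictions(scores, "Disorder", "BodyStructure"))
```

When the domain and range pairs of the relations do not overlap, as in this toy table, the semantic types alone pick out a single admissible relation, which is consistent with the abstract's remark about when this information is most valuable.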
25	[en] DISTANT SUPERVISION FOR RELATION EXTRACTION USING ONTOLOGY CLASS HIERARCHY-BASED FEATURES / [pt] SUPERVISÃO À DISTÂNCIA EM EXTRAÇÃO DE RELACIONAMENTOS USANDO CARACTERÍSTICAS BASEADAS EM HIERARQUIA DE CLASSES EM ONTOLOGIAS
PEDRO HENRIQUE RIBEIRO DE ASSIS 18 March 2015
Relation extraction is a key step in the problem of identifying structure in natural language text. In general, structures are composed of entities and the relationships among them. The most successful approaches to relation extraction apply supervised machine learning to hand-labeled corpora to create highly accurate classifiers. Although they achieve good robustness, hand-labeled corpora do not scale because they are expensive to create. In this work we apply an alternative paradigm for creating a considerable number of example instances for classification. This method is called distant supervision. Along with this alternative approach, we adopt Semantic Web ontologies to propose and use new features for training classifiers. Those features are based on the structure and semantics described by the ontologies in which Semantic Web resources are defined. The use of such features has a great impact on the precision and recall of our final classifiers. We apply our approach to a corpus extracted from Wikipedia and achieve high precision and recall for a considerable number of relations.
Keywords: MACHINE LEARNING; SEMANTIC WEB; RELATION EXTRACTION; DISTANT SUPERVISION; NATURAL LANGUAGE PROCESSING
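The combination described in entry 25, distant supervision plus features drawn from an ontology class hierarchy, can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the triples, the class hierarchy, the substring-based entity matching, and the feature names are hypothetical and are not taken from the thesis.

```python
# Hypothetical knowledge-base facts (an assumption): (subject, relation, object).
KB_TRIPLES = [
    ("Barack Obama", "birthPlace", "Honolulu"),
    ("Rio de Janeiro", "country", "Brazil"),
]

# Hypothetical ontology fragment: entity -> most specific class,
# and class -> parent class.
ENTITY_CLASS = {
    "Barack Obama": "Politician",
    "Honolulu": "City",
    "Rio de Janeiro": "City",
    "Brazil": "Country",
}
PARENT_CLASS = {
    "Politician": "Person",
    "Person": "Agent",
    "City": "Place",
    "Country": "Place",
}


def class_ancestors(entity):
    """Return the entity's class followed by all of its superclasses."""
    chain = []
    cls = ENTITY_CLASS.get(entity)
    while cls is not None:
        chain.append(cls)
        cls = PARENT_CLASS.get(cls)
    return chain


def label_sentences(sentences):
    """Label a sentence with a relation when it mentions both KB arguments."""
    examples = []
    for sentence in sentences:
        for subj, relation, obj in KB_TRIPLES:
            # Distant supervision heuristic: co-occurrence of both entities
            # is taken as (noisy) evidence that the sentence expresses the fact.
            if subj in sentence and obj in sentence:
                features = {
                    "subject_classes": class_ancestors(subj),
                    "object_classes": class_ancestors(obj),
                }
                examples.append((sentence, subj, obj, relation, features))
    return examples


if __name__ == "__main__":
    corpus = ["Barack Obama was born in Honolulu.", "Brazil won the match."]
    for example in label_sentences(corpus):
        print(example)
```

The labels produced this way are noisy by construction, since a sentence can mention both entities without expressing the relation; the class-chain features give a classifier type information at several levels of generality to work with.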
26	Extraction de relations en domaine de spécialité / Relation extraction in specialized domains
Minard, Anne-Lyse 07 December 2012
The amount of available scientific literature is constantly growing. For domain experts to access this information easily, it must be extracted and structured. To obtain structured data, the relations between entities in the texts must be detected. Our research addresses the extraction of complex relations that represent experimental results, and the detection and classification of binary relations between biomedical entities. We are interested in the experimental results presented in scientific papers. We define an experimental result as a quantitative result obtained from an experiment and linked with the information that describes that experiment. These results are important for biology experts, for example for modeling. In the domain of renal physiology, a database was created to centralize these experimental results, but it is populated manually, which takes a long time. We propose a solution to automatically extract from scientific papers the knowledge relevant to the database, that is, experimental results, which we represent as an n-ary relation. The method proceeds in two steps: automatic extraction from documents, and presentation of the extracted information to the expert for approval or modification via an interface. We also propose a machine learning method for the extraction and classification of binary relations in specialized domains. We focus on the characteristics and variety of the ways relations are expressed, and on how to account for them in a machine learning system. We study how to take into account the syntactic structure of the sentence, and sentence simplification guided by the relation extraction task. In particular, we developed a machine learning-based simplification method that uses a cascade of classifiers.
Keywords: Relation extraction; Binary relation; N-ary relation; Biomedical domain; SVM; Syntactic information; Sentence simplification
27	Construção automática de redes bayesianas para extração de interações proteína-proteína a partir de textos biomédicos / Learning Bayesian networks for extraction of protein-protein interaction from biomedical articles
Juárez, Pedro Nelson Shiguihara 20 June 2013
Extracting Protein-Protein Interactions (PPIs) from text is a relevant problem in the biomedical field and a challenge for machine learning. In the biomedical field, PPIs are fundamental to understanding how living organisms function. However, the number of articles related to PPIs is increasing rapidly, so it is impractical to identify and catalog them manually. For example, only 10% of human PPIs have been cataloged. In machine learning, kernel-based methods are often employed to extract PPIs automatically, achieving state-of-the-art results. These methods use lexical, syntactic, or semantic information as features. However, the results are still insufficient, with relatively low F-measure scores, owing to the complexity of the problem. Despite efforts to produce increasingly sophisticated kernels using syntactic trees such as constituent or dependency trees, little is known about the performance of other machine learning approaches, e.g., Bayesian networks. Constituent trees are graph structures that contain important information about the grammar underlying sentences that contain PPIs. A Bayesian network, in turn, can model some rules of that grammar and assign them a probability distribution according to the training sentences. In this master's thesis we propose a method for the automatic construction of Bayesian networks from constituent trees for extracting PPIs. The method was tested on five benchmark PPI extraction corpora, achieving competitive results, and in some cases better results, compared with state-of-the-art methods.
Keywords: Bayesian networks; Information extraction; Machine learning; Protein-protein interaction extraction; Relation extraction
28	Desenvolvimento de método para consulta em linguagem natural de componentes de software / Development of method for natural language research of software components
Domingues, Paulo Eduardo 28 June 2007
Component-based development makes it possible to create interoperable components with well-defined interfaces, reducing the complexity of software development. In this setting, the software component library plays an important role in a corporate environment, supporting the documentation, specification, storage, and retrieval of components. Within organizations, a component library provides infrastructure for managing the component lifecycle. This work proposes storing and retrieving software components through a natural language interface. It describes a method for generating a representation, to be stored in the library, of the texts that describe the characteristics of the components in the library. The text of the query written by the user is represented in a similar form, so that the descriptions of the library's components can be compared with the user's question. In addition, a method is presented for determining the similarity between parts of the representations of the characteristics text and the query text, so that the components that best satisfy the user's query are returned in decreasing order of priority.
Keywords: natural language; relation extraction; components library; software reuse
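The retrieval step in entry 28, comparing a representation of the user's query with representations of the component descriptions and returning components in decreasing order of similarity, can be sketched as below. The abstract does not specify the representation or the similarity measure, so this sketch swaps in a plain bag-of-words vector and cosine similarity as stand-ins; the component names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Stand-in representation (an assumption): lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_components(query, descriptions):
    """Return (component, score) pairs, most similar to the query first."""
    query_vector = bag_of_words(query)
    scored = [
        (name, cosine_similarity(query_vector, bag_of_words(text)))
        for name, text in descriptions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical component library entries.
    library = {
        "PdfExporter": "exports reports to pdf files",
        "AuthService": "authenticates users with passwords",
    }
    for name, score in rank_components("component that exports pdf reports", library):
        print(f"{name}: {score:.3f}")
```

A representation built from extracted relations, as the entry's keywords suggest, could replace `bag_of_words` without changing the ranking logic.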
29	Formalizing biomedical concepts from textual definitions
Tsatsaronis, George; Ma, Yue; Petrova, Alina; Kissa, Maria; Distel, Felix; Baader, Franz; Schroeder, Michael 04 January 2016
Background: Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. SNOMED CT is a large medical ontology that is formally defined so that it ensures global consistency and support of complex reasoning tasks. Most biomedical ontologies and taxonomies, on the other hand, define concepts only textually, without the use of logic. Here, we investigate how to automatically generate formal concept definitions from textual ones. We develop a method that uses machine learning in combination with several types of lexical and semantic features and outputs formal definitions that follow the structure of SNOMED CT concept definitions. Results: We evaluate our method on three benchmarks and test both the underlying relation extraction component and the overall quality of the output concept definitions. In addition, we provide an analysis of the following aspects: (1) How do definitions mined from the Web and literature differ from the ones mined from manually created definitions, e.g., MeSH? (2) How do different feature representations, e.g., the restrictions of relations’ domain and range, affect the quality of the generated definitions? (3) How do different machine learning algorithms compare to each other for the task of formal definition generation? (4) What is the influence of the learning data size on the task? We discuss all of these settings in detail and show that the suggested approach can achieve success rates of over 90%. In addition, the results show that the choice of corpora, lexical features, learning algorithm, and data size does not affect performance as strongly as semantic types do. Semantic types limit the domain and range of a predicted relation, and as long as relations’ domain and range pairs do not overlap, this information is most valuable in formalizing textual definitions. Conclusions: The analysis presented in this manuscript implies that automated methods can provide a valuable contribution to the formalization of biomedical knowledge, thus paving the way for future applications that go beyond retrieval and into complex reasoning. The method is implemented and accessible to the public from: https://github.com/alifahsyamsiyah/learningDL.
Keywords: Formal definitions; Biomedical ontologies; Relation extraction; SNOMED CT; MeSH; Technical University Dresden Publication funds
Classification: ddc:570; rvk:WH 3100
30	Apprentissage non supervisé de dépendances à partir de textes / Unsupervised dependency parsing from texts
Arcadias, Marie 02 October 2015
Dependency grammars make it possible to build a hierarchical syntactic organization of the words of a sentence. Building dependency trees by hand takes time and expert knowledge, so much work seeks to automate it. Aiming at a lightweight and easily adaptable process, we focus on unsupervised dependency learning, which avoids the need for costly expertise. The state of the art in unsupervised dependency parsing (DMV) consists of very complex methods that are highly sensitive to initial parameters; training a DMV model is also heavy and slow. We present in this thesis a new model that solves this dependency parsing problem in a simpler, faster, and more adaptable way. We learn a family of PCFGs restricted to fewer than 6 nonterminal symbols and fewer than 15 combination rules over nonterminals, starting from part-of-speech tags. The PCFGs of this family, which we name DGdg (for DROITE GAUCHE droite gauche, i.e., RIGHT LEFT right left), require very little tuning, so they adapt easily to the 12 languages we tested. Training and parsing are at least twice as fast as with DMV on the same data, and for some languages the quality of DGdg parses is close to that of DMV parses. We propose a first application of our dependency parsing method to information extraction: using conditional random fields (CRFs), we learn to label the functions “subject”, “object”, and “predicate” based on features extracted from the constructed trees.
Keywords: Unsupervised machine learning; Dependency grammar; Context-free grammar; CKY; Inside-Outside; CRF; Relation extraction
Classification: 006.35
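Entry 30 lists CKY among its keywords and describes parsing with small PCFGs over part-of-speech tags. As a rough illustration of that parsing step, here is a minimal Viterbi CKY sketch. The grammar below is a hypothetical toy in Chomsky normal form with made-up probabilities; it is not the DGdg family from the thesis, and the tag names are assumptions.

```python
import math

# Hypothetical toy PCFG (an assumption), binary rules only:
# (left child, right child) -> list of (parent, probability).
BINARY_RULES = {
    ("DET", "NOUN"): [("NP", 0.6)],
    ("VERB", "NP"): [("VP", 0.7)],
    ("NP", "VP"): [("S", 0.9)],
}


def cky_parse(tags, start="S"):
    """Return the most probable tree over the tag sequence, or None."""
    n = len(tags)
    # chart[i][j] maps a symbol to (log-probability, backpointer) for tags[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):
        chart[i][i + 1][tag] = (0.0, None)  # each tag spans one position
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for left, (left_score, _) in chart[i][k].items():
                    for right, (right_score, _) in chart[k][j].items():
                        for parent, prob in BINARY_RULES.get((left, right), []):
                            score = left_score + right_score + math.log(prob)
                            best = chart[i][j].get(parent)
                            if best is None or score > best[0]:
                                chart[i][j][parent] = (score, (k, left, right))
    if start not in chart[0][n]:
        return None
    return build_tree(chart, 0, n, start)


def build_tree(chart, i, j, symbol):
    """Follow backpointers to rebuild the tree as nested tuples."""
    _, back = chart[i][j][symbol]
    if back is None:
        return symbol
    k, left, right = back
    return (symbol, build_tree(chart, i, k, left), build_tree(chart, k, j, right))


if __name__ == "__main__":
    print(cky_parse(["DET", "NOUN", "VERB", "DET", "NOUN"]))
```

The chart has O(n²) cells and each cell considers O(n) split points, so parsing cost grows with the number of rules; a grammar with only a handful of nonterminals and rules keeps that factor small, which fits the abstract's emphasis on speed.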

Search results