1 |
Deriving Semantic Objects from the Structured Web (Inférer des Objects Sémantiques du Web Structuré)
Oita, Marilena, 29 October 2012
This thesis focuses on the extraction and analysis of Web data objects, investigated from temporal, structural, and semantic points of view. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this purpose. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Keywords, either acquired automatically through a TF-IDF analysis or extracted from Web feeds, guide the process of object identification, either at the level of a single Web page (SIGFEED algorithm) or across different pages sharing the same template (FOREST algorithm). Finally, in the context of the deep Web, we present a generic framework that aims to discover the semantic model of a Web object (here, a data record). It first uses FOREST to extract objects, and then represents the implicit rdf:type similarities between the object attributes and the entity of the Web interface as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is then aligned to a generic ontology such as YAGO to discover the graph's unknown types and relations.
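As a rough illustration of how a TF-IDF analysis can surface page-specific keywords, the sketch below ranks the terms of one page against a small set of pages that share boilerplate. It is not the SIGFEED or FOREST algorithm; the function name `tfidf_keywords`, the whitespace tokenization, and the sample pages are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical pages: "site menu login" stands in for shared template text.
pages = [
    "site menu login election results election night",
    "site menu login weather forecast rain",
    "site menu login football match report",
]
keywords = tfidf_keywords(pages, 0, top_k=3)
```

Terms that occur on every page receive an inverse document frequency of zero, so template text drops to the bottom of the ranking while article-specific terms rise to the top.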
|
2 |
Unsupervised discovery of relations for analysis of textual data in digital forensics
Louis, Anita Lily, 23 August 2010
This dissertation addresses the problem of analysing digital data in digital forensics. It shows that text mining methods can be adapted and applied to digital forensics to help analysts analyse data more quickly, efficiently and accurately and reveal truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together the events and facts of a crime. Finding relevant information quickly with current tools and methods is difficult because these tools rely heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework for evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of its presentation. The framework combines information extraction techniques with visual exploration techniques in the form of an evidence discovery system. Because it uses unrestricted, unsupervised information extraction techniques, the investigator does not need to supply input queries or keywords for searching and can therefore analyse portions of the data that keyword searches may not have identified. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text to establish ties between the concepts and provide an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects.
Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach to discovering relations based on co-occurrences of complex concepts. The second model utilises a linguistic approach using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics compared with the traditional information retrieval approach. It concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is viable and promising. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics.
Dissertation (MSc)--University of Pretoria, 2010. / Computer Science / unrestricted
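The statistical idea behind the first relation extraction model, relating concepts that co-occur, can be sketched in a much simplified form. The example below is not the dissertation's model: it treats concepts as plain lowercase strings, uses the sentence as the co-occurrence window, and the names `cooccurrence_edges` and `min_count` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, concepts, min_count=2):
    """Count concept pairs that appear together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Simple substring matching; a real system would use proper tokenization.
        present = sorted(c for c in concepts if c in text)
        counts.update(combinations(present, 2))
    # Keep only pairs seen at least min_count times.
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical evidence text and concept list.
sentences = [
    "Alice emailed Bob about the invoice.",
    "Bob forwarded the invoice to Carol.",
    "Alice met Bob on Friday.",
]
concepts = ["alice", "bob", "carol", "invoice"]
edges = cooccurrence_edges(sentences, concepts)
```

The surviving pairs and their counts can serve as weighted edges of a text graph of the kind the abstract describes, with no query terms supplied by the investigator.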
|
3 |
Discovering relations between indirectly connected biomedical concepts
Tsatsaronis, George, Weissenborn, Dirk, Schroeder, Michael, 04 January 2016
BACKGROUND:
The complexity and scale of the knowledge in the biomedical domain have motivated research on mining heterogeneous data from both structured and unstructured knowledge bases. To this end, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses this problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations that potentially characterize a biomedical relation, are mined using distant supervision.
RESULTS:
It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation, two frequent biomedical relations, namely "has target" and "may treat", are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8, which is a great improvement over random classification and shows that good predictions can be prioritized by following the suggested approach.
CONCLUSIONS:
Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.
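To make the notion of a path pattern concrete, the sketch below enumerates the relation sequences along simple directed paths between two concepts in a small graph of triples. It is only an illustration under stated assumptions: the function `path_patterns`, the `max_len` bound, and the placeholder concepts and relations are invented for this example and are neither the authors' implementation nor real biomedical facts.

```python
from collections import defaultdict

def path_patterns(triples, source, target, max_len=3):
    """Return relation sequences of simple directed paths from source to target."""
    adjacency = defaultdict(list)
    for subj, rel, obj in triples:
        adjacency[subj].append((rel, obj))
    patterns = []
    # Each stack entry: current node, relations so far, nodes already visited.
    stack = [(source, (), {source})]
    while stack:
        node, rels, seen = stack.pop()
        if node == target and rels:
            patterns.append(rels)
            continue
        if len(rels) == max_len:
            continue
        for rel, nxt in adjacency[node]:
            if nxt not in seen:
                stack.append((nxt, rels + (rel,), seen | {nxt}))
    return sorted(patterns)

# Placeholder graph mixing ontology-style and text-derived edges.
triples = [
    ("drugA", "inhibits", "proteinP"),
    ("proteinP", "associated_with", "diseaseD"),
    ("drugA", "mentioned_with", "diseaseD"),
    ("proteinP", "part_of", "pathwayW"),
]
patterns = path_patterns(triples, "drugA", "diseaseD")
```

Under distant supervision, concept pairs already known to stand in a relation such as "may treat" would supply positive labels, and patterns like these would serve as features for a classifier.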
|
5 |
Uso de informação linguística e análise de conceitos formais no aprendizado de ontologias / Use of linguistic information and formal concept analysis for ontology learning
Torres, Carlos Eduardo Atencio, 08 October 2012
Interest in the use of ontologies has been growing, but the process of ontology construction can be very time-consuming. Building an ontology requires a domain expert who also knows how to use an ontology editor. In order to reduce the time needed by the expert, we propose and analyse a supervised ontology learning (OL) method. The present work combines several OL techniques. First, we use a statistical technique called C/NC-values, with the help of the Cogroo tool, to extract the most representative terms from the text. These terms are then treated as concepts. We also design a constraint grammar (CG), based on linguistic information about Portuguese, to recognize and establish relations between concepts. To enrich the information in the ontology, we use formal concept analysis (FCA) to discover a parent (superconcept) for a set of concepts. To evaluate the method, we extracted ontologies from texts in three different domains and submitted them to evaluation by experts in each domain. A web site was built to make the evaluation process friendlier for the experts, and we used the evaluation framework proposed in the OntoMetrics method. The results show that our method provides an acceptable starting point for the construction of ontologies.
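The role of formal concept analysis in proposing a shared parent can be illustrated with its two derivation operators. The sketch below is a toy example, not the thesis's pipeline: the formal context, the helper names `common_attributes` and `objects_with`, and the attribute labels are assumptions chosen for illustration.

```python
def common_attributes(context, objects):
    """Intent: the attributes shared by every object in `objects`."""
    # Start from all attributes so that an empty object set yields all of them.
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared

def objects_with(context, attributes):
    """Extent: the objects that have every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

# Hypothetical formal context: concept -> attributes observed in text.
context = {
    "dog": {"animal", "barks"},
    "cat": {"animal", "meows"},
    "car": {"vehicle"},
}
intent = common_attributes(context, ["dog", "cat"])
extent = objects_with(context, intent)
```

Here the pair ({"dog", "cat"}, {"animal"}) is a formal concept, and its intent suggests "animal" as a candidate superconcept for the two terms.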
|
6 |
Toward Robust Information Extraction Models for Multimedia Documents
Ebadat, Ali-Reza, 17 October 2012
Over the last decade, huge quantities of multimedia documents have been generated. It is therefore important to find ways to manage these data, notably from a semantic point of view, which requires fine-grained knowledge of their content. There are two families of approaches for doing so: extracting information from the document itself (e.g., audio, image), or using textual data extracted from the document or from external sources (e.g., the Web). Our work belongs to the second family; the information extracted from the texts can then be used to annotate multimedia documents and facilitate their management. The objective of this thesis is therefore to develop such information extraction models. Because the texts extracted from multimedia documents are generally short and noisy, this work also attends to the robustness these models require. We have therefore favored simple techniques requiring little external knowledge as a guarantee of robustness, drawing on work in information retrieval and statistical text analysis. We concentrate on three tasks: supervised extraction of relations between entities, relation discovery, and discovery of entity classes. For relation extraction, we propose a supervised approach based on language models and the k-nearest-neighbors learning algorithm. Experimental results show the effectiveness and robustness of our models, which outperform state-of-the-art systems while using linguistic information that is simpler to obtain.
In the second task, we move to an unsupervised model to discover relations rather than extract predefined ones. We model this problem as a clustering task with a similarity function that is again based on language models. Performance, evaluated on a corpus of football match videos, shows the value of our approach compared with classical models. Finally, in the last task, we turn from relations to entities, an essential source of information in documents. We propose an entity clustering technique to bring out semantic classes among them, without prior assumptions, by adopting a new data representation that takes better account of each occurrence of the entities. In conclusion, we have shown experimentally that simple techniques, requiring little prior knowledge and using easily accessible linguistic information, can be sufficient to extract precise information from text effectively. In our case, these good results are obtained by choosing a suitable representation for the data, based on statistical analysis or information retrieval models. There is still a long way to go before multimedia documents can be processed directly, but we hope our proposals can serve as a springboard for future research in this area.
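A k-nearest-neighbors relation classifier of the general kind mentioned in this abstract can be sketched as follows. Note the substitution: the thesis bases its similarity on language models, whereas this example uses plain cosine similarity over bag-of-words counts for brevity. The function names, the relation labels, and the football-themed training snippets are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_label(train, query, k=3):
    """Label the query with the majority relation among its k nearest examples."""
    q = Counter(query.lower().split())
    ranked = sorted(
        train,
        key=lambda ex: cosine(Counter(ex[0].lower().split()), q),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: text between two entities, with a relation label.
train = [
    ("scores a goal for", "scores_for"),
    ("scored the winning goal for", "scores_for"),
    ("is replaced by", "substituted_by"),
    ("was replaced at halftime by", "substituted_by"),
]
label = knn_label(train, "scores the opening goal for", k=3)
```

Because the classifier only compares the query with stored examples, it needs no parser or external knowledge base, which is in the spirit of the simple, robust techniques the abstract argues for.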
|