  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Apport des ontologies de domaine pour l'extraction de connaissances à partir de données biomédicales / Contribution of domain ontologies for knowledge discovery in biomedical data

Personeni, Gabin 09 November 2018 (has links)
The Semantic Web proposes standards and tools to formalize and share knowledge on the Web in the form of ontologies. Biomedical ontologies and their associated data constitute a vast collection of complex, heterogeneous, and interlinked knowledge, whose analysis presents great opportunities in healthcare, for instance in pharmacovigilance. This thesis explores several ways to use this biomedical knowledge in the data mining step of a knowledge discovery process; in particular, it proposes three methods in which several ontologies cooperate to improve data mining results. The first contribution is a method based on pattern structures, an extension of formal concept analysis, to extract associations between adverse drug events from patient data. Here a phenotype ontology and a drug ontology cooperate to allow a semantic comparison of these complex adverse events, leading to the discovery of associations at varying degrees of generalization, for instance at the drug or drug-class level. The second contribution uses a numerical method based on semantic similarity measures to classify genetic intellectual disabilities, characterized both by their phenotypes and by the functions of their associated genes. Two similarity measures, computed in different ways, are studied and applied with different combinations of phenotype and gene-function ontologies; in particular, the thesis quantifies the influence of each domain of knowledge on classification performance and shows how these ontologies can cooperate within such numerical methods. The third contribution uses the data component of the Semantic Web, the Linked Open Data (LOD), together with linked ontologies, to characterize genes responsible for intellectual disabilities. Inductive Logic Programming (ILP) proves well suited to mining relational data such as LOD while exploiting domain knowledge from ontologies through reasoning mechanisms; here, ILP extracts from LOD and ontologies a descriptive and predictive model of the genes responsible for intellectual disabilities. Together, these contributions show that one or several ontologies can usefully cooperate in various data mining processes.
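The similarity-based classification described in this abstract rests on semantic similarity measures computed over ontology hierarchies. As a hedged illustration of one classical measure in that family (Wu-Palmer similarity; the toy phenotype hierarchy and term names below are invented for demonstration, not taken from the thesis):

```python
# Illustrative sketch of an ontology-based semantic similarity measure
# (Wu-Palmer style). The toy is-a hierarchy is invented, not the thesis's.

TOY_ONTOLOGY = {  # child -> parent (is-a)
    "seizure": "neurological_abnormality",
    "ataxia": "neurological_abnormality",
    "neurological_abnormality": "phenotypic_abnormality",
    "short_stature": "growth_abnormality",
    "growth_abnormality": "phenotypic_abnormality",
    "phenotypic_abnormality": "root",
}

def ancestors(term):
    """Return the path from the term up to the root, term first."""
    path = [term]
    while term in TOY_ONTOLOGY:
        term = TOY_ONTOLOGY[term]
        path.append(term)
    return path

def depth(term):
    return len(ancestors(term)) - 1  # the root has depth 0

def wu_palmer(a, b):
    """Similarity = 2 * depth(LCA) / (depth(a) + depth(b))."""
    anc_a = ancestors(a)
    anc_b = set(ancestors(b))
    lca = next(t for t in anc_a if t in anc_b)  # lowest common ancestor
    return 2 * depth(lca) / (depth(a) + depth(b))
```

Two phenotypes that share a deep common ancestor (e.g. two neurological signs) score higher than phenotypes whose only common ancestor is near the root, which is what lets such measures compare complex patient profiles at several levels of generalization.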
42

SIMCOP: Um Framework para Análise de Similaridade em Sequências de Contextos / SIMCOP: A Framework for Similarity Analysis in Context Sequences

Wiedemann, Tiago 28 March 2014 (has links)
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico / Ubiquitous computing, which studies ways to integrate technology into people's everyday lives, is an area that has grown in recent years, especially with the development of technologies such as mobile computing. A key issue in developing this type of application is context awareness, which enables an application to adapt its behavior to the user's current situation. To make this possible, several authors have formally defined what a context is and how to represent it, and on top of such formalizations, techniques for analyzing contextual data have been proposed to support predictions, inferences, and other analyses. This dissertation specifies a framework called SIMCOP (SIMilar Context Path) for analyzing the similarity between sequences of contexts visited by an entity. This type of analysis identifies similar contexts in order to provide features such as entity and context recommendation, entity classification, and context prediction. A prototype of the framework was implemented, and two recommendation applications were developed with it, one of them by an independent developer, which made it possible to evaluate the framework's effectiveness. The evaluations conducted in this research show that context similarity analysis can be useful beyond ubiquitous computing, for example in data mining and collaborative filtering systems: any data set that can be described as a context can be analyzed with the similarity analysis techniques implemented by the framework.
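One way to make "similarity between sequences of contexts" concrete is an edit-distance-style alignment where each context is a set of attribute values. This is an invented sketch in the spirit of the abstract, not SIMCOP's actual algorithm:

```python
# Hypothetical sketch of comparing two sequences of contexts: each context
# is a set of attribute values, contexts are compared with Jaccard
# similarity, and sequences are aligned with an edit-distance dynamic
# program. Illustrative only; not the SIMCOP implementation.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def sequence_similarity(seq_a, seq_b):
    """1 - normalized edit distance; substituting context x for y costs
    1 - jaccard(x, y), insertions and deletions cost 1."""
    if not seq_a and not seq_b:
        return 1.0
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 - jaccard(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return 1 - d[n][m] / max(n, m)
```

With this kind of measure, two users who visit similar contexts in a similar order score close to 1, which is the basis for recommendation and prediction features like those the framework provides.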
43

Syntactic and Semantic Analysis and Visualization of Unstructured English Texts

Karmakar, Saurav 14 December 2011 (has links)
People have complex thoughts, and they often express them in complex sentences in natural language. This complexity can make communication efficient among an audience that shares the same knowledge base, but for a different or new audience such compositions become cumbersome to understand and analyze. Analyzing such compositions with syntactic or semantic measures is challenging and constitutes a base step of natural language processing. In this dissertation I explore and propose a number of new techniques to analyze and visualize the syntactic and semantic patterns of unstructured English texts. The syntactic analysis is performed through a proposed visualization technique that categorizes and compares English compositions based on several reading-complexity metrics. For the semantic analysis I use Latent Semantic Analysis (LSA) to uncover hidden patterns in complex compositions; I have applied this technique to comments from a social visualization website to detect irrelevant ones (e.g., spam). Patterns of collaboration are also studied through statistical analysis. Word sense disambiguation is used to determine the correct sense of a word in a sentence or composition. Finally, a textual similarity measure, built on word-similarity measures and word sense disambiguation and applied to collaborative text snippets from a social collaborative environment, points toward a way of untangling the complex hidden patterns of collaboration.
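LSA, the core semantic tool this abstract names, can be sketched in a few lines: build a term-document matrix, truncate its singular value decomposition, and compare documents by cosine similarity in the latent space. A minimal generic illustration (not the dissertation's code):

```python
# Minimal LSA sketch: term-document counts -> truncated SVD -> cosine
# similarities between documents in the latent space. Illustrative only.
import numpy as np

def lsa_similarity(docs, k=2):
    vocab = sorted({w for d in docs for w in d.split()})
    # term-document count matrix (rows: terms, columns: documents)
    A = np.array([[d.split().count(w) for d in docs] for w in vocab], float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    D = (np.diag(s[:k]) @ Vt[:k]).T          # documents in k-dim latent space
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    return D @ D.T                            # pairwise cosine similarities
```

Documents about the same topic end up close in the latent space even when they do not share every surface word, which is what makes LSA usable for spotting comments that are semantically unrelated to the rest (e.g., spam).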
44

Εξόρυξη θεματικών αλυσίδων από ιστοσελίδες για την δημιουργία ενός θεματολογικά προσανατολισμένου προσκομιστή / Lexical chain extraction for the creation of a topical focused crawler

Κοκόσης, Παύλος 16 May 2007 (has links)
Topic-focused crawlers are applications that aim to collect web pages on a specific topic from the Web; building them is an open research field. In this master's thesis we develop a topic-focused crawler using lexical chains. Lexical chains are an important lexical and computational tool for representing the meaning of a text, and have been used successfully in automatic text summarization and in classifying texts into thematic categories. We present procedures for scoring hyperlinks and web pages, as well as for computing the semantic similarity between documents using lexical chains, and we combine and embed these procedures in a topic-focused crawler, whose experimental results are very promising.
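A focused crawler needs a page score to decide which links to follow first. As a loose, invented sketch of the general idea (the thesis's chain construction and scoring functions are more elaborate): treat the topic's lexical chain as a set of related words and score a page by its chain-word density.

```python
# Hedged sketch: scoring a page for a focused crawler by the density of
# lexical-chain words in its text. The chain below is a toy stand-in for
# a real lexical chain built from WordNet-style relations.

def page_score(text, chain):
    """Fraction of the page's words that belong to the topic's chain."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in chain)
    return hits / len(words)
```

The crawler's frontier can then be a priority queue keyed on the score of the page each link was found on, so on-topic regions of the Web are explored before off-topic ones.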
45

Αυτόματη εξαγωγή λεξικής - σημασιολογικής γνώσης από ηλεκτρονικά σώματα κειμένων με χρήση ελαχίστων πόρων / Automatic extraction of lexico - semantic knowledge from electronic text corpora using minimal resources

Θανόπουλος, Αριστομένης 25 June 2007 (has links)
The research described in this dissertation concerns the automatic extraction of collocations and lexico-semantic word similarities from large text corpora. We follow a minimal-linguistic-resources approach in order to achieve unrestricted portability across languages and thematic domains. To evaluate the proposed methods we propose, assess, and apply evaluation methodologies based on English gold-standard lexical resources such as WordNet. For collocation extraction we propose and test several novel measures for identifying statistically significant bigrams and, more generally, n-grams, which perform well. For extracting lexico-semantic similarities we follow a distributional, window-based approach: we study the contextual scope, the filtering of lexical co-occurrences, and the performance of similarity measures, proposing the incorporation of the number of common parameters into the latter, the exploitation of function words, and a method for eliminating systematic errors. Moreover, we propose a novel way to exploit word-sequence similarities, common in thematically focused texts, based on the cross-correlation of word sequences. We refine an approach for extracting word similarity from coordinations, and we propose a method for merging lexico-semantic similarity databases extracted by different principles and methods. Finally, the extracted similarity knowledge is transformed into soft hierarchical semantic clusters and successfully incorporated into a machine-learning-based dialogue system, where it reinforces the recognition of the user's plan by estimating the semantic role of unknown words.
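The family of statistical bigram measures this abstract evaluates can be illustrated with one classic member, pointwise mutual information, which compares a bigram's observed frequency with what independence would predict. A generic sketch (the thesis's own novel measures are not reproduced here):

```python
# Generic collocation-extraction sketch: score each adjacent bigram by
# pointwise mutual information (PMI). High PMI marks word pairs that
# co-occur far more often than chance, e.g. "new york".
import math
from collections import Counter

def pmi_bigrams(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), f in bigrams.items():
        p_bigram = f / (n - 1)
        p_indep = (unigrams[a] / n) * (unigrams[b] / n)
        scores[(a, b)] = math.log2(p_bigram / p_indep)
    return scores
```

Ranking bigrams by such a score and keeping the top of the list is the usual minimal-resources route to a collocation lexicon, since it needs no grammar or dictionary for the target language.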
46

Μελέτη και συγκριτική αξιολόγηση μεθόδων δόμησης περιεχομένου ιστοτόπων : εφαρμογή σε ειδησεογραφικούς ιστοτόπους / Study and comparative evaluation of website content structuring methods: application to news sites

Στογιάννος, Νικόλαος-Αλέξανδρος 20 April 2011 (has links)
Organizing a website's content so as to increase the findability of information and help users complete their typical tasks is one of the primary goals of website designers. Existing human-computer interaction techniques that support this goal are often neglected because of the time and cost they demand. For news sites in particular, both their size and the daily addition and modification of content call for more efficient content-organization techniques. In this thesis we investigate the effectiveness of a method called AutoCardSorter, proposed in the literature for semi-automatically categorizing web pages based on the semantic similarity of their content, in the context of organizing the information of news sites. To this end we conducted five studies comparing, both quantitatively and qualitatively, the categorizations produced by participants in corresponding open and closed card-sorting studies with the results of AutoCardSorter. The analysis showed that AutoCardSorter produced article groupings in high agreement with those of the study participants, but in a significantly more efficient way, confirming previous similar studies on websites of other themes (e.g., travel, education). The studies also showed that a slightly modified version of AutoCardSorter places new articles into pre-existing categories with considerably lower agreement with the participants' choices. The thesis concludes with directions for improving the effectiveness of AutoCardSorter, both for organizing news site content and in general.
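Card-sorting automation of the kind described here typically reduces to clustering pages from a pairwise semantic-similarity matrix. As a hedged sketch of that general step (average-linkage agglomerative merging with a stop threshold; the similarity matrix would come from a semantic measure such as LSA, and this is not the AutoCardSorter tool itself):

```python
# Illustrative agglomerative grouping from a pairwise similarity matrix:
# repeatedly merge the two most similar clusters (average pairwise
# similarity) until no pair exceeds the threshold. Not AutoCardSorter.
from itertools import combinations

def avg_sim(ca, cb, sim):
    return sum(sim[i][j] for i in ca for j in cb) / (len(ca) * len(cb))

def cluster(n_items, sim, threshold):
    clusters = [[i] for i in range(n_items)]
    while len(clusters) > 1:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]], sim),
        )
        if avg_sim(clusters[a], clusters[b], sim) < threshold:
            break
        clusters[a] += clusters.pop(b)  # a < b, so indices stay valid
    return clusters
```

Cutting the merge process at a threshold (or at a target number of categories) yields the candidate information-architecture groups that participants' card sorts are then compared against.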
47

SOM4SImD : um método semântico baseado em ontologia para detectar similaridade entre documentos / SOM4SImD: an ontology-based semantic method for detecting similarity between documents

Arruda, Claudineia Gonçalves de 13 February 2017 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / In several research areas, interviews are a widely used means of collecting data. They are usually spread across several documents and written in informal language, since they record conversations among several people at once. Analyzing such documents is an arduous and time-consuming task, bringing fatigue and hindering correct analysis. One solution is to group the documents according to the similarity between them, so that experts can analyze documents on similar subjects more quickly. This work therefore presents SOM4SImD, a method created to detect the semantic similarity between documents consisting of interviews written in informal Brazilian Portuguese. The method relies on an ontology of the same domain as the documents, whose formal terms, together with their synonyms and variants, are used to semantically annotate the documents and to compute the similarity between pairs of interviews. On top of this method, an approach called SimIGroup was developed to assist researchers in the qualitative analysis of the documents using the coding technique. The results show that SOM4SImD and SimIGroup reduce the difficulty and fatigue of document analysis for the annotators, helping to increase the number of documents analyzed. Moreover, SOM4SImD proved more effective at detecting similarity between documents than other methods found in the literature, reaching significant values on the performance measures: 0.96 precision, 0.93 recall, and 0.94 F-measure.
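The two steps the abstract describes, ontology-based semantic annotation followed by similarity between annotation sets, can be sketched minimally. The synonym table and medical terms below are invented for illustration and are not the thesis's ontology:

```python
# Simplified sketch: annotate each interview with the ontology terms whose
# labels or synonyms appear in it, then score document similarity as
# Jaccard over the annotation sets. Toy synonym table, not the real one.

SYNONYMS = {  # surface form -> canonical ontology term (invented)
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}

def annotate(text):
    text = text.lower()
    return {term for form, term in SYNONYMS.items() if form in text}

def doc_similarity(doc_a, doc_b):
    a, b = annotate(doc_a), annotate(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Mapping informal surface forms to canonical ontology terms is what lets two interviews that never share a literal word still be recognized as discussing the same concept.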
48

Aprendizado automático de relações semânticas entre tags de folksonomias / Machine learning of semantic relations between folksonomy tags

RÊGO, Alex Sandro da Cunha. 05 June 2018 (has links)
Folksonomies have emerged as useful tools for the online management of digital content. Popular websites such as Delicious, Flickr, and BibSonomy are used daily by thousands of users to upload digital content (e.g., web pages, photos, videos, and bibliographic references) and tag it for later retrieval. The lack of semantic relations such as synonymy and hypernymy/hyponymy in the tag space diminishes users' ability to find relevant resources. Many research works in the literature employ similarity measures to detect synonymy and to build tag hierarchies automatically by means of heuristic algorithms. In this thesis, the detection of synonymy and hypernymy/hyponymy between pairs of tags is cast as a pairwise classification problem in machine learning. Several similarity measures known from the literature to be good indicators of synonymy and hypernymy/hyponymy were identified and used as learning features. The severe class imbalance and class overlap observed in this setting motivated the investigation of class-balancing techniques to overcome both problems. A comprehensive set of experiments on two large real-world datasets from the BibSonomy and Delicious systems showed that the proposed approach, named CPDST, outperforms the best-performing heuristic baseline in both detection tasks. CPDST was also applied to generating lists of semantically related tags, providing access to additional resources annotated with other concepts from the search domain. In addition to CPDST, two algorithms based on WordNet and ConceptNet were proposed for suggesting specialized lists of synonymous and hyponymous tags. A quantitative evaluation showed that CPDST provides relevant tag lists compared with those produced by the other methods.
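The "similarity measures as learning features" setup can be sketched concretely: each tag pair yields a feature vector of co-occurrence-based similarities, which a pairwise classifier then consumes. The particular measures and data below are an invented illustration in the spirit of the CPDST setup, not its actual feature set:

```python
# Sketch: build learning features for a tag pair from the sets of
# resources each tag annotates. Each similarity measure becomes one
# attribute for a pairwise classifier. Assumes non-empty tag profiles.
import math

def features(tag_a, tag_b, assignments):
    """assignments: dict mapping tag -> set of resource ids tagged with it."""
    ra, rb = assignments[tag_a], assignments[tag_b]
    jaccard = len(ra & rb) / len(ra | rb)
    # overlap stays high when one tag's resources are a subset of the
    # other's -- a rough signal of hypernymy/hyponymy
    overlap = len(ra & rb) / min(len(ra), len(rb))
    cosine = len(ra & rb) / math.sqrt(len(ra) * len(rb))
    return [jaccard, overlap, cosine]
```

The intuition the example encodes: synonymous tags tend to score high on all measures, while a hyponym scores high on overlap but lower on Jaccard against its hypernym, so the combination is informative for distinguishing the two relation types.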
49

Unsupervised Information Extraction From Text – Extraction and Clustering of Relations between Entities / Extraction d'Information Non Supervisée à Partir de Textes – Extraction et Regroupement de Relations entre Entités

Wang, Wei 16 May 2013 (has links)
L'extraction d'information non supervisée en domaine ouvert est une évolution récente de l'extraction d'information adaptée à des contextes dans lesquels le besoin informationnel est faiblement spécifié. Dans ce cadre, la thèse se concentre plus particulièrement sur l'extraction et le regroupement de relations entre entités en se donnant la possibilité de traiter des volumes importants de données.L'extraction de relations se fixe plus précisément pour objectif de faire émerger des relations de type non prédéfini à partir de textes. Ces relations sont de nature semi-structurée : elles associent des éléments faisant référence à des structures de connaissance définies a priori, dans le cas présent les entités qu’elles relient, et des éléments donnés uniquement sous la forme d’une caractérisation linguistique, en l’occurrence leur type. Leur extraction est réalisée en deux temps : des relations candidates sont d'abord extraites sur la base de critères simples mais efficaces pour être ensuite filtrées selon des critères plus avancés. Ce filtrage associe lui-même deux étapes : une première étape utilise des heuristiques pour éliminer rapidement les fausses relations en conservant un bon rappel tandis qu'une seconde étape se fonde sur des modèles statistiques pour raffiner la sélection des relations candidates.Le regroupement de relations a quant à lui un double objectif : d’une part, organiser les relations extraites pour en caractériser le type au travers du regroupement des relations sémantiquement équivalentes et d’autre part, en offrir une vue synthétique. Il est réalisé dans le cas présent selon une stratégie multiniveau permettant de prendre en compte à la fois un volume important de relations et des critères de regroupement élaborés. 
Un premier niveau de regroupement, dit de base, réunit des relations proches par leur expression linguistique grâce à une mesure de similarité vectorielle appliquée à une représentation de type « sac-de-mots » pour former des clusters fortement homogènes. Un second niveau de regroupement est ensuite appliqué pour traiter des phénomènes plus sémantiques tels que la synonymie et la paraphrase et fusionner des clusters de base recouvrant des relations équivalentes sur le plan sémantique. Ce second niveau s'appuie sur la définition de mesures de similarité au niveau des mots, des relations et des clusters de relations en exploitant soit des ressources de type WordNet, soit des thésaurus distributionnels. Enfin, le travail illustre l’intérêt de la mise en œuvre d’un clustering des relations opéré selon une dimension thématique, en complément de la dimension sémantique des regroupements évoqués précédemment. Ce clustering est réalisé de façon indirecte au travers du regroupement des contextes thématiques textuels des relations. Il offre à la fois un axe supplémentaire de structuration des relations facilitant leur appréhension globale mais également le moyen d’invalider certains regroupements sémantiques fondés sur des termes polysémiques utilisés avec des sens différents. La thèse aborde également le problème de l'évaluation de l'extraction d'information non supervisée par l'entremise de mesures internes et externes. Pour les mesures externes, une méthode interactive est proposée pour construire manuellement un large ensemble de clusters de référence. Son application sur un corpus journalistique de grande taille a donné lieu à la construction d'une référence vis-à-vis de laquelle les différentes méthodes de regroupement proposées dans la thèse ont été évaluées. 
/ Unsupervised open-domain information extraction has recently gained importance by loosening the constraints on the strict definition of the extracted information, allowing the design of more open information extraction systems. In this new domain, this thesis focuses on the extraction and clustering of relations between entities at a large scale. The objective of relation extraction is to discover relations of unknown types from texts. A relation prototype is first defined, from which candidate relation instances are initially extracted with a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedure is applied: a first step uses filtering heuristics to efficiently remove a large number of false relations, and a second step uses statistical models to refine the selection of relation candidates. The objective of relation clustering is to organize the extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end users. A multi-level clustering procedure is designed, which takes into account both the massive data and the diversity of linguistic phenomena. First, the basic clustering groups relation instances that are similar in their linguistic expression, using simple similarity measures over a bag-of-words representation, to form highly homogeneous basic clusters. Second, the semantic clustering groups basic clusters whose relation instances share the same meaning, dealing more particularly with phenomena such as synonymy or more complex paraphrase. Different similarity measures, based either on resources such as WordNet or on distributional thesauri, are analyzed at the level of words, relation instances, and basic clusters.
Moreover, a topic-based relation clustering is proposed to take thematic information into account so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building a reference of relation clusters is proposed. The application of this method to a newspaper corpus results in a large reference, against which the different clustering methods are evaluated.
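One common external measure for comparing produced clusters against such a manually built reference is purity; the thesis does not specify which external measures it uses, so the following is only a representative sketch. Each cluster is credited for its most frequent gold label, and the total is normalized by the number of clustered items:

```python
from collections import Counter

def purity(clusters, reference):
    """External clustering evaluation.

    clusters:  list of clusters, each a list of item identifiers.
    reference: dict mapping each item identifier to its gold label.
    Returns the fraction of items assigned to the majority label
    of their cluster (1.0 = every cluster is label-pure).
    """
    total = sum(len(c) for c in clusters)
    correct = 0
    for cluster in clusters:
        labels = Counter(reference[item] for item in cluster)
        correct += labels.most_common(1)[0][1]
    return correct / total
```

Purity rewards homogeneous clusters but not completeness (many tiny clusters score perfectly), which is why external evaluations usually pair it with a recall-oriented or combined measure.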

Recomendação de conteúdo baseada em informações semânticas extraídas de bases de conhecimento / Content recommendation based on semantic information extracted from knowledge bases

Salmo Marques da Silva Junior 10 May 2017
To assist users during the consumption of products, Web systems have come to incorporate item recommendation modules. The most popular approaches are content-based recommendation, which recommends items based on features of interest to the user, and collaborative filtering, which recommends items well rated by users with profiles similar to the target user's, or items similar to those the target user rated well. While the first approach has limitations such as overspecialization and limited content analysis, the second faces problems such as the new user and/or new item, also known as cold start. Despite the variety of available techniques, a common problem in most approaches is the lack of semantic information to represent the items in the collection. Recent work in the field of recommender systems has studied the possibility of using Web knowledge bases as a source of semantic information. However, it is still necessary to investigate how to take advantage of such information and integrate it efficiently into recommender systems. This work therefore aims to investigate how semantic information from knowledge bases can benefit recommender systems through the semantic description of items, and how computing semantic similarity can mitigate the challenge faced in the cold-start scenario. As a result, a technique is obtained that can generate recommendations suited to users' profiles, including relevant new items from the collection. An improvement of up to 10% in RMSE can be observed in the cold-start scenario when comparing the proposed system with a system whose rating prediction is based on rating correlation. / In order to support users during the consumption of products, Web systems have incorporated recommendation techniques.
The most popular approaches are content-based filtering, which recommends items based on features of interest to the user, and collaborative filtering, which recommends items that were well rated by users with preferences similar to the target user's, or that have features similar to items the user rated positively. While the first approach has limitations such as overspecialization and limited content analysis, the second suffers from the new-user and new-item problems, also known as cold start. In spite of the variety of available techniques, a common problem is the lack of semantic information to represent item features. Recent work in the field of recommender systems has studied the possibility of using knowledge bases from the Web as a source of semantic information. However, it is still necessary to investigate how to use and integrate such semantic information in recommender systems. This work therefore investigates how semantic information gathered from knowledge bases can help recommender systems by semantically describing items, and how semantic similarity can overcome the challenge posed by the cold-start scenario. As a result, we obtained a technique that can produce recommendations suited to user profiles, including relevant new items available in the collection. An improvement of up to 10% in RMSE can be observed in the cold-start scenario when comparing the proposed system with a system whose rating prediction is based on rating correlation.
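The cold-start idea can be sketched concretely: when a new item has no ratings, predict a user's rating for it as a similarity-weighted average of that user's past ratings, where similarity is computed over semantic item descriptions. The feature vectors and item names below are hypothetical illustrations (in practice they might come from a knowledge base such as DBpedia), not the system described in the thesis:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse semantic feature vectors."""
    num = sum(a[f] * b[f] for f in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def predict_rating(user_ratings, item_features, new_item):
    """Predict the user's rating for a never-rated item as the
    semantic-similarity-weighted average of their past ratings."""
    num = den = 0.0
    for item, rating in user_ratings.items():
        sim = cosine(item_features[item], item_features[new_item])
        num += sim * rating
        den += sim
    return num / den if den else None
```

Because the prediction needs only the new item's semantic description, not any ratings for it, this kind of predictor remains usable exactly where rating-correlation methods break down.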
