1. Contextual lexicon-based sentiment analysis for social media. Muhammad, Aminu. January 2016.
Sentiment analysis concerns the computational study of opinions expressed in text. Social media domains provide a wealth of opinionated data, creating a greater need for sentiment analysis. Sentiment lexicons, which capture term-sentiment association knowledge, are commonly used to develop sentiment analysis systems. However, the nature of social media content calls for analysis methods and knowledge sources that can adapt to changing vocabulary. Existing sentiment lexicons cannot usefully handle social media vocabulary, which is typically informal and changeable yet rich in sentiment. This, in turn, has implications for an analyser's ability to capture the context of such text and to interpret sentiment polarity from the lexicons. In this thesis we use SentiWordNet, a popular sentiment-rich lexicon with substantial vocabulary coverage, and explore how to adapt it for social media sentiment analysis. Firstly, the thesis identifies a set of strategies to incorporate the effect of modifiers on sentiment-bearing terms (local context). These modifiers include contextual valence shifters, non-lexical sentiment modifiers typical of social media, and discourse structures. Secondly, the thesis introduces an approach in which a domain-specific lexicon is generated using a distant supervision method and integrated with a general-purpose lexicon, using a weighted strategy, to form a hybrid (domain-adapted) lexicon. This has the dual purpose of enriching the term coverage of the general-purpose lexicon with non-standard but sentiment-rich terms and of adjusting the sentiment semantics of its terms. Here, we identify two term-sentiment association metrics, based on Term Frequency and Inverse Document Frequency, that outperform the state-of-the-art Pointwise Mutual Information on social media data. As distant supervision may not be readily applicable in some social media domains, we also explore the cross-domain transferability of a hybrid lexicon. Thirdly, we introduce an approach for improving distant-supervised sentiment classification with knowledge from local-context analysis, domain-adapted (hybrid) lexicons and emotion lexicons. Finally, we conduct a comprehensive evaluation of all identified approaches using six sentiment-rich social media datasets.
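As a concrete (if simplified) illustration of the contrast drawn above, the Python sketch below scores a term's sentiment association with an SO-PMI-style metric and with a TF-IDF-style alternative over a tiny distantly labeled corpus. The exact formulations and the toy data are assumptions for illustration, not the metrics defined in the thesis.

```python
import math

# Toy distantly labeled corpus: (tokens, polarity) pairs, e.g. labeled via
# emoticons or hashtags. Purely illustrative data.
docs = [
    (["great", "movie", "loved", "it"], "pos"),
    (["awful", "plot", "hated", "it"], "neg"),
    (["great", "acting", "awful", "ending"], "pos"),
]

def pmi_score(term, docs):
    """SO-PMI-style association: PMI(term, pos) - PMI(term, neg)."""
    n = len(docs)
    n_term = sum(term in toks for toks, _ in docs)
    n_pos = sum(lab == "pos" for _, lab in docs)
    n_term_pos = sum(term in toks and lab == "pos" for toks, lab in docs)
    n_term_neg = n_term - n_term_pos
    eps = 1e-6  # smoothing to avoid log(0) and division by zero
    pmi_pos = math.log((n_term_pos / n + eps) / ((n_term / n) * (n_pos / n) + eps))
    pmi_neg = math.log((n_term_neg / n + eps) / ((n_term / n) * ((n - n_pos) / n) + eps))
    return pmi_pos - pmi_neg  # > 0 leans positive, < 0 leans negative

def tfidf_score(term, docs):
    """TF-IDF-style association: signed class-conditional term frequency,
    weighted by inverse document frequency over the whole corpus."""
    n = len(docs)
    df = sum(term in toks for toks, _ in docs)
    idf = math.log(n / (1 + df)) + 1
    tf_pos = sum(toks.count(term) for toks, lab in docs if lab == "pos")
    tf_neg = sum(toks.count(term) for toks, lab in docs if lab == "neg")
    return (tf_pos - tf_neg) * idf

print(pmi_score("great", docs), tfidf_score("great", docs))
```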

2. Predicting Knowledge Base Revisions from Realtime Text Streams. Konovalov, Alexander. 20 December 2018.
No description available.

3. Distant supervision for relation extraction using ontology class hierarchy-based features / Supervisão à distância em extração de relacionamentos usando características baseadas em hierarquia de classes em ontologias. Assis, Pedro Henrique Ribeiro de. 18 March 2015.
Relation extraction is a key step in deriving structure from natural-language text. In general, structures are composed of entities and the relationships among them. The most successful approaches to relation extraction apply supervised machine learning over hand-labeled corpora to create highly accurate classifiers. Although this achieves good robustness, hand-labeled corpora do not scale because they are expensive to create. In this work we apply an alternative paradigm, called distant supervision, for creating a considerable number of labeled instances for classification. Along with this approach, we adopt Semantic Web ontologies to propose and use new features for training classifiers. These features are based on the structure and semantics described by the ontologies in which Semantic Web resources are defined, and their use has a great impact on the precision and recall of our final classifiers. We apply our method to a corpus extracted from Wikipedia and achieve high precision and recall for a considerable number of relations.
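To make the distant-supervision paradigm concrete, the sketch below labels any sentence containing both arguments of a known fact as a positive example of that relation, and attaches a simple ontology-class feature to each match. The facts, sentences and class assignments are hypothetical placeholders, and plain substring matching stands in for real entity recognition.

```python
# Distant supervision for relation extraction, in miniature: any sentence that
# mentions both arguments of a known fact becomes a (noisy) positive example.
# All data below is a made-up placeholder, not the thesis's corpus.

kb = {
    ("Rio de Janeiro", "Brazil"): "locatedIn",
    ("PUC-Rio", "Rio de Janeiro"): "basedIn",
}

# Hypothetical ontology classes; the thesis derives richer features from the
# class hierarchy in which Semantic Web resources are defined.
entity_class = {"Rio de Janeiro": "City", "Brazil": "Country", "PUC-Rio": "University"}

sentences = [
    "Rio de Janeiro is the second-largest city in Brazil.",
    "PUC-Rio is a university in Rio de Janeiro.",
    "Brazil hosted the 2016 Olympics.",
]

def distant_label(sentences, kb):
    """Yield (sentence, subject, object, relation, class-pair feature) tuples."""
    for sent in sentences:
        for (subj, obj), rel in kb.items():
            if subj in sent and obj in sent:
                feature = f"{entity_class.get(subj, 'Thing')}->{entity_class.get(obj, 'Thing')}"
                yield sent, subj, obj, rel, feature

for example in distant_label(sentences, kb):
    print(example)
```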

4. Extrakce sémantických vztahů z nestrukturovaných dat v komerční sféře / Semantic relation extraction from unstructured data in the business domain. Rampula, Ilana. January 2016.
Text analytics in the business domain is a growing field in both research and practical applications. We concentrate on relation extraction from unstructured data provided by a corporate partner. Analyzing text from this domain requires a different approach, one that accounts for irregularities and domain-specific attributes. In this thesis, we present two methods for relation extraction: the Snowball system and the distant supervision method, both adapted for this unique data. The methods were implemented to use both structured and unstructured data from the company's database.
Keywords: Information Retrieval, Relation Extraction, Text Analytics, Distant Supervision, Snowball
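For readers unfamiliar with the first of the two adapted methods, here is a minimal sketch of one Snowball-style bootstrapping iteration: seed entity pairs induce textual patterns, and those patterns extract new candidate pairs. The corpus, the seed pair and the naive string matching are illustrative assumptions; the real system scores patterns and tuples by confidence rather than matching literally.

```python
import re

# One Snowball-style bootstrapping iteration over a toy corpus: seed pairs
# induce literal patterns, which extract new candidate pairs.
corpus = [
    "Acme Corp is headquartered in Prague.",
    "Globex is headquartered in Brno.",
    "Initech operates mainly in retail.",
]

seeds = {("Acme Corp", "Prague")}  # hypothetical (organization, city) seed

def induce_patterns(corpus, pairs):
    """Collect the text found between the two arguments of known pairs."""
    patterns = set()
    for sent in corpus:
        for org, city in pairs:
            if org in sent and city in sent:
                middle = sent.split(org, 1)[1].split(city, 1)[0]
                patterns.add(middle.strip())
    return patterns

def apply_patterns(corpus, patterns):
    """Match '<X> pattern <Y>.' against the corpus to propose new pairs."""
    found = set()
    for sent in corpus:
        for pat in patterns:
            m = re.search(rf"(.+?)\s+{re.escape(pat)}\s+(.+?)\.", sent)
            if m:
                found.add((m.group(1).strip(), m.group(2).strip()))
    return found

patterns = induce_patterns(corpus, seeds)  # {'is headquartered in'}
print(apply_patterns(corpus, patterns))    # includes ('Globex', 'Brno')
```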

5. Leveraging distant supervision for improved named entity recognition. Ghaddar, Abbas. 03 1900.
Recent years have seen a leap in deep learning techniques that has greatly changed the way Natural Language Processing (NLP) tasks are tackled. Within a couple of years, neural networks and word embeddings became central components adopted throughout the field.
Distant supervision (DS) is a well-established technique in NLP for producing labeled data from partially annotated examples. Traditionally, such data has been used for training in the absence of manual annotations, or as additional training data to improve generalization performance.
In this thesis, we study how distant supervision can be employed within a modern deep-learning-based NLP framework. Since deep learning algorithms improve as more data is provided (especially for representation learning), we revisit the task of generating distant-supervision data from Wikipedia. We apply post-processing treatments to the original dump to further increase the quantity of labeled examples while introducing a reasonable amount of noise.
We then explore different methods for using distant-supervision data for representation learning, mainly to learn classic and contextualized word representations. Because of its importance as a basic component of many NLP applications, we choose named-entity recognition (NER) as our main task. We experiment on standard NER benchmarks and obtain state-of-the-art performance. In doing so, we also investigate a more interesting setting, namely improving cross-domain (generalization) performance.
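As a rough picture of what generating such data can look like, the sketch below tags tokens of Wikipedia-style text by looking anchor targets up in a toy type table and emitting BIO labels. The simplified [[...]] markup, the type table and the whitespace tokenization are all assumptions; the thesis's pipeline and its noise-reducing post-processing are considerably more involved.

```python
import re

# Toy distant labeling of NER data from Wikipedia-style text: anchor targets
# are looked up in a type table and the anchor's tokens receive BIO tags.
kb_type = {"Barack Obama": "PER", "Hawaii": "LOC", "Harvard University": "ORG"}

def annotate(wiki_sentence):
    """Convert '[[Anchor]]' markup into (token, BIO-tag) pairs."""
    tokens = []
    for part in re.split(r"(\[\[.*?\]\])", wiki_sentence):
        if part.startswith("[[") and part.endswith("]]"):
            anchor = part[2:-2]
            words = anchor.split()
            etype = kb_type.get(anchor, "MISC")  # unknown anchors fall back to MISC
            tokens.append((words[0], f"B-{etype}"))
            tokens.extend((w, f"I-{etype}") for w in words[1:])
        else:
            tokens.extend((w, "O") for w in part.split())
    return tokens

print(annotate("[[Barack Obama]] was born in [[Hawaii]] ."))
# [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('was', 'O'), ('born', 'O'),
#  ('in', 'O'), ('Hawaii', 'B-LOC'), ('.', 'O')]
```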

6. Créer un corpus annoté en entités nommées avec Wikipédia et WikiData : de mauvais résultats et du potentiel / Creating a named-entity-annotated corpus with Wikipedia and WikiData: poor results but potential. Pagès, Lucas. 04 1900.
This master's thesis explores the joint use of WikiData and Wikipedia to build an annotated named-entity recognition (NER) corpus: DataNER. It follows work that used the knowledge bases Freebase and DBpedia, and attempts to replace them with WikiData, a collaborative knowledge base whose continuous growth is guaranteed by an active community. Unfortunately, the results of the process described in this thesis did not meet our initial expectations.
The document first describes the way in which we build DataNER. Wikipedia anchors let us identify a significant quantity of named entities in the resource, and the NECKAr toolkit labels them with the classes LOC, PER, ORG and MISC using WikiData. We detail this corpus-building process, including how we infer additional named entities from Wikipedia and WikiData and how we calibrate its parameters with the information available to us.
We then compare DataNER with similar corpora, using NER models trained on each of them as well as manual comparisons. These comparisons allow us to identify several reasons why the quality of DataNER does not match that of the other corpora.
We conclude with possible improvements to DataNER and a comment on the work accomplished, while insisting on the potential of this corpus-creation method.
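To make the classification step concrete, here is a minimal sketch of assigning the four classes by walking WikiData-style instance-of (P31) and subclass-of (P279) edges up to a mapped root class, which is the general idea behind tools such as NECKAr. The tiny hand-made graph is a placeholder, not real WikiData content.

```python
# Map items to coarse NER classes through the instance-of (P31) /
# subclass-of (P279) hierarchy. Toy hand-made graph, not a real WikiData dump.

subclass_of = {                 # P279 edges: class -> parent class
    "city": "geographic location",
    "university": "organization",
}

instance_of = {                 # P31 edges: item -> class
    "Montreal": "city",
    "Université de Montréal": "university",
    "Alan Turing": "human",
}

ROOT_TO_TAG = {"human": "PER", "geographic location": "LOC", "organization": "ORG"}

def ner_class(item):
    """Walk up the subclass hierarchy until a mapped root class is reached."""
    cls = instance_of.get(item)
    while cls is not None:
        if cls in ROOT_TO_TAG:
            return ROOT_TO_TAG[cls]
        cls = subclass_of.get(cls)
    return "MISC"  # fallback when no mapped root is found

for item in instance_of:
    print(item, ner_class(item))
# Montreal LOC / Université de Montréal ORG / Alan Turing PER
```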

7. Boosting Supervised Neural Relation Extraction with Distant Supervision. Dhyani, Dushyanta. 24 August 2018.
No description available.