• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • 1
  • Tagged with
  • 2
  • 2
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Using WordNet Synonyms and Hypernyms in Automatic Topic Detection

Wargärde, Nicko January 2020 (has links)
Detecting topics by extracting keywords from written text using TF-IDF has been studied and successfully used in many applications. Adding a semantic layer to TF-IDF-based topic detection using WordNet synonyms and hypernyms has been explored in document clustering by assigning concepts that describe texts or by adding all synonyms and hypernyms that occurring words have to a list of keywords. A new method where TF-IDF scores are calculated and WordNet synset members’ TF-IDFscores are added together to all occurring synonyms and/or hypernyms is explored in this paper. Here, such an approach is evaluated by comparing extracted keywords using TF-IDF and the new proposed method, SynPlusTF-IDF, against manually assigned keywords in a database of scientific abstracts. As topic detection is widely used in many contexts and applications, improving current methods is of great value as the methods can become more accurate at extracting correct and relevant keywords from written text. An experiment was conducted comparing the two methods and their accuracy measured using precision and recall and by calculating F1-scores.The F1-scores ranged from 0.11131 to 0.14264 for different variables and the results show that SynPlusTF-IDF is not better at topic detection compared to TF-IDF and both methods performed poorly at topic detection with the chosen dataset.
2

Contribution à l’analyse sémantique des textes arabes

Lebboss, Georges 08 July 2016 (has links)
La langue arabe est pauvre en ressources sémantiques électroniques. Il y a bien la ressource Arabic WordNet, mais il est pauvre en mots et en relations. Cette thèse porte sur l’enrichissement d’Arabic WordNet par des synsets (un synset est un ensemble de mots synonymes) à partir d’un corpus général de grande taille. Ce type de corpus n’existe pas en arabe, il a donc fallu le construire, avant de lui faire subir un certain nombre de prétraitements.Nous avons élaboré, Gilles Bernard et moi-même, une méthode de vectorisation des mots, GraPaVec, qui puisse servir ici. J’ai donc construit un système incluant un module Add2Corpus, des prétraitements, une vectorisation des mots à l’aide de patterns fréquentiels générés automatiquement, qui aboutit à une matrice de données avec en ligne les mots et en colonne les patterns, chaque composante représente la fréquence du mot dans le pattern.Les vecteurs de mots sont soumis au modèle neuronal Self Organizing Map SOM ; la classification produite par SOM construit des synsets. Pour validation, il a fallu créer un corpus de référence (il n’en existe pas en arabe pour ce domaine) à partir d’Arabic WordNet, puis comparer la méthode GraPaVec avec Word2Vec et Glove. Le résultat montre que GraPaVec donne pour ce problème les meilleurs résultats avec une F-mesure supérieure de 25 % aux deux autres. Les classes produites seront utilisées pour créer de nouveaux synsets intégrés à Arabic WordNet / The Arabic language is poor in electronic semantic resources. Among those resources there is Arabic WordNet which is also poor in words and relationships.This thesis focuses on enriching Arabic WordNet by synsets (a synset is a set of synonymous words) taken from a large general corpus. This type of corpus does not exist in Arabic, so we had to build it, before subjecting it to a number of pretreatments.We developed, Gilles Bernard and myself, a method of word vectorization called GraPaVec which can be used here. I built a system which includes a module Add2Corpus, pretreatments, word vectorization using automatically generated frequency patterns, which yields a data matrix whose rows are the words and columns the patterns, each component representing the frequency of a word in a pattern.The word vectors are fed to the neural model Self Organizing Map (SOM) ;the classification produced constructs synsets. In order to validate the method, we had to create a gold standard corpus (there are none in Arabic for this area) from Arabic WordNet, and then compare the GraPaVec method with Word2Vec and Glove ones. The result shows that GraPaVec gives for this problem the best results with a F-measure 25 % higher than the others. The generated classes will be used to create new synsets to be included in Arabic WordNet.

Page generated in 0.045 seconds