Detecting topics by extracting keywords from written text using TF-IDF has been studied and used successfully in many applications. Adding a semantic layer to TF-IDF-based topic detection using WordNet synonyms and hypernyms has been explored in document clustering, either by assigning concepts that describe texts or by adding all synonyms and hypernyms of the occurring words to a list of keywords. This paper explores a new method in which TF-IDF scores are calculated and the TF-IDF scores of WordNet synset members are added together for all occurring synonyms and/or hypernyms. The approach is evaluated by comparing keywords extracted using TF-IDF and the proposed method, SynPlusTF-IDF, against manually assigned keywords in a database of scientific abstracts. As topic detection is widely used in many contexts and applications, improving current methods is of great value, since the methods can become more accurate at extracting correct and relevant keywords from written text. An experiment was conducted comparing the two methods, and their accuracy was measured using precision and recall and by calculating F1-scores. The F1-scores ranged from 0.11131 to 0.14264 for different variables. The results show that SynPlusTF-IDF is not better at topic detection than TF-IDF, and that both methods performed poorly at topic detection with the chosen dataset.
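The idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy documents, and the hand-written synonym group (standing in for a WordNet synset lookup) are all assumptions made for illustration, and the pooling rule is one plausible reading of "synset members' TF-IDF scores are added together".

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return scores


def pool_synonyms(scores, synonym_groups):
    """Assumed pooling rule: every occurring member of a synonym group
    receives the sum of the group's TF-IDF scores."""
    pooled = dict(scores)
    for group in synonym_groups:
        present = [t for t in group if t in scores]
        total = sum(scores[t] for t in present)
        for t in present:
            pooled[t] = total
    return pooled


def top_k(scores, k):
    """Pick the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def precision_recall_f1(extracted, gold):
    """Compare extracted keywords against manually assigned ones."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy data (hypothetical): three tiny "abstracts" and one synonym group.
docs = [
    ["car", "automobile", "engine", "road"],
    ["road", "map", "city"],
    ["engine", "fuel", "city"],
]
synonym_groups = [{"car", "automobile"}]  # stand-in for a WordNet synset

plain = tf_idf(docs)[0]
pooled = pool_synonyms(plain, synonym_groups)
keywords = top_k(pooled, 2)
print(keywords, precision_recall_f1(keywords, ["car", "engine"]))
```

In this toy case the pooled scores of "car" and "automobile" are each twice their plain TF-IDF score, so both synonyms crowd into the top two keywords. That shows how pooling can promote a concept while also filling the keyword list with near-duplicates, which is one way such a method might fail to improve F1.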
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-97745 |
Date | January 2020 |
Creators | Wargärde, Nicko |
Publisher | Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM) |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |