Evaluating Hierarchical LDA Topic Models for Article Categorization

With the vast amount of information available on the Internet today, helping users find relevant content has become a priority in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. To make it easier to determine which articles to recommend, articles can be categorized by the topics they contain. One approach to categorizing articles is to use Machine Learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of, for example, text articles. An extension of LDA is hierarchical Latent Dirichlet Allocation (hLDA), in which the latent topics found among a set of articles are structured hierarchically in a tree. Each node represents a topic, and the levels of the tree represent different levels of abstraction in the topics. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics is added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by an hLDA model by making the process semi-supervised. The aim of this thesis is to create an hLDA model and a constrained hLDA model from a dataset of articles provided by Opera. The models are then evaluated using the novel metric word frequency similarity, which is a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate whether the topics are too dissimilar, so that they seem unrelated and perhaps should not be connected in the hierarchy.
The results also show that the two topic models created had comparable word frequency similarity scores. Neither model seemed to significantly outperform the other with regard to the metric.
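
The abstract does not give the formula for word frequency similarity, so the following Python sketch is only an illustrative stand-in and not the thesis's definition. It assumes each topic is represented by a list of its top words and compares a parent and a child topic using cosine similarity over word counts. The function name `frequency_similarity` and the example word lists are hypothetical.

```python
import math
from collections import Counter


def frequency_similarity(parent_words, child_words):
    """Cosine similarity between the word counts of two topics.

    Hypothetical stand-in for a parent-child similarity score;
    the thesis's actual metric may be defined differently.
    """
    parent, child = Counter(parent_words), Counter(child_words)
    # Only words that occur in both topics contribute to the dot product.
    dot = sum(parent[w] * child[w] for w in parent.keys() & child.keys())
    norm = (math.sqrt(sum(c * c for c in parent.values()))
            * math.sqrt(sum(c * c for c in child.values())))
    return dot / norm if norm else 0.0


# Made-up top-word lists, for illustration only.
parent_topic = ["sport", "team", "game", "season"]
child_topic = ["football", "team", "game", "goal"]
unrelated_topic = ["election", "vote", "party", "minister"]

print(frequency_similarity(parent_topic, child_topic))
print(frequency_similarity(parent_topic, unrelated_topic))
print(frequency_similarity(parent_topic, parent_topic))
```

Under this stand-in, a score near 1 would suggest that a child topic largely repeats its parent, while a score near 0 would suggest that the two topics are unrelated. Both cases correspond to the problems the abstract says the metric can reveal.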

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-167080
Date: January 2020
Creators: Lindgren, Jennifer
Publisher: Linköpings universitet, Institutionen för datavetenskap
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
