With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. In order to easily determine what articles to recommend, they can be categorized by the topics they contain. One approach of categorizing articles is using Machine Learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of for example text articles. An extension of LDA is hierarchical Latent Dirichlet Allocation (hLDA) which is an hierarchical variant of LDA. In hLDA, the latent topics found among a set of articles are structured hierarchically in a tree. Each node represents a topic, and the levels represent different levels of abstraction in the topics. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics are added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by a hLDA model by making the process semi-supervised. The aim of this thesis is to create a hLDA and a constrained hLDA model from a dataset of articles provided by Opera. The models should then be evaluated using the novel metric word frequency similarity, which is a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate if the topics are too dissimilar, so that the topics seem unrelated and perhaps should not be connected in the hierarchy. The results also show that the two topic models created had comparable word frequency similarity scores. None of the models seemed to significantly outperform the other with regard to the metric.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-167080 |
Date | January 2020 |
Creators | Lindgren, Jennifer |
Publisher | Linköpings universitet, Institutionen för datavetenskap |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0093 seconds