Topic models are useful tools for exploring large collections of text by exposing a generative process from which the text was produced. Anchor-based topic models use the anchor word assumption to define a set of algorithms with provable guarantees that recover the underlying topics with a run time practically independent of corpus size. A number of extensions to the initial anchor word-based algorithms, as well as enhancements to related models, have been proposed that improve the intrinsic characteristics of the models, making them more interpretable by humans. This thesis evaluates the improvements to human interpretability due to three such techniques: low-dimensional word embeddings combined with a regularized objective function, automatic topic merging using tandem anchors, and the use of word embeddings to synthetically increase corpus density. The results show that tandem anchors are a viable vehicle for automatic topic merging, and that using word embeddings significantly improves on the original anchor method across all measured metrics. Combining low-dimensional embeddings with a regularized objective incurs computational costs while yielding small or no improvements in the measured metrics.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-168134 |
Date | January 2020 |
Creators | Andersson, Henrik |
Publisher | Linköpings universitet, Interaktiva och kognitiva system |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |