Global ETD Search

Latent Dirichlet Allocation in R

Topic models are a new research area within information retrieval and text mining, two subfields of computer science. They are generative probabilistic models of text corpora, inferred by machine learning, and can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes.
This thesis focuses on LDA's practical application. Its main goal is to replicate the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process is fully documented and commented: extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, and presentation of the results. The outcome closely matches the analyses of the original paper, so the research by Griffiths and Steyvers can be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA. (author's abstract) / Series: Theses / Institute for Statistics and Mathematics
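The steps named in the abstract (preprocessing, document-term matrix, model estimation, presentation of results) can be sketched in a few lines of R. This is a minimal illustration and not the thesis's own code: the four toy documents, the choice of k = 2, the seed, and the short Gibbs run are all assumptions made for the example, and the tm package is assumed for preprocessing. The hyperparameters alpha = 50/k and delta = 0.1 follow the values reported by Griffiths and Steyvers.

```r
library(tm)           # corpus handling and document-term matrix
library(topicmodels)  # LDA() by Grün and Hornik

# Hypothetical toy corpus standing in for the PNAS abstracts.
docs <- c(
  "protein structure binding protein sequence folding",
  "gene expression protein sequence genome binding",
  "neural network learning memory cortex neurons",
  "neurons cortex visual learning memory network"
)

# Preprocessing and transformation into a document-term matrix.
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
)

# LDA cannot handle documents with no remaining terms, so drop empty rows.
dtm <- dtm[slam::row_sums(dtm) > 0, ]

# Model estimation by Gibbs sampling (k, seed, and run length are assumptions).
k <- 2
fit <- LDA(
  dtm, k = k, method = "Gibbs",
  control = list(seed = 2004, burnin = 100, iter = 200,
                 alpha = 50 / k, delta = 0.1)
)

# Presentation of results: top terms per topic, most likely topic per document.
top_terms <- terms(fit, 3)
doc_topics <- topics(fit)
print(top_terms)
print(doc_topics)
```

For model selection, the same call would be repeated over a range of k values and the fitted models compared, for example by their log-likelihood; the exact criterion used in the thesis is not stated in this record.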

http://epub.wu.ac.at/3558/1/main.pdf

Identifier	oai:union.ndltd.org:VIENNA/oai:epub.wu-wien.ac.at:3558
Date	05 1900
Creators	Ponweiser, Martin
Publisher	WU Vienna University of Economics and Business
Source Sets	Wirtschaftsuniversität Wien
Language	English
Detected Language	English
Type	Paper, NonPeerReviewed
Format	application/pdf
Relation	http://epub.wu.ac.at/3558/
