Global ETD Search

Return to search

Infinite-word topic models for digital media

Digital media collections hold an unprecedented source of knowledge and data about the world. Yet, even at current scales, the data exceeds by many orders of magnitude the amount a single user could browse through in an entire lifetime. Making use of such data requires computational tools that can index, search over, and organize media documents in ways that are meaningful to human users, based on the meaning of their content. This dissertation develops an automated approach to analyzing digital media content based on topic models. Its primary contribution, the Infinite-Word Topic Model (IWTM), helps extend topic modeling to digital media domains by removing model assumptions that do not make sense for them -- in particular, the assumption that documents are composed of discrete, mutually-exclusive words from a fixed-size vocabulary. While conventional topic models like Latent Dirichlet Allocation (LDA) require that media documents be converted into bags of words, IWTM incorporates clustering into its probabilistic model and treats the vocabulary size as a random quantity to be inferred based on the data. Among its other benefits, IWTM achieves better performance than LDA while automating the selection of the vocabulary size. This dissertation contributes fast, scalable variational inference methods for IWTM that allow the model to be applied to large datasets. Furthermore, it introduces a new method, Incremental Variational Inference (IVI), for training IWTM and other Bayesian non-parametric models efficiently on growing datasets. IVI allows such models to grow in complexity as the dataset grows, as their priors state that they should. Finally, building on IVI, an active learning method for topic models is developed that intelligently samples new data, resulting in models that train faster, achieve higher performance, and use smaller amounts of labeled data. / text

http://hdl.handle.net/2152/24968

Machine learning

Topic models

Variational inference

Bayesian nonparametrics

Identifer	oai:union.ndltd.org:UTEXAS/oai:repositories.lib.utexas.edu:2152/24968
Date	02 July 2014
Creators	Waters, Austin Severn
Source Sets	University of Texas
Language	English
Detected Language	English
Type	Thesis
Format	application/pdf

Page generated in 0.0021 seconds

Infinite-word topic models for digital media

Description

Links & Downloads

Tags

Additional Fields