Global ETD Search

A probabilistic framework and algorithms for modeling and analyzing multi-instance data

Multi-instance data, in which each object (e.g., a document) is a collection of instances
(e.g., word), are widespread in machine learning, signal processing, computer vision,
bioinformatic, music, and social sciences. Existing probabilistic models, e.g., latent
Dirichlet allocation (LDA), probabilistic latent semantic indexing (pLSI), and discrete
component analysis (DCA), have been developed for modeling and analyzing multiinstance
data. Such models introduce a generative process for multi-instance data which
includes a low dimensional latent structure. While such models offer a great freedom
in capturing the natural structure in the data, their inference may present challenges.
For example, the sensitivity in choosing the hyper-parameters in such models, requires
careful inference (e.g., through cross-validation) which results in large computational
complexity. The inference for fully Bayesian models which contain no hyper-parameters
often involves slowly converging sampling methods. In this work, we develop approaches
for addressing such challenges and further enhancing the utility of such models.

This dissertation demonstrates a unified convex framework for probabilistic modeling
of multi-instance data. The three main aspects of the proposed framework are as follows.
First, joint regularization is incorporated into multiple density estimation to simultaneously
learn the structure of the distribution space and infer each distribution. Second,
a novel confidence constraints framework is used to facilitate a tuning-free approach to
control the amount of regularization required for the joint multiple density estimation
with theoretical guarantees on correct structure recovery. Third, we formulate the problem
using a convex framework and propose efficient optimization algorithms to solve
it.

This work addresses the unique challenges associated with both discrete and continuous
domains. In the discrete domain we propose a confidence-constrained rank minimization
(CRM) to recover the exact number of topics in topic models with theoretical
guarantees on recovery probability and mean squared error of the estimation. We provide
a computationally efficient optimization algorithm for the problem to further the
applicability of the proposed framework to large real world datasets. In the continuous
domain, we propose to use the maximum entropy (MaxEnt) framework for multi-instance
datasets. In this approach, bags of instances are represented as distributions using the
principle of MaxEnt. We learn basis functions which span the space of distributions for
jointly regularized density estimation. The basis functions are analogous to topics in a
topic model.

We validate the efficiency of the proposed framework in the discrete and continuous
domains by extensive set of experiments on synthetic datasets as well as on real world
image and text datasets and compare the results with state-of-the-art algorithms. / Graduation date: 2013

http://hdl.handle.net/1957/35782

Multi-instance learning

Maximum entropy

Low-rank matrix recovery

Confidence-constraint

Nuclear norm minimization

Entropy (Information theory)

Machine learning -- Mathematical models

Computer algorithms

Identifer	oai:union.ndltd.org:ORGSU/oai:ir.library.oregonstate.edu:1957/35782
Date	28 November 2012
Creators	Behmardi, Behrouz
Contributors	Raich, Raviv
Source Sets	Oregon State University
Language	en_US
Detected Language	English
Type	Thesis/Dissertation

Page generated in 0.0025 seconds

A probabilistic framework and algorithms for modeling and analyzing multi-instance data

Description

Links & Downloads

Tags

Additional Fields