Model-based clustering of high-dimensional binary data

We present a mixture of latent trait models with common slope parameters (MCLT) for high-dimensional binary data, a data type for which few established methods exist. Recent work on clustering of binary data, based on a d-dimensional Gaussian latent variable, is extended by implementing common factor analyzers. We extend the model further by incorporating random block effects. The dependencies in each block are taken into account through block-specific parameters that are treated as random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for estimating the model parameters. The Bayesian information criterion is used to select the number of components, the covariance structure, and the dimension of the latent variables. Our approach is demonstrated on U.S. Congressional voting data and on a data set describing the sensory properties of orange juice. These examples show that the model performs well even when the number of observations is not very large relative to the data dimensionality. In both cases, our approach yields intuitive clustering results. Additionally, our dimensionality-reduction method allows data to be displayed in low-dimensional plots.

Early Researcher Award from the Government of Ontario (McNicholas); NSERC Discovery Grants (Browne and McNicholas).

http://hdl.handle.net/10214/7458

Identifier	oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OGU.10214/7458
Date	05 September 2013
Creators	Tang, Yang
Contributors	McNicholas, Paul D.; Browne, Ryan P.
Source Sets	Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language	English
Detected Language	English
Type	Thesis
