Global ETD Search

Return to search

Bisecting Document Clustering Using Model-Based Methods

We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.

Identifer	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-2937
Date	09 December 2009
Creators	Davis, Aaron Samuel
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	Theses and Dissertations
Rights	http://lib.byu.edu/about/copyright/

Page generated in 0.0017 seconds

Bisecting Document Clustering Using Model-Based Methods

Description

Links & Downloads

Tags

Additional Fields