Global ETD Search

Evaluating Clusterings by Estimating Clarity

In this thesis I examine clustering evaluation, with a subfocus on text clusterings specifically. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness.

I begin by reviewing clustering in general. I then review current clustering quality measures, accompanying this with an in-depth discussion of many of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that show problems with standard clustering evaluation practices.

I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and empirically effective. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings which lead to superior spam filters when few true labels are available.

I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future.

http://hdl.handle.net/10012/7103

clustering

evaluating clustering

cluster validation

cluster analysis

Computer Science

Identifier	oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/7103
Date	January 2012
Creators	Whissell, John
Source Sets	University of Waterloo Electronic Theses Repository
Language	English
Detected Language	English
Type	Thesis or Dissertation
