
Document Clustering with Dual Supervision

Nowadays, academic researchers maintain a personal library of papers, which they would like
to organize based on their needs, e.g., research, projects, or courseware. Clustering techniques
are often employed to achieve this goal by grouping the document collection into different
topics. Unsupervised clustering does not require any user effort but only produces one universal
output with which users may not be satisfied. Therefore, document clustering needs user input
for guidance to generate personalized clusters for different users. Semi-supervised clustering
incorporates prior information and has the potential to produce customized clusters. Traditional
semi-supervised clustering is based on user supervision in the form of labeled instances or
pairwise instance constraints. However, alternative forms of user supervision exist such as
labeling features. For document clustering, document supervision involves labeling documents
while feature supervision involves labeling features. Their joint use has been called dual
supervision. In this thesis, we first propose a framework that uses feature supervision
for interactive feature selection, in which the user indicates whether a feature is useful for clustering.
Second, we enhance semi-supervised clustering with feature supervision using feature
reweighting. Third, we propose a unified framework to combine document supervision and
feature supervision through seeding. The newly proposed algorithms are evaluated using oracles
and shown to produce clusters that better match a single user's point of view than document
clustering without any supervision or with only document supervision.
Finally, we conduct a user study to confirm that different users have different understandings of
the same document collection and prefer personalized clusters. At the same time, we demonstrate
that document clustering with dual supervision is able to produce good personalized clusters
even with noisy user input. Dual supervision is also shown to be more effective for
personalized clustering than no supervision or either form of supervision alone. We also analyze users'
behaviors during the user study and present suggestions for the design of document management
software.

Identifier: oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:NSHD.ca#10222/15056
Date: 19 June 2012
Creators: Hu, Yeming
Source Sets: Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language: English
Detected Language: English
