This thesis studies two problems in theoretical machine learning. The first
part of the thesis investigates the statistical stability of clustering
algorithms. In the second part, we study how much of an advantage access to
unlabeled data confers in classification problems.
Clustering stability has been proposed and used as a model selection method
for clustering tasks. The main idea of the method is to draw two independent
samples from a given data set and cluster each sample with the same clustering
algorithm, under the same parameter settings. If the two resulting clusterings
turn out to be close in some metric, one concludes that the clustering
algorithm and its parameter settings match the data set, and that the
clusterings obtained are meaningful. We study the asymptotic properties of
this method for certain types of cost-minimizing clustering algorithms and
relate their asymptotic stability to the number of optimal solutions of the
underlying optimization problem.
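To make the procedure concrete, here is a minimal sketch of a stability
estimate, assuming k-means as the cost-minimizing algorithm and a
permutation-matched disagreement rate as the metric between clusterings; the
function names and the use of scikit-learn's KMeans are illustrative
assumptions, not the thesis's construction.

```python
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def min_label_distance(l1, l2, k):
    """Fraction of points on which two label vectors disagree,
    minimized over relabelings (permutations) of the k clusters.
    O(k!) search, fine for small k."""
    return min(
        np.mean(l1 != np.array([perm[c] for c in l2]))
        for perm in permutations(range(k))
    )

def stability_score(data, k, n_pairs=20, seed=0):
    """Estimate the instability of k-means with k clusters: draw two
    independent subsamples, cluster each with the same algorithm and
    the same parameters, and compare the induced clusterings."""
    rng = np.random.default_rng(seed)
    n = len(data)
    distances = []
    for _ in range(n_pairs):
        # Two independent samples from the same data set.
        s1 = data[rng.choice(n, size=n // 2, replace=False)]
        s2 = data[rng.choice(n, size=n // 2, replace=False)]
        # Same algorithm, same parameter setting, on each sample.
        c1 = KMeans(n_clusters=k, n_init=10).fit(s1)
        c2 = KMeans(n_clusters=k, n_init=10).fit(s2)
        # Distance between the two clusterings, evaluated on all points.
        distances.append(min_label_distance(c1.predict(data),
                                            c2.predict(data), k))
    return float(np.mean(distances))  # low value = "stable" choice of k
```

Model selection would then pick the number of clusters k with the lowest
score; the thesis's asymptotic analysis shows that such stability reflects the
number of optimal solutions of the underlying optimization problem.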
In classification problems, labeled data are often expensive to obtain,
whereas unlabeled data are often plentiful and cheap. We study how access to
unlabeled data can reduce the amount of labeled data needed, in the worst-case
sense. We propose an extension of the probably approximately correct (PAC)
model in which this question can be studied naturally. We show that for
certain basic tasks, access to unlabeled data can, at best, halve the amount
of labeled data needed.
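As an illustration of the kind of basic task involved, consider learning a
threshold on the real line. The sketch below contrasts a purely supervised
rule with one that uses unlabeled data to place the threshold at the
probability-mass midpoint of the version-space interval; the setup (labels in
{0, 1}, positives to the right of the threshold) and both function names are
assumptions for illustration, not the thesis's construction.

```python
import numpy as np

def supervised_threshold(xs, ys):
    """ERM for thresholds: any cutoff between the largest negative
    and the smallest positive example is consistent; return an
    endpoint of that interval."""
    neg = xs[ys == 0]
    return float(neg.max()) if neg.size else float(xs.min()) - 1.0

def semi_supervised_threshold(xs, ys, unlabeled):
    """Use unlabeled points to estimate the marginal distribution and
    split the version-space interval (lo, hi) into two halves of equal
    empirical mass, roughly halving the worst-case error incurred by
    picking an endpoint."""
    neg, pos = xs[ys == 0], xs[ys == 1]
    lo = neg.max() if neg.size else -np.inf
    hi = pos.min() if pos.size else np.inf
    inside = unlabeled[(unlabeled > lo) & (unlabeled < hi)]
    # Fall back to the supervised choice if no unlabeled point lands
    # inside the version-space interval.
    if inside.size == 0:
        return supervised_threshold(xs, ys)
    return float(np.median(inside))
```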
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/4445 |
Date | 21 May 2009 |
Creators | Pal, David |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Type | Thesis or Dissertation |