Over the past few years, there has been a renewed interest in the consensus
problem for ensembles of partitions. Recent work is primarily motivated by the
developments in the area of combining multiple supervised learners. Unlike the
consensus of supervised classifications, the consensus of data partitions is a
challenging problem due to the lack of globally defined cluster labels and to
the inherent difficulty of data clustering as an unsupervised learning problem.
Moreover, the true number of clusters may be unknown. A fundamental goal of
consensus methods for partitions is to obtain an optimal summary of an ensemble
and to discover a cluster structure with accuracy and robustness exceeding those
of the individual ensemble partitions.
The quality of the consensus partitions highly depends on the ensemble
generation mechanism and on the suitability of the consensus method for
combining the generated ensemble. Typically, consensus methods derive an
ensemble representation that is used as the basis for extracting the consensus
partition. Most ensemble representations circumvent the labeling problem. On
the other hand, voting-based methods establish direct parallels with consensus
methods for supervised classifications, by seeking an optimal relabeling of the
ensemble partitions and deriving an ensemble representation consisting of a
central aggregated partition. An important element of the voting-based
aggregation problem is the pairwise relabeling of an ensemble partition with
respect to a representative partition of the ensemble, which is refered to here
as the voting problem. The voting problem is commonly formulated as a weighted
bipartite matching problem.
In this dissertation, a general theoretical framework for the voting problem as
a multi-response regression problem is proposed. The problem is formulated as
seeking to estimate the uncertainties associated with the assignments of the
objects to the representative clusters, given their assignments to the clusters
of an ensemble partition. A new voting scheme, referred to as cumulative voting,
is derived as a special instance of the proposed regression formulation
corresponding to fitting a linear model by least squares estimation. The
proposed formulation reveals the close relationships between the underlying loss
functions of the cumulative voting and bipartite matching schemes. A useful
feature of the proposed framework is that it can be applied to model substantial
variability between partitions, such as a variable number of clusters.
A general aggregation algorithm with variants corresponding to
cumulative voting and bipartite matching is applied and a simulation-based
analysis is presented to compare the suitability of each scheme to different
ensemble generation mechanisms. The bipartite matching is found to be more
suitable than cumulative voting for a particular generation model, whereby each
ensemble partition is generated as a noisy permutation of an underlying
labeling, according to a probability of error. For ensembles with a variable
number of clusters, it is proposed that the aggregated partition be viewed as an
estimated distributional representation of the ensemble, on the basis of which,
a criterion may be defined to seek an optimally compressed consensus partition.
The properties and features of the proposed cumulative voting scheme are
studied. In particular, the relationship between cumulative voting and the
well-known co-association matrix is highlighted. Furthermore, an adaptive
aggregation algorithm that is suited for the cumulative voting scheme is
proposed. The algorithm aims at selecting the initial reference partition and
the aggregation sequence of the ensemble partitions the loss of mutual
information associated with the aggregated partition is minimized. In order to
subsequently extract the final consensus partition, an efficient agglomerative
algorithm is developed. The algorithm merges the aggregated clusters such that
the maximum amount of information is preserved. Furthermore, it allows the
optimal number of consensus clusters to be estimated.
An empirical study using several artificial and real-world datasets demonstrates
that the proposed cumulative voting scheme leads to discovering substantially
more accurate consensus partitions compared to bipartite matching, in the case
of ensembles with a relatively large or a variable number of clusters. Compared
to other recent consensus methods, the proposed method is found to be comparable
with or better than the best performing methods. Moreover, accurate estimates of
the true number of clusters are often achieved using cumulative voting, whereas
consistently poor estimates are achieved based on bipartite matching. The
empirical evidence demonstrates that the bipartite matching scheme is not
suitable for these types of ensembles.
Identifer | oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/3934 |
Date | 08 1900 |
Creators | Ayad, Hanan |
Source Sets | University of Waterloo Electronic Theses Repository |
Language | English |
Detected Language | English |
Type | Thesis or Dissertation |
Page generated in 0.0022 seconds