101 |
A new framework for clusteringZhou, Wu January 2010 (has links)
The difficulty of clustering and the variety of clustering methods suggest the need for a theoretical study of clustering. Using the idea of a standard statistical framework, we propose a new framework for clustering.
For a well-defined clustering goal we assume that the data to be clustered come from an underlying distribution and we aim to find a high-density cluster tree. We regard this tree as a parameter of interest for the underlying distribution. However, it is not obvious how to determine a connected subset in a discrete distribution whose support is located in a Euclidean space. Building a cluster tree for such a distribution is an open problem and presents interesting conceptual and computational challenges. We solve this problem using graph-based approaches and further parameterize clustering using the high-density cluster tree and its extension.
Motivated by the connection between clustering outcomes and graphs, we propose a graph family framework. This framework plays an important role in our clustering framework. A direct application of the graph family framework is a new cluster-tree distance measure. This distance measure can be written as an inner product or kernel. It makes our clustering framework able to perform statistical assessment of clustering via simulation. Other applications such as a method for integrating partitions into a cluster tree and methods for cluster tree averaging and bagging are also derived from the graph family framework.
|
102 |
A New Measure For Clustering Model SelectionMcCrosky, Jesse January 2008 (has links)
A new method for determining the number of k-means clusters in a given data set is presented. The algorithm is developed from a theoretical perspective and then its implementation is examined and compared to existing solutions.
|
103 |
A new framework for clusteringZhou, Wu January 2010 (has links)
The difficulty of clustering and the variety of clustering methods suggest the need for a theoretical study of clustering. Using the idea of a standard statistical framework, we propose a new framework for clustering.
For a well-defined clustering goal we assume that the data to be clustered come from an underlying distribution and we aim to find a high-density cluster tree. We regard this tree as a parameter of interest for the underlying distribution. However, it is not obvious how to determine a connected subset in a discrete distribution whose support is located in a Euclidean space. Building a cluster tree for such a distribution is an open problem and presents interesting conceptual and computational challenges. We solve this problem using graph-based approaches and further parameterize clustering using the high-density cluster tree and its extension.
Motivated by the connection between clustering outcomes and graphs, we propose a graph family framework. This framework plays an important role in our clustering framework. A direct application of the graph family framework is a new cluster-tree distance measure. This distance measure can be written as an inner product or kernel. It makes our clustering framework able to perform statistical assessment of clustering via simulation. Other applications such as a method for integrating partitions into a cluster tree and methods for cluster tree averaging and bagging are also derived from the graph family framework.
|
104 |
Approach to Evaluating Clustering Using Classification Labelled DataLuu, Tuong January 2010 (has links)
Cluster analysis has been identified as a core task in data mining for which many different algorithms have been proposed. The diversity, on one hand, provides us a wide collection of tools. On the other hand, the profusion of options easily causes confusion. Given a particular task, users do not know which algorithm is good since it is not clear how clustering algorithms should be evaluated. As a consequence, users often select clustering algorithm in a very adhoc manner.
A major challenge in evaluating clustering algorithms is the scarcity of real data
with a "correct" ground truth clustering. This is in stark contrast
to the situation for classification tasks, where there
are abundantly many data sets labeled with their correct classifications. As a result, clustering research often relies on labeled data to evaluate and compare the results of clustering algorithms.
We present a new perspective on how to use labeled data for evaluating clustering algorithms, and develop an approach for comparing clustering algorithms on the basis of classification labeled data. We then use this approach to support a novel technique for choosing among clustering algorithms when no labels are available.
We use these tools to demonstrate that the utility of an algorithm depends on the specific clustering task. Investigating a set of common clustering
algorithms, we demonstrate that there are cases where each one of them
outputs better clusterings. In contrast to the current trend of looking for a superior clustering algorithm, our findings demonstrate the need for a variety of different clustering algorithms.
|
105 |
Towards Theoretical Foundations of ClusteringAckerman, Margareta January 2012 (has links)
Clustering is a central unsupervised learning task with a wide variety of applications. Unlike in supervised learning, different clustering algorithms may yield dramatically different outputs for the same input sets. As such, the choice of algorithm is crucial. When selecting a clustering algorithm, users tend to focus on cost-related considerations, such as running times, software purchasing costs, etc. Yet differences concerning the output of the algorithms are a more primal consideration. We propose an approach for selecting clustering algorithms based on differences in their input-output behaviour. This approach relies on identifying significant properties of clustering algorithms and classifying algorithms based on the properties that they satisfy.
We begin with Kleinberg's impossibility result, which relies on concise abstract properties that are well-suited for our approach. Kleinberg showed that three specific properties cannot be satisfied by the same algorithm. We illustrate that the impossibility result is a consequence of the formalism used, proving that these properties can be formulated without leading to inconsistency in the context of clustering quality measures or algorithms whose input requires the number of clusters.
Combining Kleinberg's properties with newly proposed ones, we provide an extensive property-base classification of common clustering paradigms. We use some of these properties to provide a novel characterization of the class of linkage-based algorithms. That is, we distil a small set of properties that uniquely identify this family of algorithms.
Lastly, we investigate how the output of algorithms is affected by the addition of small, potentially adversarial, sets of points. We prove that given clusterable input, the output of $k$-means is robust to the addition of a small number of data points. On the other hand, clusterings produced by many well-known methods, including linkage-based techniques, can be changed radically by adding a small number of elements.
|
106 |
A Comparison of Clustering Methods for Developing Models of User InterestGanta, Prasanth 2011 May 1900 (has links)
For open-ended information tasks, users must sift through many potentially relevant documents assessing and prioritizing them based on relevance to current
information need, a practice we refer to as document triage. Users often perform triage through their interaction with multiple applications, and to efficiently support them in this process an extensible multi-application architecture Interest Profile Manager(IPM) was developed in the prior research at Texas A & M University. IPM infers user interests from their interactions with documents, especially the interests expressed by the user through an interpretive action like assigning a visual characteristic color, coupled with the document’s content characteristics. IPM equates each specific color and application as an interest class and the main challenge for the user is to consistently maintain interest class-color scheme across applications forever which is not practical. This thesis presents a system that can help reduce potential problems caused
by these inconsistencies, by indicating when such inconsistencies have occurred in the past or are happening in the user’s current triage activity. It includes (1)a clustering algorithm to group similar triage interest instances by choosing the factors that could define the similarity of interest instances, and (2)an approach to identify sequences of user actions that provide strong evidence of user’s intent which can be
used as constraints during clustering. Constrained and unconstrained versions of three Agglomerative Hierarchical Clustering algorithms: (1)Single-Link, (2)Complete-Link, (3) UPGMA(Unweighted Pair Group Method with Arithmetic Mean) have been studied. The contribution of each of the three factors: (1)Content Similarity, (2)Temporal Similarity, and (3)Visual Similarity to the overall similarity between interest instances has also been examined. Our results indicate that the Single-Link algorithm performs better than the other two clustering algorithms while the combination of all three similarity
factors defines the similarity between two instances better than considering any single factor. The use of constraints as strong evidence about user’s intent improved the clustering efficiency of algorithms.
|
107 |
Algorithms for Gene Clustering Analysis on GenomesYi, Gang Man 2011 May 1900 (has links)
The increased availability of data in biological databases provides many opportunities for understanding biological processes through these data. As recent attention has shifted from sequence analysis to higher-level analysis of genes across multiple genomes, there is a need to develop efficient algorithms for these large-scale applications that can help us understand the functions of genes.
The overall objective of my research was to develop improved methods which can automatically assign groups of functionally related genes in large-scale data sets by applying new gene clustering algorithms. Proposed gene clustering algorithms that can help us understand gene function and genome evolution include new algorithms
for protein family classification, a window-based strategy for gene clustering on chromosomes, and an exhaustive strategy that allows all clusters of small size to be enumerated. I investigate the problems of gene clustering in multiple genomes, and define gene clustering problems using mathematical methodology and solve the problems by developing efficient and effective algorithms.
For protein family classification, I developed two supervised classification algorithms that can assign proteins to existing protein families in public databases and, by taking into account similarities between the unclassified proteins, allows for progressive construction of new families from proteins that cannot be assigned. This approach is useful for rapid assignment of protein sequences from genome sequencing projects to protein families. A comparative analysis of the method to other previously developed methods shows that the algorithm has a higher accuracy rate and lower mis-classification rate when compared to algorithms that are based on the use of multiple sequence alignments and hidden Markov models. The proposed algorithm performs well even on families with very few proteins and on families with low sequence similarity.
Apart from the analysis of individual sequences, identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs serve as evidence used in function prediction, operon detection, etc. Thus, reliable identification of gene clusters is critical to functional annotation and analysis of genes. I developed an efficient gene clustering algorithm that can be applied on hundreds of genomes at the same time. This approach allows for large-scale study of evolutionary relationships
of gene clusters and study of operon formation and destruction. By placing a stricter limit on the maximum cluster size, I developed another algorithm that uses a different formulation based on constraining the overall size of a cluster and statistical estimates that allow direct comparisons of clusters of different size. A comparative analysis of proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.
|
108 |
Social-Economy Approach toward Social Capital, Trust and Industrial ClusteringHung, Chia-Jia 16 October 2004 (has links)
None
|
109 |
A kernel-based fuzzy clustering algorithm and its application in classificationWang, Jiun-hau 25 July 2006 (has links)
In this paper, we purpose a kernel-based fuzzy clustering algorithm to cluster data patterns in the feature space. Our method uses kernel functions to project data from the original space into a high dimensional feature space, and data are divided into groups though their similarities in the feature space with an incremental clustering approach. After clustering, data patterns of the same cluster in the feature space are then grouped with an arbitrarily shaped boundary in the original space. As a result, clusters with arbitrary shapes are discovered in the original space. Clustering, which can be taken as unsupervised classification, has also been utilized in resolving classification problems. So, we extend our method to process the classification problems. By working in the high dimensional feature space where the data are expected to more separable, we can discover the inner structure of the data distribution. Therefore, our method has the advantage of dealing with new incoming data pattern efficiently. The effectiveness of our method is demonstrated in the experiment.
|
110 |
Nonparametric Bayesian analysis of some clustering problemsRay, Shubhankar 30 October 2006 (has links)
Nonparametric Bayesian models have been researched extensively in the past 10 years
following the work of Escobar and West (1995) on sampling schemes for Dirichlet processes.
The infinite mixture representation of the Dirichlet process makes it useful
for clustering problems where the number of clusters is unknown. We develop nonparametric
Bayesian models for two different clustering problems, namely functional
and graphical clustering.
We propose a nonparametric Bayes wavelet model for clustering of functional or
longitudinal data. The wavelet modelling is aimed at the resolution of global and
local features during clustering. The model also allows the elicitation of prior belief
about the regularity of the functions and has the ability to adapt to a wide range
of functional regularity. Posterior inference is carried out by Gibbs sampling with
conjugate priors for fast computation. We use simulated as well as real datasets to
illustrate the suitability of the approach over other alternatives.
The functional clustering model is extended to analyze splice microarray data.
New microarray technologies probe consecutive segments along genes to observe alternative
splicing (AS) mechanisms that produce multiple proteins from a single gene.
Clues regarding the number of splice forms can be obtained by clustering the functional
expression profiles from different tissues. The analysis was carried out on the Rosetta dataset (Johnson et al., 2003) to obtain a splice variant by tissue distribution
for all the 10,000 genes. We were able to identify a number of splice forms that appear
to be unique to cancer.
We propose a Bayesian model for partitioning graphs depicting dependencies
in a collection of objects. After suitable transformations and modelling techniques,
the problem of graph cutting can be approached by nonparametric Bayes clustering.
We draw motivation from a recent work (Dhillon, 2001) showing the equivalence of
kernel k-means clustering and certain graph cutting algorithms. It is shown that
loss functions similar to the kernel k-means naturally arise in this model, and the
minimization of associated posterior risk comprises an effective graph cutting strategy.
We present here results from the analysis of two microarray datasets, namely the
melanoma dataset (Bittner et al., 2000) and the sarcoma dataset (Nykter et al.,
2006).
|
Page generated in 0.1011 seconds