71 |
Cooperative Clustering Model and Its Applications
Kashef, Rasha. January 2008
Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, and pattern recognition, where there is a need to learn the inherent grouping structure of data in an unsupervised manner. Many clustering approaches have been proposed in the literature, with different quality/complexity tradeoffs. Each clustering algorithm works within its own domain space, and none is optimal for all datasets of different properties, sizes, structures, and distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches.
We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques. The goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation, using two data structures: a cooperative contingency graph and a histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurring objects from the different clusterings enables the cooperative model to group objects based on agreement among the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms offers a new way of grouping objects into more homogeneous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of adopted clustering approaches.
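The agreement step described above can be illustrated with a small sketch. It assumes each clustering is given as a plain list of labels (one per object) and treats a sub-cluster as the set of objects that every clustering places together. The function name `agreement_subclusters` and the toy labels are hypothetical, and the thesis's contingency graph and histogram-based merging are not reproduced here.

```python
from collections import defaultdict

def agreement_subclusters(*labelings):
    """Group object indices that every clustering places together.

    Each labeling is a sequence of cluster labels, one per object.
    Two objects share a sub-cluster only if all labelings assign
    them to the same cluster.
    """
    if not labelings:
        return []
    n = len(labelings[0])
    if any(len(lab) != n for lab in labelings):
        raise ValueError("all labelings must cover the same objects")
    groups = defaultdict(list)
    for i in range(n):
        # The tuple of labels an object receives identifies its sub-cluster.
        key = tuple(lab[i] for lab in labelings)
        groups[key].append(i)
    return sorted(groups.values())

# Hypothetical outputs of two clustering runs on six objects.
km_labels = [0, 0, 0, 1, 1, 1]
pam_labels = [0, 0, 1, 1, 1, 1]
subclusters = agreement_subclusters(km_labels, pam_labels)
print(subclusters)
```

Object 2 ends up alone because the two runs disagree about it; a merging step such as the one the abstract describes would then decide which cooperative cluster it joins.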
To deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm applies the cooperation methodology for better detection of outliers in data. The new detection approach is designed in four phases: (1) Global Non-cooperative Clustering, (2) Cooperative Clustering, (3) Possible Outliers Detection, and (4) Candidate Outliers Detection. Outliers are detected in a bottom-up fashion.
The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is extracting a global model that summarizes the clustering solutions generated at all nodes, so that the clustering quality of the distributed dataset can be interpreted as if the data were located at one node. We developed a distributed cooperative model and architecture that work on a two-tier super-peer P2P network. The model is called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). It aims at producing one clustering solution across the whole network. It specifically addresses scalability with network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model.
Three clustering algorithms, k-means (KM), Bisecting k-means (BKM), and Partitioning Around Medoids (PAM), are invoked in the cooperative model. Results on various gene expression and text document datasets with different properties, configurations, and degrees of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to that of the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in data with better accuracy than contemporary approaches; and (iii) the distributed cooperative model attains the same or even better quality than the centralized approach and achieves decent speedup as the number of nodes increases. The distributed model offers a high degree of flexibility, scalability, and interpretability for large distributed repositories. Achieving the same results using current methodologies requires first pooling the data at one central location, which is sometimes not feasible.
|
72 |
A Theoretical Study of Clusterability and Clustering Quality
Ackerman, Margareta. January 2007
Clustering is a widely used technique, with applications ranging
from data mining, bioinformatics and image analysis to marketing,
psychology, and city planning. Despite the practical importance of
clustering, there is very limited theoretical analysis of the topic.
We take a step towards building theoretical foundations for
clustering by carrying out an abstract analysis of two central
concepts in clustering: clusterability and clustering quality.
We compare a number of notions of clusterability found in the
literature. While all these notions attempt to measure the same
property, and all appear to be reasonable, we show that they are
pairwise inconsistent. In addition, we give the first computational
complexity analysis of a few notions of clusterability.
In the second part of the thesis, we discuss how the quality of a
given clustering can be defined (and measured). Users often need to
compare the quality of clusterings obtained by different methods.
Perhaps more importantly, users need to determine whether a given
clustering is sufficiently good for being used in further data
mining analysis. We analyze what a measure of clustering quality
should look like. We do that by introducing a set of requirements
('axioms') of clustering quality measures. We propose a number of
clustering quality measures that satisfy these requirements.
|
74 |
Evaluating Clusterings by Estimating Clarity
Whissell, John. January 2012
In this thesis I examine clustering evaluation, with a subfocus on text clusterings specifically. The principal work
of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness.
I begin by reviewing clustering in general. I then review current clustering
quality measures, accompanying this with an in-depth discussion of many of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that show problems with standard clustering evaluation practices.
I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and works empirically. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings which lead to superior spam filters when few true labels are available.
I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future.
|
75 |
An Efficient Parameter-Relationship-Based Approach for Projected Clustering
Huang, Tsun-Kuei. 16 June 2008
The clustering problem has been discussed extensively in the database literature as a tool for many applications, for example bioinformatics. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high-dimensional data, however, many of the dimensions are often irrelevant, which motivates projected clustering. A projected cluster is a subset C of data points together with a subset D of dimensions such that the points in C are closely clustered in the subspace of dimensions D. Many algorithms have been proposed to find projected clusters, and most fall into one of three categories: partitioning, density-based, and hierarchical. The DOC algorithm is a well-known density-based algorithm for projected clustering. It uses a Monte Carlo algorithm to compute projected clusters iteratively and proposes a formula for the quality of a cluster. The FPC algorithm extends the DOC algorithm by using large-itemset mining to find the dimensions of a projected cluster. Finding large itemsets is the main goal of mining association rules, where a large itemset is a combination of items whose number of occurrences in the dataset is greater than a given threshold. Although the FPC algorithm uses large-itemset mining to speed up the search for projected clusters, it still needs many user-specified parameters. Moreover, in its first step the FPC algorithm chooses the medoid by repeated random selection, which takes a long time and may still find a bad medoid. Furthermore, the cluster quality calculation can be refined by taking the weights of the dimensions into account. Therefore, in this thesis, we propose an algorithm that addresses these disadvantages. First, we observe the relationship between the parameters and propose a parameter-relationship-based algorithm that needs only two parameters, instead of the three required by most projected clustering algorithms. Next, our algorithm chooses the medoid using the median; the medoid is chosen only once, and the quality of our cluster is better than that of the FPC algorithm. Finally, our quality measure considers the weight of each dimension of the cluster and gives different values according to the number of occurrences of the dimensions. This formula makes the quality of projected clustering under our algorithm better than that of the FPC algorithm, and it prevents a cluster from containing too many irrelevant dimensions. Our simulation results show that our algorithm outperforms the FPC algorithm in terms of both execution time and clustering quality.
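To make the definition of a projected cluster (points C, dimensions D) concrete, here is a minimal sketch. It assumes a medoid and a small sample of points are already given, and it uses a box of half-width `width` around the medoid as the notion of "closely clustered". The helper names, the sample, and the width are illustrative assumptions; this is not the DOC, FPC, or proposed algorithm, and it leaves out medoid selection and the quality formula entirely.

```python
def relevant_dimensions(medoid, sample, width):
    """Dimensions in which every sampled point lies within `width` of the medoid."""
    return [d for d in range(len(medoid))
            if all(abs(q[d] - medoid[d]) <= width for q in sample)]

def projected_cluster(points, medoid, dims, width):
    """Indices of points within `width` of the medoid in every dimension of `dims`."""
    return [i for i, q in enumerate(points)
            if all(abs(q[d] - medoid[d]) <= width for d in dims)]

# Toy data: dimensions 0 and 2 are tight for the first three points,
# dimension 1 is irrelevant noise, and the last point is far away.
points = [
    (1.0, 5.0, 0.2),
    (1.1, 9.0, 0.3),
    (0.9, 1.0, 0.1),
    (5.0, 5.0, 4.0),
]
medoid = points[0]
dims = relevant_dimensions(medoid, sample=points[1:3], width=0.5)
members = projected_cluster(points, medoid, dims, width=0.5)
print(dims, members)
```

A full-dimensional distance would separate the first three points because of dimension 1, whereas restricting attention to D groups them together.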
|
76 |
Development of a hierarchical k-selecting clustering algorithm – application to allergy
Malm, Patrik. January 2007
The objective of this Master's thesis was to develop, implement, and evaluate an iterative procedure for hierarchical clustering that has good overall performance and merges features of certain previously described algorithms into a single integrated package. The resulting tool was then applied to an allergen IgE-reactivity data set. The implemented algorithm uses a hierarchical approach that illustrates the emergence of patterns in the data. At each level of the hierarchical tree, a partitional clustering method is used to divide the data into k groups, where the number k is decided by applying cluster validation techniques. The cross-reactivity analysis performed with the new algorithm largely arrives at the anticipated cluster formations in the allergen data, which strengthens results obtained in previous studies on the subject. Notably, though, certain unexpected findings presented in the earlier analysis were aggregated differently by the new clustering package, and more in line with phylogenetic and protein family relationships.
|
77 |
Modern aspects of unsupervised learning
Liang, Yingyu. 27 August 2014
Unsupervised learning has become more and more important due to the recent explosion of data. Clustering, a key topic in unsupervised learning, is a well-studied task arising in many applications ranging from computer vision to computational biology to the social sciences. This thesis is a collection of work exploring two modern aspects of clustering: stability and scalability.
In the first part, we study clustering under a stability property called perturbation resilience. As an alternative approach to worst case analysis, this novel theoretical framework aims at understanding the complexity of clustering instances that satisfy natural stability assumptions. In particular, we show how to correctly cluster instances whose optimal solutions are resilient to small multiplicative perturbations on the distances between data points, significantly improving existing guarantees. We further propose a generalized property that allows small changes in the optimal solutions after perturbations, and provide the first known positive results in this more challenging setting.
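For readers unfamiliar with the term, the notion of resilience to multiplicative perturbations is commonly formalized in the literature as below. This is a reference sketch in notation chosen here ($X$, $d$, $d'$, $\alpha$); the abstract does not state the definition, and the thesis may use a variant.

```latex
\textbf{Definition ($\alpha$-perturbation resilience).}
Let $(X, d)$ be a clustering instance and $\alpha \ge 1$. A function
$d' : X \times X \to \mathbb{R}_{\ge 0}$ is an $\alpha$-perturbation of $d$ if
\[
  d(x, y) \;\le\; d'(x, y) \;\le\; \alpha \, d(x, y)
  \qquad \text{for all } x, y \in X .
\]
The instance $(X, d)$ is $\alpha$-perturbation resilient for a given
objective if, for every $\alpha$-perturbation $d'$, the optimal clustering
of $(X, d')$ is unique and equals the optimal clustering of $(X, d)$.
```

The generalized property mentioned in the abstract relaxes the last condition so that the optimal clustering may change slightly after the perturbation.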
In the second part, we study the problem of clustering large scale data distributed across nodes which communicate over the edges of a connected graph. We provide algorithms with small communication cost and provable guarantees on the clustering quality. We also propose algorithms for distributed principal component analysis, which can be used to reduce the communication cost of clustering high dimensional data while only slightly compromising the clustering quality.
In the third part, we study community detection, the modern extension of clustering to network data. We propose a theoretical model of communities that are stable in the presence of noisy nodes in the network, and design an algorithm that provably detects all such communities. We also provide a local algorithm for large scale networks, whose running time depends on the sizes of the output communities but not that of the entire network.
|
78 |
Cooperative Based Software Clustering on Dependency Graphs
Ibrahim, Ahmed Fakhri. 18 June 2014
The organization of software systems into subsystems is usually based on the
constructs of packages or modules and has a major impact on the maintainability of
the software. However, during software evolution, the organization of the system is
subject to continual modification, which can cause it to drift away from the original
design, often with the effect of reducing its quality.
A number of techniques for evaluating a system's maintainability and for controlling
the effort required to conduct maintenance activities involve software clustering.
Software clustering refers to the partitioning of software system components
into clusters in order to obtain both exterior and interior connectivity between these
components. It helps maintainers enhance the quality of software modularization
and improve its maintainability.
Research in this area has produced numerous algorithms with a variety of
methodologies and parameters. This thesis presents a novel ensemble approach
that synthesizes a new solution from the outcomes of multiple constituent clustering
algorithms. The main principle behind this approach is derived from machine
learning, as applied to document clustering, but it has been modified, both conceptually
and empirically, for use in software clustering. The conceptual modifications
include working with a variable number of clusters produced by the input algorithms
and employing graph structures rather than feature vectors. The empirical
modifications include experiments directed at the selection of the optimal cluster merging criteria. Case studies based on open source software systems show that
establishing cooperation between leading state-of-the-art algorithms produces better
clustering results compared with those achieved using only one of any of the
algorithms considered.
|
80 |
Estudo e desenvolvimento de algoritmos para agrupamento fuzzy de dados em cenários centralizados e distribuídos / Study and development of fuzzy clustering algorithms in centralized and distributed scenarios
Lucas Vendramin. 05 July 2012
Data clustering is a fundamental problem in data mining: it consists of partitioning the data into groups such that the objects in a group are more similar (or related) to one another than to the objects in other groups. Traditional algorithms assume that each object belongs exclusively to a single cluster. This assumption is not realistic in many practical applications, in which groups of objects present statistical distributions with some degree of overlap. Fuzzy clustering algorithms can deal naturally with such problems. The literature on fuzzy clustering is extensive: many algorithms with different characteristics and purposes have been proposed and investigated, and each is more (or less) suitable for specific scenarios, e.g., finding clusters with different shapes or working with data sets described by attributes of different types. Additionally, there are scenarios in which the data are (or can be) distributed among different sites. In these scenarios, the goal of a clustering algorithm is to find a structure that describes the data at the different sites without transmitting the data to, or storing and processing them at, a central location. Such algorithms are known as distributed clustering algorithms. This work studies and improves centralized and distributed fuzzy clustering algorithms from the literature, identifying their main characteristics, advantages, disadvantages, and the scenarios best suited to each, and includes time, space, and communication complexity analyses for the distributed algorithms.
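As an illustration of how a fuzzy algorithm lets an object belong to several clusters at once, the sketch below computes membership degrees using the standard fuzzy c-means membership formula. Fuzzy c-means is chosen here only as a widely known example; the abstract does not name the algorithms studied. The function name, the fuzzifier `m` (which must be greater than 1), and the toy centers are assumptions.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership degrees of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster
    # (shared equally if several centers coincide).
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the degrees sum to 1.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]

centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers)
print(u)
```

With these toy centers the point is three times closer to the first center, so it receives a membership of about 0.9 there and about 0.1 in the second cluster rather than a hard assignment.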
|