311 |
Scalable algorithms for correlation clustering on large graphsCordner, Nathan 01 October 2024 (has links)
Correlation clustering (CC) is a widely-used clustering paradigm, where objects are represented as graph nodes and clustering is performed based on relationships between objects (positive or negative edges between pairs of nodes). The CC objective is to obtain a graph clustering that minimizes the number of incorrectly assigned edges (negative edges within clusters, and positive edges between clusters).
Many of the state-of-the-art algorithms for solving correlation clustering rely on subroutines that cause significant memory and run time bottlenecks when applied to larger graphs. Several algorithms with the best theoretical guarantees for clustering quality need to first solve a relatively large linear program; others perform brute-force searches over sizeable sets, or store large amounts of unnecessary information. Because of these issues, algorithms that run quicker (e.g. in linear time) but have lower quality approximation guarantees have still remained popular.
In this thesis we examine three such popular linear time CC algorithms: Pivot, Vote, and LocalSearch. For the general CC problem we show that these algorithms perform well against slower state-of-the-art algorithms; we also develop a lightweight InnerLocalSearch method that runs much faster and delivers nearly the same quality of results as the full LocalSearch. We adapt Pivot, Vote, and LocalSearch for two constrained CC variants (limited cluster sizes, and limited total number of clusters), and show their practicality when compared against slower algorithms with better approximation guarantees. Finally, we give two practical run time improvements for applying CC algorithms to the related consensus clustering problem.
|
312 |
Dimension Reduction and Clustering for Interactive Visual AnalyticsWenskovitch Jr, John Edward 06 September 2019 (has links)
When exploring large, high-dimensional datasets, analysts often utilize two techniques for reducing the data to make exploration more tractable. The first technique, dimension reduction, reduces the high-dimensional dataset into a low-dimensional space while preserving high-dimensional structures. The second, clustering, groups similar observations while simultaneously separating dissimilar observations. Existing work presents a number of systems and approaches that utilize these techniques; however, these techniques can cooperate or conflict in unexpected ways.
The core contribution of this work is the systematic examination of the design space at the intersection of dimension reduction and clustering when building intelligent, interactive tools in visual analytics. I survey existing techniques for dimension reduction and clustering algorithms in visual analytics tools, and I explore the design space for creating projections and interactions that include dimension reduction and clustering algorithms in the same visual interface. Further, I implement and evaluate three prototype tools that implement specific points within this design space. Finally, I run a cognitive study to understand how analysts perform dimension reduction (spatialization) and clustering (grouping) operations. Contributions of this work include surveys of existing techniques, three interactive tools and usage cases demonstrating their utility, design decisions for implementing future tools, and a presentation of complex human organizational behaviors. / Doctor of Philosophy / When an analyst is exploring a dataset, they seek to gain insight from the data. With data sets growing larger, analysts require techniques to help them reduce the size of the data while still maintaining its meaning. Two commonly-utilized techniques are dimension reduction and clustering. Dimension reduction seeks to eliminate unnecessary features from the data, reducing the number of columns to a smaller number. Clustering seeks to group similar objects together, reducing the number of rows to a smaller number. The contribution of this work is to explore how dimension reduction and clustering are currently being used in interactive visual analytics systems, as well as to explore how they could be used to address challenges faced by analysts in the future. To do so, I survey existing techniques and explore the design space for creating visualizations that incorporate both types of computations. I look at methods by which an analyst could interact with those projections in other to communicate their interests to the system, thereby producing visualizations that better match the needs of the analyst. I develop and evaluate three tools that incorporate both dimension reduction and clustering in separate computational pipelines. Finally, I conduct a cognitive study to better understand how users think about these operations, in order to create guidelines for better systems in the future.
|
313 |
Optimal Clustering: Genetic Constrained K-Means and Linear Programming AlgorithmsZhao, Jianmin 01 January 2006 (has links)
Methods for determining clusters of data under- specified constraints have recently gained popularity. Although general constraints may be used, we focus on clustering methods with the constraint of a minimal cluster size. In this dissertation, we propose two constrained k-means algorithms: Linear Programming Algorithm (LPA) and Genetic Constrained K-means Algorithm (GCKA). Linear Programming Algorithm modifies the k-means algorithm into a linear programming problem with constraints requiring that each cluster have m or more subjects. In order to achieve an acceptable clustering solution, we run the algorithm with a large number of random sets of initial seeds, and choose the solution with minimal Root Mean Squared Error (RMSE) as our final solution for a given data set. We evaluate LPA with both generic data and simulated data and the results indicate that LPA can obtain a reasonable clustering solution. Genetic Constrained K-Means Algorithm (GCKA) hybridizes the Genetic Algorithm with a constrained k-means algorithm. We define Selection Operator, Mutation Operator and Constrained K-means operator. Using finite Markov chain theory, we prove that the GCKA converges in probability to the global optimum. We test the algorithm with several datasets. The analysis shows that we can achieve a good clustering solution by carefully choosing parameters such as population size, mutation probability and generation. We also propose a Bi-Nelder algorithm to search for an appropriate cluster number with minimal RMSE.
|
314 |
A Polymorphic Ant-Based Algorithm for Graph ClusteringLiu, Ying Ying, Liu, Ying Ying 12 April 2016 (has links)
In this thesis, I introduce two new algorithms: Ant Brood Clustering-Intelligent Ants (ABC-INTE) and Ant Brood Clustering-Polymorphic Ants (ABC-POLY) for the graph clustering problem. ABC-INTE uses techniques such as hopping ants, relaxed drop function, ants with memories, stagnation control, and addition of k-means cluster retrieval process, as an improvement of the basic ABC-KLS algorithm. ABC-POLY uses two types of ants, inspired by the division of labour between the major and minor ants in Pheidole genus, as an improvement of ABC-INTE. For comparison purpose, I also implement MMAS, an ACO clustering algorithm. When tested on the benchmark networks, ABC-POLY outperforms or achieves the same modularity values as MMAS and ABC-INTE on 7 out of 10 networks and is robust against different graphs. In practice, the speed of ABC-POLY is at least 10 times faster than MMAS, making it a scalable algorithm compared to MMAS. ABC-POLY also outputs a direct visual representation of the natural clusters on the graph that is appealing to human observation. This thesis opens an interesting research topic to apply polymorphic ants for graph clustering in the ABC-POLY algorithm. The distributive and self-organization nature of ABC-POLY makes it a candidate for analyzing clusters in more complex and dynamic graphs. / May 2016
|
315 |
Clustering de trajetórias / Trajectory clusteringOshiro, Marcio Takashi Iura 16 September 2015 (has links)
Esta tese teve como objetivo estudar problemas cinéticos de clustering, ou seja, problemas de clustering nos quais os objetos se movimentam. O trabalho se concentrou no caso unidimensional, em que os objetos são pontos se movendo na reta real. Diversas variantes desse caso foram abordadas. Em termos do movimento, consideramos o caso em que cada ponto se move com uma velocidade constante num dado intervalo de tempo, o caso em que os pontos se movem arbitrariamente e temos apenas as suas posições em instantes discretos de tempo, o caso em que os pontos se movem com uma velocidade aleatória em que se conhece apenas o valor esperado da velocidade, e o caso em que, dada uma partição do intervalo de tempo, os pontos se movem com velocidades constantes em cada subintervalo. Em termos do tipo de clustering buscado, nos concentramos no caso em que o número de clusters é um dado do problema e consideramos diferentes medidas de qualidade para o clustering. Duas delas são tradicionais para problemas de clustering: a soma dos diâmetros dos clusters e o diâmetro máximo de um cluster. A terceira medida considerada leva em conta a característica cinética do problema, e permite, de uma maneira controlada, que o clustering mude com o tempo. Para cada uma das variantes do problema, são apresentados algoritmos, exatos ou de aproximação, alguns resultados de complexidade obtidos, e questões que ficaram em aberto. / This work aimed to study kinetic problems of clustering, i.e., clustering problems in which the objects are moving. The study focused on the unidimensional case, where the objects are points moving on the real line. Several variants of this case have been discussed. Regarding the movement, we consider the case where each point moves at a constant velocity in a given time interval, the case where the points move arbitrarily and we only know their positions in discrete time instants, the case where the points move at a random velocity in which only the expected value of the velocity is known, and the case where, given a partition of the time interval, the points move at constant velocities in each sub-interval. Regarding the kind of clustering sought, we focused in the case where the number of clusters is part of the input of the problem and we consider different measures of quality for the clustering. Two of them are traditional measures for clustering problems: the sum of the cluster diameters and the maximum diameter of a cluster. The third measure considered takes into account the kinetic characteristic of the problem, and allows, in a controlled manner, that a cluster change along time. For each of the variants of the problem, we present algorithms, exact or approximation, some obtained complexity results, and open questions.
|
316 |
Algoritmos e técnicas de validação em agrupamento de dados multi-representados, agrupamento possibilístico e bi-agrupamento / Algorithms and validation techniques in multi-represented data clustering, possibilistic clustering and bi-clusteringHorta, Danilo 25 November 2013 (has links)
Existem bases para as quais os dados são naturalmente representados por mais de uma visão. Por exemplo, imagens podem ser descritas por atributos de cores, textura e forma. Proteínas podem ser caracterizadas pela sequência de aminoácidos e pela representação tridimensional. A unificação das diferentes visões de uma base de dados pode ser problemática porque elas podem não ser comparáveis entre si ou podem apresentar diferentes graus de importância. Esses graus de importância podem, inclusive, se manifestar de maneira local, de acordo com a subestrutura dos dados em questão. Isso motivou o surgimento de algoritmos de agrupamento de dados capazes de lidar com bases multi-representadas (i.e., que possuem mais de uma visão dos dados), como o algoritmo SCAD. Esse algoritmo se mostrou promissor em experimentos relatados na literatura, mas possui problemas críticos identificados neste trabalho que o impedem de funcionar em determinados cenários. Tais problemas foram solucionados por meio da proposição de uma nova versão do algoritmo, denominada ASCAD, fundamentada em provas formais sobre a sua convergência. Foram desenvolvidas versões relacionais do algoritmo ASCAD, capazes de lidar com bases descritas apenas por relações de proximidade entre os objetos. Foi desenvolvido também um índice de validação interna e relativa de agrupamento voltado para dados multi-representados. A avaliação de agrupamento possibilístico e de bi-agrupamento por meio da comparação entre solução encontrada e solução de referência (validação externa) também foi explorada. Algoritmos de bi-agrupamento têm ganhado um interesse crescente da comunidade de análise de expressão gênica. No entanto, pouco se conhece do comportamento e das propriedades das medidas voltadas para validação externa de bi-agrupamento, o que motivou uma análise teórica e empírica dessas medidas. Essa análise mostrou que a maioria das medidas de biagrupamento possui problemas críticos e destacou duas delas como sendo as mais promissoras. Foram inclusas nessa análise três medidas de agrupamento particional não exclusivo, cujo uso na comparação de bi-agrupamentos é possível por meio de uma nova abordagem de avaliação de bi-agrupamento proposta nesta tese. Agrupamento particional não exclusivo faz parte de um domínio mais geral de soluções, i.e., o domínio dos agrupamentos possibilísticos. Observou-se algumas falhas conceituais importantes das medidas de agrupamento possibilístico, o que motivou o desenvolvimento de novas medidas e de uma análise empírica e conceitual envolvendo 34 medidas. Uma das medidas propostas se destacou como sendo a única que apresentou avaliações imparciais com relação ao número de grupos, o valor máximo de similaridade ao comparar a solução ideal encontrada com a solução de referência e avaliações sensíveis às diferenças das soluções em todos os cenários considerados / There are data sets for which the instances are naturally represented by more than one view. For example, images can be described by attributes of color, texture, and shape. Proteins can be characterized by the amino acid sequence and by their three-dimensional description. The unification of different views of a data set can be problematic because they may not be comparable or may have different degrees of importance. These degrees of importance may even manifest itself locally, according to the data substructures. This prompted the emergence of clustering algorithms capable of handling multi-represented data sets (i.e., data sets having more than one view) as the SCAD algorithm. This algorithm has shown promising results in experiments reported in the literature, but it has critical problems identified in this work that hinder its application in certain scenarios. These problems were solved here by proposing a new version of the algorithm, called ASCAD, based on formal proofs about its correctness. We developed relational versions for ASCAD, capable of handling data sets described only by the proximities between the instances. We also developed an index for internal and relative validation of multi-represented data clusterings. The evaluation of possibilistic clustering and bi-clustering by comparing the found and reference solutions (external validation) was also explored. Bi-clustering algorithms have gained increasing interest from the community of gene expression analysis. However, little is known of the behavior and properties of the measures aimed at external validation of bi-clustering, which motivated a theoretical and empirical analysis of these measures in this work. This analysis showed that most bi-clustering measures has critical issues and highlighted two of the measures as being the most promising. We included in this analysis three measures of non-exclusive partitional clustering, whose use in comparing bi-clusterings is possible through a new approach proposed in this thesis. Non-exclusive partitional clustering belong to a more general domain of solutions, i.e., the domain of possibilistic clusterings. There are some important conceptual flaws in the measures of possibilistic clustering, which motivated us to develop new measures and to conceptually and empirically analyse 34 measures. One of the proposed measures stood out as being the one who presented unbiased evaluations regarding the number of clusters, the maximum similarity when comparing the optimal solution with the reference one, and evaluations sensitive to solution differences in all scenarios considered
|
317 |
Propriedades assintóticas e estimadores consistentes para a probabilidade de clustering / Asymptotic properties and consistent estimators for the clustering probabilityMelo, Mariana Pereira de 23 May 2014 (has links)
Considere um processo estocástico X_m em tempo discreto definido sobre o alfabeto finito A. Seja x_0^k-1 uma palavra fixa sobre A^k. No estudo das propriedades estatísticas na teoria de recorrência de Poincaré, é clássico o estudo do tempo decorrente até que a sequência fixa x_0^k-1 seja encontrada em uma realização do processo. Tipicamente, esta é uma quantidade exponencialmente grande com relação ao comprimento da palavra. Contrariamente, o primeiro tempo de retorno possível para uma sequência dada está definido como sendo o mínimo entre os tempos de entrada de todas as sequências que começam com a própria palavra e é uma quantidade tipicamente pequena, da ordem do tamanho da palavra. Neste trabalho estudamos o comportamento da probabilidade deste primeiro retorno possível de uma palavra x_0^k-1 dado que o processo começa com ela mesma. Esta quantidade mede a intensidade de que, uma vez observado um conjunto alvo, possam ser observados agrupamentos ou clusters. Provamos que, sob certas condições, a taxa de decaimento exponencial desta probabilidade converge para a entropia para quase toda a sequência quando k diverge. Apresentamos também um estimador desta probabilidade para árvores de contexto e mostramos sua consistência. / Considering a stochastic process X_m in a discrete defined time over a finite alphabet A and x_0^k-1 a fixed word over A^k. In the study of the statistical properties of the Poincaré recurrence theory, it is usual the study of the time elapsed until a fixed sequence x_0^k-1 appears in a given realization of process. This quantity is known as the hitting time and it is usually exponentially large in relation to the size of word. On the opposite, the first possible return time of a given word is defined as the minimum among all the hitting times of realizations that begins with the given word x_0^k-1. This quantity is tipically small that is of the order of the length of the sequence. In this work, we study the probability of the first possible return time given that the process begins of the target word. This quantity measures the intensity of that, once observed the target set, it can be observed in clusters. We show that, under certain conditions, the exponential decay rate of this probability converges to the entropy for all almost every word x_0^k-1 as k diverges. We also present an estimator of this probability for context trees and shows its consistency.
|
318 |
Un protocole de session dans les réseaux de capteurs sans fils / A session protocol in wireless sensor networksHarchi, Said 06 December 2013 (has links)
Les réseaux de capteurs sans fils sont de plus en plus utilisés dans des applications de surveillance de grands systèmes (feux de forêt, gaz dans les galeries minières, éthologie, ...). Une caractéristique de ces applications est que la topologie du réseau va être dynamique : soit les capteurs sont géographiquement mobiles (dispersion d'une nappe de pétrole), soit les conditions environnementales évoluent et modifient les capacités de communication des capteurs entre eux. Aussi, d'un système connexe, on peut évoluer vers un système clustérisé qui présente une rupture de la connectivité globale, et donc du système d'information. Une solution consiste à utiliser un (ou des) collecteur(s) des mesures (par exemple un robot mobile) qui va rétablir une connectivité discrète pour reconstituer à des échéances fixes un système d'information cohérent. Nous avons proposé un algorithme de clustering du réseau de capteurs sans fils adapté à la dynamique de sa topologie. La métrique choisie prend en compte la densité et la mobilité des noeuds, ainsi que leur énergie résiduelle. Ensuite, nous avons conçu un protocole de couche session permettant au collecteur de reconstruire le contexte de communication avec les clusters précédemment visités, sachant qu'ils ont pu évoluer en nombre, dispersion, fusion, ... Pour ce faire, il faut générer dynamiquement une trajectoire optimale du collecteur, en se basant sur un modèle d'estimation de la topologie, en prenant en compte les exigences applicatives (fréquence et volume des informations remontées). Pour la validation de l'algorithme de clustering et du protocole de couche session proposés, nous avons défini un modèle de noeud capteur que nous avons intégré à l'environnement de simulation Opnet / Wireless sensor networks are increasingly used in applications for monitoring large systems (forest fires, gas in the mine galleries, ethology, ...). A characteristic of these applications is that the topology of the network will be dynamic, either the sensors are geographically mobile (dispersion of an oil slick) or environmental conditions change and modify the communication capabilities of these sensors. Also, from a connex system, we can move to a clustered system that presents a discontinuity of the global connectivity, and therefore of the information system. One solution is to use one (or more) collector (s) that will restore a discrete connectivity at fixed deadlines to reconstruct a coherent information system. We have proposed a clustering algorithm of the wireless sensor network which is adapted to the dynamics of its topology. The chosen metric takes into account the density and the mobility of nodes and their remaining energy. Then we designed a session-layer protocol allowing the collector to reconstruct the context of communication with the previously visited clusters, knowing that they have evolved in number, splitting, merging, ... To do this, it is necessary to dynamically generate the trajectory of the collector, on the basis of a model of the topology, taking into account the application requirements (frequency and volume of the collected information). For the validation of the proposed clustering algorithm and the session-layer protocol, we defined a sensor model we have integrated in the Opnet simulation environment
|
319 |
Sistemáticas de agrupamento de países com base em indicadores de desempenho / Countries clustering systematics based on performance indexesMello, Paula Lunardi de January 2017 (has links)
A economia mundial passou por grandes transformações no último século, as quais incluiram períodos de crescimento sustentado seguidos por outros de estagnação, governos alternando estratégias de liberalização de mercado com políticas de protecionismo comercial e instabilidade nos mercados, dentre outros. Figurando como auxiliar na compreensão de problemas econômicos e sociais de forma sistêmica, a análise de indicadores de desempenho é capaz de gerar informações relevantes a respeito de padrões de comportamento e tendências, além de orientar políticas e estratégias para incremento de resultados econômicos e sociais. Indicadores que descrevem as principais dimensões econômicas de um país podem ser utilizados como norteadores na elaboração e monitoramento de políticas de desenvolvimento e crescimento desses países. Neste sentido, esta dissertação utiliza dados do Banco Mundial para aplicar e avaliar sistemáticas de agrupamento de países com características similares em termos dos indicadores que os descrevem. Para tanto, integra técnicas de clusterização (hierárquicas e não-hierárquicas), seleção de variáveis (por meio da técnica “leave one variable out at a time”) e redução dimensional (através da Análise de Componentes Principais) com vistas à formação de agrupamentos consistentes de países. A qualidade dos clusters gerados é avaliada pelos índices Silhouette, Calinski-Harabasz e Davies-Bouldin. Os resultados se mostraram satisfatórios quanto à representatividade dos indicadores destacados e qualidade da clusterização gerada. / The world economy faced transformations in the last century. Periods of sustained growth followed by others of stagnation, governments alternating strategies of market liberalization with policies of commercial protectionism, and instability in markets, among others. As an aid to understand economic and social problems in a systemic way, the analysis of performance indicators generates relevant information about patterns, behavior and trends, as well as guiding policies and strategies to increase results in economy and social issues. Indicators describing main economic dimensions of a country can be used guiding principles in the development and monitoring of development and growth policies of these countries. In this way, this dissertation uses data from World Bank to elaborate a system of grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), selection of variables (through the "leave one variable out at a time" technique) and dimensional reduction (appling Principal Component Analysis). The generated clusters quality is evaluated by the Silhouette Index, Calinski-Harabasz and Davies-Bouldin indexes. The results were satisfactory regarding the representativity of the highlighted indicators and the generated a good clustering quality.
|
320 |
Contributions au clustering collaboratif et à ses potentielles applications en imagerie à très haute résolution / Contributions to collaborative clustering and its potential applications on very high resolution satellite imagesSublime, Jérémie 09 November 2016 (has links)
Cette thèse présente plusieurs algorithmes développés dans le cadre du projet ANR COCLICO et contient deux axes principaux :Le premier axe concerne l'introduction d'un algorithme applicable aux images satellite à très haute résolution, qui est basé sur les champs aléatoires de Markov et qui apporte des notions sémantiques sur les clusters découverts. Cet algorithme est inspiré de l'algorithme Iterated conditional modes (ICM) et permet de faire un clustering sur des segments d'images pré-traitées. La méthode que nous proposons permet de gérer des voisinages irréguliers entre segments et d'obtenir des informations sémantiques de bas niveau sur les clusters de l'image traitée.Le second axe porte sur le développement de méthodes de clustering collaboratif applicables à autant d'algorithmes que possible, ce qui inclut les algorithmes du premier axe. La caractéristique principale des méthodes proposées dans cette thèse est leur applicabilité aux deux cas suivants : 1) plusieurs algorithmes travaillant sur les mêmes objets dans des espaces de représentation différents, 2) plusieurs algorithmes travaillant sur des données différentes ayant des distributions similaires. Les méthodes que nous proposons peuvent s'appliquer à de nombreux algorithmes comme l'ICM, les K-Moyennes, l'algorithme EM, ou les cartes topographiques (SOM et GTM). Contrairement aux méthodes précédemment proposées, notre modèle permet à des algorithmes très différents de collaborer ensemble, n'impose pas de contrainte sur le nombre de clusters recherchés et a une base mathématique solide. / This thesis presents several algorithms developed in the context of the ANR COCLICO project and contains two main axis: The first axis is concerned with introducing Markov Random Fields (MRF) based models to provide a semantic rich and suited algorithm applicable to images that are already segmented. This method is based on the Iterated Conditional Modes Algorithm (ICM algorithm) and can be applied to the segments of very high resolution (VHR) satellite pictures. Our proposed method can cope with highly irregular neighborhood dependencies and provides some low level semantic information on the clusters and their relationship within the image. The second axis deals with collaborative clustering methods developed with the goal of being applicable to as many clustering algorithms as possible, including the algorithms used in the first axis of this work. A key feature of the methods proposed in this thesis is that they can deal with either of the following two cases: 1) several clustering algorithms working together on the same data represented in different feature spaces, 2) several clustering algorithms looking for similar clusters in different data sets having similar distributions. Clustering algorithms to which these methods are applicable include the ICM algorithm, the K-Means algorithm, density based algorithms such as DB-scan, all Expectation-Maximization (EM) based algorithms such as the Self-Organizing Maps (SOM) and the Generative Topographic Mapping (GTM) algorithms. Unlike previously introduced methods, our models have no restrictions in term of types of algorithms that can collaborate together, do not require that all methods be looking for the same number of clusters, and are provided with solid mathematical foundations.
|
Page generated in 0.0896 seconds