1 |
Estudo e desenvolvimento de algoritmos para agrupamento fuzzy de dados em cenários centralizados e distribuídos / Study and development of fuzzy clustering algorithms in centralized and distributed scenarios
Vendramin, Lucas, 05 July 2012
Data clustering is a fundamental problem in data mining: partitioning a data set into groups whose objects are more similar (or related) to one another than to objects in other groups. Traditional algorithms assume that each object belongs exclusively to a single cluster. This assumption may not be realistic in many applications, in which groups of objects present statistical distributions with some overlap. Fuzzy clustering algorithms can deal naturally with such problems. The literature on fuzzy clustering is extensive: many algorithms with different characteristics and purposes have been proposed and investigated, and each is more (or less) suitable for specific scenarios, e.g., finding clusters with different shapes or working with data sets described by attributes of different types. Additionally, there are scenarios in which the data are (or can be) distributed among different sites. In these scenarios, the goal of a clustering algorithm is to find a structure that describes the data held at the different sites without transmitting the data to, or storing and processing them at, a central location. Such algorithms are known as distributed clustering algorithms. This work studies and improves centralized and distributed fuzzy clustering algorithms from the literature, identifying the main characteristics, advantages, disadvantages and most appropriate scenarios for each of them, including analyses of time, space and communication complexity for the distributed algorithms.
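To make the idea of partial membership concrete, the following is a minimal sketch of the classical fuzzy c-means procedure, one well-known centralized fuzzy clustering algorithm. It is not taken from the thesis; the function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the random initialization and the stopping tolerance are all illustrative assumptions.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples.

    Illustrative sketch only: `m` is the fuzzifier (> 1) and `c` the number
    of clusters; both are assumptions chosen by the caller.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Centers: means of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        # Memberships: a point belongs more strongly to closer centers.
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], ctr), 1e-12) for ctr in centers]
            new_row = [
                1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                          for k in range(c))
                for j in range(c)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        # Stop once no membership changed by more than the tolerance.
        if shift < tol:
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, so an object lying between two overlapping groups receives a split membership instead of being forced into a single cluster.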
|
2 |
Modern aspects of unsupervised learning
Liang, Yingyu, 27 August 2014
Unsupervised learning has become more and more important due to the recent explosion of data. Clustering, a key topic in unsupervised learning, is a well-studied task arising in many applications ranging from computer vision to computational biology to the social sciences. This thesis is a collection of work exploring two modern aspects of clustering: stability and scalability.
In the first part, we study clustering under a stability property called perturbation resilience. As an alternative approach to worst case analysis, this novel theoretical framework aims at understanding the complexity of clustering instances that satisfy natural stability assumptions. In particular, we show how to correctly cluster instances whose optimal solutions are resilient to small multiplicative perturbations on the distances between data points, significantly improving existing guarantees. We further propose a generalized property that allows small changes in the optimal solutions after perturbations, and provide the first known positive results in this more challenging setting.
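For readers unfamiliar with the term, the stability property can be stated as follows. This is the formulation commonly used in the clustering literature, given here as an assumption about what the abstract refers to and not quoted from the thesis; the symbols $X$, $d$, $d'$, $\Phi$ and $\alpha$ are notation introduced for this sketch.

```latex
\textbf{Definition ($\alpha$-perturbation resilience).}
Let $(X, d)$ be a clustering instance with objective $\Phi$ (e.g.\ $k$-median)
and let $\alpha > 1$. A function $d' : X \times X \to \mathbb{R}_{\ge 0}$ is an
$\alpha$-perturbation of $d$ if
\[
  d(x, y) \;\le\; d'(x, y) \;\le\; \alpha \, d(x, y)
  \qquad \text{for all } x, y \in X .
\]
The instance is $\alpha$-perturbation resilient if, for every
$\alpha$-perturbation $d'$, the optimal clustering of $(X, d')$ under $\Phi$
is unique and equals the optimal clustering of $(X, d)$.
```

Under this reading, the generalized property mentioned in the abstract relaxes the last condition, so that the optimal clustering after perturbation only needs to stay close to the original one.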
In the second part, we study the problem of clustering large-scale data distributed across nodes that communicate over the edges of a connected graph. We provide algorithms with small communication cost and provable guarantees on the clustering quality. We also propose algorithms for distributed principal component analysis, which can be used to reduce the communication cost of clustering high-dimensional data while only marginally compromising the clustering quality.
In the third part, we study community detection, the modern extension of clustering to network data. We propose a theoretical model of communities that are stable in the presence of noisy nodes in the network, and design an algorithm that provably detects all such communities. We also provide a local algorithm for large scale networks, whose running time depends on the sizes of the output communities but not that of the entire network.
|
4 |
Distributed Hierarchical Clustering
Loganathan, Satish Kumar, January 2018
No description available.
|
5 |
Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments
Hammouda, Khaled M., January 2007
This thesis addresses difficult challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is the extraction of topics or summaries from documents for the purpose of interpretation of clustering results. Another important challenge, which is caused by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming more distributed.
We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously. We also introduce two solutions for the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation.
The keyphrase extraction algorithm efficiently extracts and scores candidate keyphrases from a document cluster. The algorithm is called CorePhrase and is based on modeling document collections as a graph upon which we can leverage graph mining to extract frequent and significant phrases, which are used to label the clusters. Results show that CorePhrase can extract keyphrases relevant to documents in a cluster with very high accuracy. Although this algorithm can be used to summarize centralized clusters, it is specifically employed within distributed clustering to both boost distributed clustering accuracy, and to provide summaries for distributed clusters.
The first method for distributed document clustering is called collaborative peer-to-peer document clustering, which models nodes in a peer-to-peer network as collaborative nodes with the goal of improving the quality of individual local clustering solutions. This is achieved through the exchange of local cluster summaries between peers, followed by recommendation of documents to be merged into remote clusters. Results on large sets of distributed document collections show that: (i) this collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with a larger number of nodes generally achieve greater improvements in clustering after collaboration, relative to the initial clustering before collaboration, while on the other hand they tend to achieve lower absolute clustering quality than networks with fewer nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration tends to have little effect on improving clustering quality.
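The summary-exchange and recommendation step can be illustrated with a small sketch. It is only a stand-in for the thesis's method: summaries here are the most frequent terms of a cluster instead of CorePhrase keyphrases, and the helper names (`summarize`, `recommend`), the `top_k` cut-off and the cosine `threshold` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(cluster_docs, top_k=5):
    """Hypothetical cluster summary: the top_k most frequent terms."""
    total = Counter()
    for doc in cluster_docs:
        total.update(term_vector(doc))
    return Counter(dict(total.most_common(top_k)))


def recommend(local_docs, remote_summaries, threshold=0.3):
    """Map each remote cluster id to the local documents that resemble it."""
    out = {cid: [] for cid in remote_summaries}
    for doc in local_docs:
        vec = term_vector(doc)
        for cid, summary in remote_summaries.items():
            if cosine(vec, summary) >= threshold:
                out[cid].append(doc)
    return out
```

In this sketch a peer sends only its compact summaries to its neighbours, and each neighbour answers with the documents it would recommend for merging, so whole collections never leave their node.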
The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes in single-level networks; (ii) we can achieve decent speedup by making the hierarchy taller, but at the expense of clustering quality, which degrades as we go up the hierarchy; (iii) in networks that grow arbitrarily, data gets more fragmented across neighborhoods, causing poor centroid generation, which suggests we should not increase the number of nodes in the network beyond a certain level without increasing the data set size; and (iv) distributed cluster summarization can produce accurate summaries similar to those produced by centralized summarization.
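One way to picture clustering over a hierarchy of neighborhoods is to let each level merge the weighted centroids reported by the level below. The sketch below is an assumption-laden illustration, not the thesis's algorithm: the function `merge_summaries`, the (centroid, count) summary format and the choice of seeding with the heaviest centroids are all hypothetical.

```python
import math


def merge_summaries(children, k):
    """Merge children's (centroid, count) lists into at most k weighted centroids.

    Each child is a list of (centroid_tuple, point_count) pairs. Seeding with
    the k heaviest centroids is an arbitrary, illustrative choice.
    """
    items = [pair for child in children for pair in child]
    seeds = [c for c, _ in sorted(items, key=lambda p: -p[1])[:k]]
    sums = [[0.0] * len(seeds[0]) for _ in seeds]
    counts = [0] * len(seeds)
    for centroid, n in items:
        # Assign each reported centroid to its nearest seed.
        j = min(range(len(seeds)), key=lambda i: math.dist(centroid, seeds[i]))
        counts[j] += n
        for d, x in enumerate(centroid):
            sums[j][d] += n * x
    # Merged centroid = count-weighted mean of the centroids assigned to it.
    return [(tuple(s / cnt for s in vec), cnt)
            for vec, cnt in zip(sums, counts) if cnt > 0]
```

Calling `merge_summaries` once per neighborhood and then again on the neighborhood results gives a two-level hierarchy. Only centroids and counts travel upward, and each merge discards detail about the underlying documents, which is one plausible intuition for the quality loss reported for taller hierarchies.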
The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections. Achieving the same results using current methodologies requires centralization of the data first, which is sometimes not feasible.
|
7 |
Bee clustering: um algoritmo para agrupamento de dados inspirado em inteligência de enxames / Bee clustering: a clustering algorithm inspired by swarm intelligence
Santos, Daniela Scherer dos, January 2009
Clustering is the process of dividing a data set into groups so that similar items fall in the same group while dissimilar items are assigned to different groups. Traditional clustering techniques have usually been developed in a centralized fashion, since they rely on data structures that must be accessed and modified at each step of the clustering process. Moreover, the results of such methods depend on information that must be supplied a priori, for example the number of clusters, the cluster size, or the minimum/maximum density allowed for a cluster. This work proposes bee clustering, a distributed algorithm inspired mainly by swarm intelligence techniques such as the organization of bee colonies and task allocation among social insects. It is designed to solve the clustering problem without hints about the desired result, such as the number of classes, the number of partitions or the partition size, and without initializing complex parameters. The bee clustering algorithm is able to form groups of agents in a distributed way, a typical necessity in multiagent scenarios that require self-organization without central control. The results show that it is possible to achieve results comparable to those of centralized approaches.
|
10 |
Spatially Correlated Data Accuracy Estimation Models in Wireless Sensor Networks
Karjee, Jyotirmoy, January 2013
One of the major applications of wireless sensor networks is to sense accurate and reliable data from the physical environment, with or without a priori knowledge of data statistics. To extract accurate data from the physical environment, we investigate spatial data correlation among sensor nodes to develop data accuracy models. We propose three data accuracy models that use a priori knowledge of data statistics: the Estimated Data Accuracy (EDA) model, the Cluster based Data Accuracy (CDA) model and the Distributed Cluster based Data Accuracy (DCDA) model.
Because sensor nodes are deployed at high density, the observed data are highly correlated among sensor nodes, which form distributed clusters in space. We describe two clustering algorithms, the Deterministic Distributed Clustering (DDC) algorithm and the Spatial Data Correlation based Distributed Clustering (SDCDC) algorithm, implemented under the CDA model and the DCDA model respectively. Moreover, because of data correlation in the network, the data collected by sensor nodes are redundant. Hence, it is not necessary for all sensor nodes to transmit their highly correlated data to the central node (sink node or cluster head node). An optimal subset of sensor nodes is sufficient to measure accurate data and transmit them to the central node. This reduces data redundancy, energy consumption and data transmission cost, and so increases the lifetime of sensor networks.
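The redundancy argument can be illustrated with a toy selection rule: keep a sensor only if its readings are not strongly correlated with those of a sensor already kept. This greedy rule is a simplification chosen for illustration and is not the DDC or SDCDC algorithm; the function names and the 0.9 correlation threshold are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length reading sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_representatives(readings, threshold=0.9):
    """Greedily keep sensors that are not highly correlated with a kept one.

    `readings` maps a sensor id to its list of measurements.
    """
    kept = []
    for sensor_id in sorted(readings):
        if all(abs(pearson(readings[sensor_id], readings[k])) < threshold
               for k in kept):
            kept.append(sensor_id)
    return kept
```

Only the sensors returned by `select_representatives` would transmit to the central node in this sketch; the others are treated as redundant because a kept sensor already tracks the same signal.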
Finally, we propose a fourth accuracy model, the Adaptive Data Accuracy (ADA) model, which does not require any a priori knowledge of data statistics. The ADA model can sense a continuous data stream at regular time intervals to estimate accurate data from the environment and select an optimal set of sensor nodes for data transmission to the network. Data transmission can be further reduced for these optimal sensor nodes by transmitting a subset of sensor data, using a data reduction methodology called the Spatio-Temporal Data Prediction (STDP) model. Furthermore, we implement the data accuracy model for a network under threat of malicious attack.
|