1 |
A CLUE for CLUster Ensembles
Hornik, Kurt 20 September 2005
Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package
clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies,
and facilities for computing on these, including methods for measuring proximity and obtaining consensus and "secondary" clusterings. (author's abstract)
|
2 |
Voting-Based Consensus of Data Partitions
Ayad, Hanan 08 1900
Over the past few years, there has been a renewed interest in the consensus
problem for ensembles of partitions. Recent work is primarily motivated by the
developments in the area of combining multiple supervised learners. Unlike the
consensus of supervised classifications, the consensus of data partitions is a
challenging problem due to the lack of globally defined cluster labels and to
the inherent difficulty of data clustering as an unsupervised learning problem.
Moreover, the true number of clusters may be unknown. A fundamental goal of
consensus methods for partitions is to obtain an optimal summary of an ensemble
and to discover a cluster structure with accuracy and robustness exceeding those
of the individual ensemble partitions.
The quality of the consensus partitions highly depends on the ensemble
generation mechanism and on the suitability of the consensus method for
combining the generated ensemble. Typically, consensus methods derive an
ensemble representation that is used as the basis for extracting the consensus
partition. Most ensemble representations circumvent the labeling problem. On
the other hand, voting-based methods establish direct parallels with consensus
methods for supervised classifications, by seeking an optimal relabeling of the
ensemble partitions and deriving an ensemble representation consisting of a
central aggregated partition. An important element of the voting-based
aggregation problem is the pairwise relabeling of an ensemble partition with
respect to a representative partition of the ensemble, which is referred to here
as the voting problem. The voting problem is commonly formulated as a weighted
bipartite matching problem.
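The relabeling step described above can be illustrated with a minimal sketch. It assumes that both partitions use the same number of clusters `k` and that labels are integers in `0..k-1`; the function names `contingency` and `relabel` are hypothetical and are not taken from the dissertation. The matching is solved by exhaustive search over label permutations, which is practical only for small `k`; a real implementation would use an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def contingency(reference, partition, k):
    """Count objects shared by each (partition cluster, reference cluster) pair."""
    table = [[0] * k for _ in range(k)]
    for ref_label, part_label in zip(reference, partition):
        table[part_label][ref_label] += 1
    return table


def relabel(reference, partition, k):
    """Relabel `partition` so that it agrees with `reference` as much as possible.

    Brute-force maximum-weight bipartite matching: try all k! one-to-one
    label mappings and keep the one with the largest total overlap.
    """
    table = contingency(reference, partition, k)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(table[c][perm[c]] for c in range(k)),
    )
    return [best[label] for label in partition]


reference = [0, 0, 1, 1, 2, 2]
partition = [2, 2, 0, 0, 1, 0]  # same structure, other label names, one disagreement
print(relabel(reference, partition, 3))  # [0, 0, 1, 1, 2, 1]
```

After relabeling, the two partitions disagree only on the last object, so their labels can be combined by simple voting.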
In this dissertation, a general theoretical framework for the voting problem as
a multi-response regression problem is proposed. The problem is formulated as
seeking to estimate the uncertainties associated with the assignments of the
objects to the representative clusters, given their assignments to the clusters
of an ensemble partition. A new voting scheme, referred to as cumulative voting,
is derived as a special instance of the proposed regression formulation
corresponding to fitting a linear model by least squares estimation. The
proposed formulation reveals the close relationships between the underlying loss
functions of the cumulative voting and bipartite matching schemes. A useful
feature of the proposed framework is that it can be applied to model substantial
variability between partitions, such as a variable number of clusters.
A general aggregation algorithm with variants corresponding to
cumulative voting and bipartite matching is applied and a simulation-based
analysis is presented to compare the suitability of each scheme to different
ensemble generation mechanisms. The bipartite matching is found to be more
suitable than cumulative voting for a particular generation model, whereby each
ensemble partition is generated as a noisy permutation of an underlying
labeling, according to a probability of error. For ensembles with a variable
number of clusters, it is proposed that the aggregated partition be viewed as an
estimated distributional representation of the ensemble, on the basis of which,
a criterion may be defined to seek an optimally compressed consensus partition.
The properties and features of the proposed cumulative voting scheme are
studied. In particular, the relationship between cumulative voting and the
well-known co-association matrix is highlighted. Furthermore, an adaptive
aggregation algorithm that is suited for the cumulative voting scheme is
proposed. The algorithm aims at selecting the initial reference partition and
the aggregation sequence of the ensemble partitions such that the loss of mutual
information associated with the aggregated partition is minimized. In order to
subsequently extract the final consensus partition, an efficient agglomerative
algorithm is developed. The algorithm merges the aggregated clusters such that
the maximum amount of information is preserved. Furthermore, it allows the
optimal number of consensus clusters to be estimated.
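For readers unfamiliar with the co-association matrix mentioned above, the following is a minimal sketch of how it is commonly computed. The function name `co_association` and the toy ensemble are illustrative assumptions; the code does not reproduce the cumulative voting scheme itself.

```python
def co_association(ensemble):
    """Return the n x n matrix whose (i, j) entry is the fraction of
    ensemble partitions that place objects i and j in the same cluster."""
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(ensemble) for c in row] for row in counts]


# Three partitions of four objects; label names differ between partitions.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
matrix = co_association(ensemble)
print(matrix[2][3])  # objects 2 and 3 are always together: 1.0
print(matrix[0][2])  # objects 0 and 2 are never together: 0.0
```

Because each entry depends only on whether two objects share a label, the matrix is unaffected by how the clusters of each partition are named, which is why this representation sidesteps the labeling problem.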
An empirical study using several artificial and real-world datasets demonstrates
that the proposed cumulative voting scheme leads to discovering substantially
more accurate consensus partitions compared to bipartite matching, in the case
of ensembles with a relatively large or a variable number of clusters. Compared
to other recent consensus methods, the proposed method is found to be comparable
with or better than the best performing methods. Moreover, accurate estimates of
the true number of clusters are often achieved using cumulative voting, whereas
consistently poor estimates are achieved based on bipartite matching. The
empirical evidence demonstrates that the bipartite matching scheme is not
suitable for these types of ensembles.
|
4 |
Uma abordagem de múltiplos aspectos para alinhamento de ontologias baseado em Cluster Ensembles Bayesianos. / A multi-aspect approach for ontology matching based on Bayesian Cluster Ensembles.
Ippolito, André 22 May 2017
Ontologies are formal and explicit specifications used to describe the entities of a domain and their relationships. Recent statistics from the Linked Open Data (LOD) project indicate the existence of thousands of heterogeneous ontologies in the LOD cloud, posing a challenge to ontology integration. A fundamental step in integration is matching, a process that finds corresponding elements between heterogeneous ontologies. To overcome the challenge of large-scale ontology matching, researchers developed a strategy based on clustering, which divides ontologies into subontologies, clusters the subontologies, and restricts the matching process to elements of the same cluster. However, state-of-the-art solutions need to explore more fully the multiple aspects that subontologies have. The clustering solutions for each aspect can be combined by means of a consensus. Cluster Ensembles is a technique for obtaining this consensus. In addition, comparative studies indicated that Bayesian Cluster Ensembles achieves higher clustering accuracy than other Cluster Ensembles techniques.
One of the main goals of this work was to develop a new methodology for ontology matching based on consensus clustering of multiple aspects of communities, structuring a methodological framework in which different techniques and aspects can be incorporated and tested. In the methodology adopted in this work, Community Detection techniques were first applied to partition the ontologies. Next, the following aspects of the resulting communities were considered: terminological, structural and extensional. The communities were clustered separately according to each aspect, and different consensus clustering techniques were applied to obtain a consensus among the clustering solutions of each aspect: Bayesian Cluster Ensembles, similarity-based techniques, and techniques based on direct methods. For each consensus, matching was performed only between elements of the ontologies that belonged to the same consensus cluster. In the case studies carried out in this work, the consensus solutions stood out in precision and recall, while the solution based on the terminological aspect stood out in F-measure. The main contribution of this work is the developed methodology, which constitutes a methodological framework through which different aspects and techniques can be incorporated and tested with respect to their ontology clustering and alignment performance.
|
6 |
Visual analytics for detection and assessment of process-related patterns in geoscientific spatiotemporal data
Köthur, Patrick 04 January 2016
This thesis studied how visual analytics can facilitate the analysis of processes in geoscientific spatiotemporal data. Three novel visual analytics solutions were developed, each addressing an important analysis perspective. The first solution addresses the analysis of prominent spatial situations in the data and their occurrence over time. Hierarchical clustering is used to arrange all spatial situations in the data in a hierarchy of clusters. The combination with interactive visual analysis enables geoscientists to explore and alter the resulting hierarchy, to extract different sets of representative spatial situations, and to interpret and assess the corresponding spatiotemporal patterns. The second solution supports geoscientists in the analysis of prominent types of temporal behavior and their location in geographic space. Cluster ensembles are integrated with interactive visual exploration to enable users to systematically detect and interpret various types of temporal behavior in different data sets and to use this information to assess simulation model output. The third solution enables geoscientists to detect and analyze interrelations of temporal behavior in the data. Windowed cross-correlation, a technique for comparing two individual time series, was extended to the comparison of entire ensembles of time series through visual analytics. This allows scientists not only to study interrelations, but also to assess how much these interrelations vary between two ensembles. All visual analytics solutions were developed following a rigorous user- and task-centered methodology and successfully applied to use cases in Earth system modeling, ocean modeling, paleoclimatology, and even cognitive science. The results of this thesis demonstrate that visual analytics successfully addresses important analysis perspectives and that it is a valuable approach to the analysis of process-related patterns in geoscientific spatiotemporal data.
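As background for the third solution, the baseline technique of windowed cross-correlation between two individual time series can be sketched as follows. This is only the two-series case, not the ensemble extension developed in the thesis; the function names, the use of the Pearson coefficient, and the toy series are assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / sqrt(var_a * var_b)


def windowed_cross_correlation(x, y, window, lag=0, step=1):
    """Correlate x[t:t+window] with y[t+lag:t+lag+window] for each window start t.

    Windows whose lagged counterpart would fall outside `y` are skipped.
    Returns a list of (start, correlation) pairs.
    """
    result = []
    for start in range(0, len(x) - window + 1, step):
        lo = start + lag
        hi = lo + window
        if lo < 0 or hi > len(y):
            continue
        result.append((start, pearson(x[start:start + window], y[lo:hi])))
    return result


x = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
y = [5, 0, 1, 2, 3, 2, 1, 0, 1, 2]  # x delayed by one time step
for start, r in windowed_cross_correlation(x, y, window=4, lag=1):
    print(start, round(r, 3))
```

With a lag of one step every window of `x` lines up with an identical window of `y`, so each correlation is 1; repeating the call for a range of lags shows at which delay the two series are most strongly related in each window.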
|
7 |
Agrupamento nebuloso de dados baseado em enxame de partículas: seleção por métodos evolutivos e combinação via relação nebulosa do tipo-2 [Fuzzy data clustering based on particle swarms: selection by evolutionary methods and combination via a type-2 fuzzy relation]
Szabo, Alexandre 29 October 2014
Sponsor: Fundação de Amparo à Pesquisa do Estado de São Paulo
Clustering usually treats objects as belonging to mutually exclusive clusters, which is often imprecise, because an object may belong to more than one cluster simultaneously with different membership degrees. Clustering algorithms, both crisp and fuzzy, have a number of parameters that must be adjusted to obtain the best performance for a given database. Furthermore, it is known that no single algorithm is better than all the others for all problem classes, and that combining the solutions found by various algorithms (or by the same algorithm with different parameters) may lead to a global solution that is better than those found by the individual algorithms, including the best one. It is within this context that the present thesis proposes a new fuzzy clustering algorithm inspired by the behavior of particle swarms and then introduces a new way of combining clustering algorithms (ensembles) using concepts from Type-2 fuzzy sets.
|