11 |
Mécanismes pour la cohérence, l'atomicité et les communications au niveau des clusters : application au clustering hiérarchique distribué adaptatif / Mechanism for coherence, atomicity and communications at clusters level : application to adaptative distributed hierarchical clustering
Avril, François 29 September 2015
To manage large-scale dynamic distributed systems, made up of communicating devices that can connect or disconnect at any time, we propose to partition the system into connected subgraphs, called clusters. To organize large networks, we build a nested hierarchical structure in which the clusters of one level are grouped into clusters of the next higher level. To achieve this, we introduce mechanisms that allow clusters to act as the nodes of a new distributed system that executes an algorithm of our choice. In particular, this requires mechanisms that keep the behavior of the nodes within each cluster coherent with respect to the higher level. By allowing clusters to form a new distributed system that executes our clustering algorithm, we compute a nested hierarchical clustering with a bottom-up approach. We formally define the distributed system of clusters and prove that any execution of our algorithm induces an execution of the higher-level algorithm on that system. This lets us prove by induction that our algorithm computes a nested hierarchical clustering. Finally, we apply this approach to a problem that arises in sensor networks: collisions. To avoid collisions, we compute a suitable clustering of the system, which is then used to compute a communication schedule guaranteeing that two messages are never sent simultaneously within the communication range of any one sensor.
|
12 |
Variable Selection Methods for Model-based Clustering and Application to High-dimensional Data
Xu, Jini January 2022
Clustering helps in understanding the natural grouping and internal structure of data. Model-based clustering treats each cluster as a component of a mixture model. As the dimensionality and complexity of the data increase, model-based clustering tends to become over-parameterized. It is therefore important to select a subset of critical variables instead of using all the variables for clustering. This study considers two variable selection methods for model-based clustering on real-world high-dimensional data: variable selection for clustering and classification (VSCC) and variable selection for model-based clustering (clustvarsel). For simplicity, Gaussian mixture models are applied. Three criteria are used to compare clustering accuracy and efficiency: the adjusted Rand index (ARI), mis-clustering error, and running time (in seconds). / Thesis / Master of Science (MSc)
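As an illustration of the adjusted Rand index (ARI) named above as a comparison criterion, the following is a minimal sketch using the standard pair-counting formula. It is not taken from the thesis; the function name and the example labels are assumptions made for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions trivially agree.
        return 1.0

    # Contingency counts: items sharing each (true, predicted) label pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together by both / by each partition.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions use a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels: the same grouping under renamed cluster ids.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index is invariant to how clusters are named, which is why it suits comparing a clustering against reference classes; values near 0 indicate chance-level agreement.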
|
13 |
Multi-Domain Clustering on Real-Valued Datasets
Hu, Zhen 23 September 2011
No description available.
|
14 |
An algorithm for identifying clusters of functionally related genes in genomes
Yi, Gang Man 15 May 2009
An increasing body of literature shows that genomes of eukaryotes can contain
clusters of functionally related genes. Most approaches to identify gene clusters utilize
microarray data or metabolic pathway databases to find groups of genes on
chromosomes that are linked by common attributes. A generalized method that can find
gene clusters, regardless of the mechanism of origin, would provide researchers with
an unbiased method for finding clusters and studying the evolutionary forces that
give rise to them.
I present the basis of an algorithm to identify gene clusters in eukaryotic genomes
that utilizes functional categories defined in graph-based vocabularies such as the
Gene Ontology (GO). Clusters identified in this manner need only have a common
function and are not constrained by gene expression or other properties. I tested the
algorithm by analyzing genomes of a representative set of species. I identified
species-specific variation in the percentage of clustered genes as well as in properties of gene
clusters, including size distribution and functional annotation. These properties may
be diagnostic of the evolutionary forces that lead to the formation of gene clusters.
The approach finds all gene clusters in the data set and ranks them by their likelihood
of occurrence by chance. The method successfully identified clusters.
|
15 |
Feature Translation-based Multilingual Document Clustering Technique
Liao, Shan-Yu 08 August 2006
Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal with monolingual documents (i.e., documents written in one language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, creating the need for multilingual document clustering (MLDC). Motivated by this need, this study designs a translation-based MLDC technique. Our empirical evaluation shows that the proposed technique achieves satisfactory clustering effectiveness, as measured by both cluster recall and cluster precision.
|
16 |
On the evaluation of clustering results: measures, ensembles, and gene expression data analysis / Sobre a avaliação de resultados de agrupamento: medidas, comitês e análise de dados de expressão gênica
Pablo Andretta Jaskowiak 27 November 2015
Clustering plays an important role in the exploratory analysis of data. Its goal is to organize objects into a finite set of categories, i.e., clusters, in the hope that meaningful and previously unknown relationships will emerge from the process. Not every clustering result is meaningful, though. In fact, virtually all clustering algorithms will yield a result, even if the data under analysis has no true clusters. If clusters do exist, one still has to determine the best configuration of parameters for the clustering algorithm at hand, in order to avoid poor outcomes. This selection is usually performed with the aid of clustering validity criteria, which evaluate clustering results in a quantitative fashion. In this thesis we study the evaluation/validation of clustering results, proposing, in a broad context, measures and ensembles of relative validity criteria. Regarding measures, we propose the use of the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) curve as a relative validity criterion for clustering. Besides providing an empirical evaluation of AUC, we theoretically explore some of its properties and its relation to another measure, known as Gamma. A relative criterion for the validation of density-based clustering results, proposed with the participation of the author of this thesis, is also reviewed. In the case of ensembles, we propose their use as a means to avoid evaluating clustering results with a single measure selected ad hoc. In this particular scope, we: (i) show that ensembles built on the basis of arbitrarily selected members have limited practical applicability; and (ii) devise a simple, yet effective heuristic approach to select ensemble members, based on their effectiveness and complementarity. Finally, we consider clustering evaluation in the specific context of gene expression data. In this particular case we evaluate the use of external information from the Gene Ontology, in the form of semantic similarities, for the evaluation of distance measures and gene clustering results.
|
17 |
Propagation du buzz sur Internet -- Identification, analyse, modélisation et représentation dans un contexte de veille / Buzz lifecyle on the Web -- Identification, analysis, modelization and representation in the context of strategic and competitive intelligence
Lauf, Aurélien 14 October 2014
This thesis is set in the context of strategic and competitive intelligence on the Internet. Its goal is to develop tools and methods to identify, analyze, model and represent how buzz spreads on the Internet. Any buzz has one or more starting points, i.e. primary sources. The information is then passed on by secondary sources, which may or may not speed up its spread depending on their degree of influence. Throughout the buzz lifecycle, the semantic content evolves. Understanding a buzz on the Internet therefore requires analyzing what is said and characterizing who says it. This thesis focuses on two complementary types of analysis: a topological analysis of the sources (graph theory and networks), and an analysis of the textual content (corpus linguistics).
|
19 |
Una metodología para enfrentar el dinamismo de atributos en clustering / A methodology for handling attribute dynamism in clustering
Barrera Aylwin, Sergio Benito January 2017
Master in Operations Management. Industrial Civil Engineer / This work develops a
methodology for the clustering problem in which one of the attributes is
incomplete and is filled in dynamically, and implements that methodology in
a particular model. The implemented model is based on the projected
clustering model (PROCLUS) developed by Aggarwal et al. in 1999.
Two restrictions are added to the dynamic problem: missing values (those that
have not yet arrived) cannot be imputed, and rows containing missing values
cannot be marginalized. These restrictions are imposed because otherwise the
problem can easily be solved statically and/or has known dynamic solutions.
The projected clustering model was modified to respect these restrictions and
to implement the desired dynamic behavior. To evaluate the model, synthetic
data were generated (95000 rows) in several instances designed to produce
scenarios in which the cluster structure changes as new data arrive. Synthetic
generation made it possible to evaluate the results and to observe how the
detection of dimensions and clusters evolved.
Given the base model chosen, the modification exhibits some of the same
limitations, such as the need for a large number of dimensions.
The results of the implementation were satisfactory: it found the expected
solutions after a reasonable number of iterations and completed the operations
in less time than a static application of the model after the arrival of each
batch of data. A measure was also created to analyze and/or detect changes in
the cluster structure as the data for the new column arrive.
Finally, with respect to the objectives set out in this work, the developed
model meets them: it provides a model and methodology that effectively
addresses the problem described above, applies it to simulated data, and
analyzes the results.
|
20 |
Diarizace meetingové řeči - Kdo mluví kdy / Speaker Diarization of Meeting Data
Tůma, Radovan Unknown Date
This work proposes a diarization system based on the Bayesian Information Criterion (BIC). It describes the background theory and briefly reviews previously used systems. The idea is to apply previously proposed methods in a faster and more reliable way. The proposed system was tested on several recordings to measure its error rate. The test results are not very good, but some possible improvements are proposed.
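The abstract does not say how BIC is applied, so the following is only a sketch of one common use of BIC in diarization: deciding whether two adjacent segments are better modelled by one Gaussian or by two. It is simplified to one-dimensional features, whereas real systems typically use multi-dimensional acoustic features. The function names, the `penalty_weight` parameter and the sample values are assumptions for illustration, not details of the thesis.

```python
import math


def _variance(samples):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return max(var, 1e-12)


def delta_bic(left, right, penalty_weight=1.0):
    """BIC gain from modelling two 1-D segments separately rather than jointly.

    Positive values favour a change point (different speakers) between them.
    """
    n_left, n_right = len(left), len(right)
    n_total = n_left + n_right
    # Log-likelihood gain of two Gaussians over a single shared Gaussian.
    gain = 0.5 * (
        n_total * math.log(_variance(list(left) + list(right)))
        - n_left * math.log(_variance(left))
        - n_right * math.log(_variance(right))
    )
    # The split model has one extra mean and one extra variance (2 parameters).
    penalty = penalty_weight * 0.5 * 2 * math.log(n_total)
    return gain - penalty


# Hypothetical 1-D feature values for two adjacent segments.
segment_a = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
segment_b = [5.0, 5.1, 4.9, 5.05, 4.95] * 4
print(delta_bic(segment_a, segment_b) > 0)  # clearly different segments
```

In such a scheme, segments whose pairwise score stays negative would be merged into one speaker cluster; `penalty_weight` trades missed speaker changes against false ones.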
|