1

Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering

Xiong, Xuejian, Tan, Kian Lee 01 1900 (has links)
In this paper, a similarity-driven cluster merging method is proposed for unsupervised fuzzy clustering. The cluster merging method is used to resolve the problem of cluster validation. Starting with an overspecified number of clusters in the data, pairs of similar clusters are merged based on the proposed similarity-driven cluster merging criterion. The similarity between clusters is calculated from a fuzzy cluster similarity matrix, while an adaptive threshold is used for merging. In addition, a modified generalized objective function is used for prototype-based fuzzy clustering. The function includes the p-norm distance measure as well as principal components of the clusters. The number of principal components is determined automatically from the data being clustered. The performance of this unsupervised fuzzy clustering algorithm is evaluated by several experiments on an artificial data set and a gene expression data set. / Singapore-MIT Alliance (SMA)
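A minimal sketch of the merge step described above. The abstract does not give the exact similarity criterion or the adaptive threshold, so the cosine-style similarity between membership rows and the fixed `threshold` parameter below are illustrative assumptions, not the thesis's definitions.

```python
import numpy as np

def fuzzy_cluster_similarity(U):
    """Pairwise cluster similarity from a fuzzy membership matrix U (c x n).

    Assumption: a normalized inner product of membership rows stands in
    for the thesis's fuzzy cluster similarity matrix.
    """
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    Un = U / np.clip(norms, 1e-12, None)
    return Un @ Un.T

def merge_most_similar(U, threshold):
    """Merge the most similar cluster pair if its similarity exceeds the threshold."""
    S = fuzzy_cluster_similarity(U)
    np.fill_diagonal(S, -np.inf)            # ignore self-similarity
    i, j = np.unravel_index(np.argmax(S), S.shape)
    if S[i, j] < threshold:
        return U, False                     # no pair is similar enough to merge
    merged = U[i] + U[j]                    # summing rows preserves membership mass
    U = np.delete(U, [i, j], axis=0)
    return np.vstack([U, merged]), True
```

Starting from an overspecified number of clusters, repeated calls to `merge_most_similar` would shrink the partition until no pair clears the threshold.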
2

Automatizando o processo de estimativa de revocação e precisão de funções de similaridade / Automatizing the process of estimating recall and precision of similarity functions

Santos, Juliana Bonato dos January 2008 (has links)
Traditional database query mechanisms, which use the equality criterion, become ineffective when the stored data contain both spelling and format variations. In such cases, similarity functions must be used instead of Boolean operators. Similarity query mechanisms return a ranking of elements ordered by their similarity score with respect to the query object, and a similarity threshold can be used to delimit which elements of this ranking actually belong to the result. Choosing an appropriate threshold is difficult, however, since the right value depends on the similarity function used and on the semantics of the queried data. One way to support this choice is to evaluate the quality of the results that a similarity function produces under different thresholds on a sample of the data collection. This work presents an automatic method for estimating the quality of similarity functions through recall and precision measures computed for different thresholds. The results can be stored as metadata and, given the requirements of a specific application, help in setting the most suitable threshold. The process uses similarity-based clustering methods, together with measures that validate the groups those methods form, to eliminate human intervention while estimating recall and precision.
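A sketch of the threshold-sweep evaluation described above, assuming ground-truth relevance is available (the thesis's contribution is precisely to replace that ground truth with automatically validated clusters):

```python
import numpy as np

def recall_precision_by_threshold(scores, relevant, thresholds):
    """Recall and precision of a similarity ranking at each candidate threshold.

    scores: similarity values for the ranked candidates.
    relevant: boolean array marking the true matches (assumed known here).
    """
    total_relevant = relevant.sum()
    results = []
    for t in thresholds:
        selected = scores >= t
        tp = np.logical_and(selected, relevant).sum()
        # Convention: precision is 1.0 when the threshold selects nothing.
        precision = tp / selected.sum() if selected.any() else 1.0
        recall = tp / total_relevant if total_relevant else 0.0
        results.append((t, recall, precision))
    return results

# Example: sweep three thresholds over a small ranking.
scores = np.array([0.95, 0.90, 0.72, 0.55, 0.30])
relevant = np.array([True, True, False, True, False])
for t, r, p in recall_precision_by_threshold(scores, relevant, [0.9, 0.7, 0.5]):
    print(f"threshold={t:.1f}  recall={r:.2f}  precision={p:.2f}")
```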
3

Hyperplane Clustering : A New Divisive Clustering Algorithm

Yogananda, A P 01 1900 (has links) (PDF)
No description available.
4

Ant Clustering with Consensus

Gu, Yuhua 01 April 2009 (has links)
Clustering is actively used in several research fields, such as pattern recognition, machine learning, and data mining. This dissertation focuses on clustering algorithms in the data mining area. Clustering algorithms can be applied to the unsupervised learning problem of finding clusters in unlabeled data. Most clustering algorithms require that the number of cluster centers be known in advance, which is often unrealistic for real-world applications, since in most cases this information is unavailable. A further question is whether, once clusters are found, we should believe they are exactly the right ones or whether better ones exist. In this dissertation, we present two new Swarm Intelligence based approaches to data clustering that address these issues. Swarm-based approaches to clustering have been shown to be able to skip local extrema by performing a form of global search, and our two newly proposed ant clustering algorithms take advantage of this. The first algorithm is a kernel-based fuzzy ant clustering algorithm that uses the Xie-Beni partition validity metric. It proceeds in two stages: in the first stage, ants move the cluster centers in feature space, and the centers they find are evaluated with a reformulated kernel-based Xie-Beni cluster validity metric. We found that, when provided with more clusters than exist in the data, this ant-based approach produces a partition with empty and/or very lightly populated clusters; the second stage exploits this to detect the number of clusters automatically by thresholding the solutions. The second ant clustering algorithm, based on the chemical recognition of nestmates, combines an ant-based algorithm with a consensus clustering algorithm. It is a two-stage algorithm that requires no initial knowledge of the number of clusters. The main contributions of this work are using the ability of an ant-based clustering algorithm to determine the number of cluster centers and to refine them, and then applying a consensus clustering algorithm to obtain a better-quality final solution. We also introduce an ensemble ant clustering algorithm that finds a consistent number of clusters with appropriate parameters, and we propose a modified online ant clustering algorithm to handle clustering of large data sets. To our knowledge, we are the first to use consensus to combine multiple ant partitions to obtain robust clustering solutions. Experiments were done with twelve data sets, including benchmark data sets, two artificially generated data sets, and two magnetic resonance image brain volumes. The results show how the ant clustering algorithms play an important role in finding the number of clusters and in providing useful information for consensus clustering to locate the optimal clustering solutions. A wide range of comparative experiments demonstrates the effectiveness of the new approaches.
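The dissertation's reformulated kernel-based Xie-Beni metric is not given in the abstract; the classical (non-kernel) Xie-Beni index, sketched below for reference, is the compactness-to-separation ratio that the kernel variant generalizes. Lower values indicate a better partition.

```python
import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """Classical Xie-Beni validity index (lower is better).

    X: (n, d) data; centers: (c, d) prototypes; U: (c, n) fuzzy memberships;
    m: fuzzifier. This is the standard form, not the dissertation's
    kernel-based reformulation.
    """
    # Squared distances from every center to every sample: (c, n).
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # Minimum squared distance between distinct centers.
    cdist2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(cdist2, np.inf)
    separation = cdist2.min()
    return compactness / (X.shape[0] * separation)
```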
5

Experiments with K-Means, Fuzzy c-Means and Approaches to Choose K and C

Hong, Sui 01 January 2006 (has links)
A parameter specifying the number of clusters in an unsupervised clustering algorithm is often unknown. Different cluster validity indices proposed in the past have attempted to address this issue, and their performance is directly related to the accuracy of a clustering algorithm. The gap statistic proposed by Tibshirani (2001) was applied to k-means and hierarchical clustering algorithms for estimating the number of clusters and was shown to outperform other cluster validity measures, especially in the null model case. In our experiments, the gap statistic is applied to the Fuzzy c-Means (FCM) algorithm and compared to existing FCM cluster validity indices examined by Pal (1995). A comparison is also made between two initialization methods, in which centers are either randomly assigned to data points or initialized using the furthest-first algorithm (Hochbaum, 1985). The gap statistic can be applied with the FCM algorithm as long as the fuzzy partition matrix can be employed in computing the gap statistic metric, W_k. Three new methodologies are examined for computing this metric in order to apply the gap statistic to the FCM algorithm. The fuzzy partition matrix generated by FCM can also be thresholded on the maximum membership to allow a computation similar to that of the k-means algorithm; this is assumed to be the current method for employing the gap statistic with the FCM algorithm, and it is compared to the three proposed methods. In our results, the gap statistic outperformed the cluster validity indices for FCM, and one of the new methodologies introduced for computing the metric, based upon the FCM objective function, outperformed the threshold method for m=2.
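For reference, a sketch of the gap statistic of Tibshirani (2001), here computed with k-means inertia as W_k under a uniform reference distribution over the data's bounding box. The thesis's FCM variants would replace this W_k with one derived from the fuzzy partition matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=None):
    """Gap(k) = E[log W_k(reference)] - log W_k(data); larger gaps favor k.

    Uses k-means inertia (within-cluster sum of squares) as W_k and a
    uniform reference over the data's bounding box.
    """
    rng = np.random.default_rng(seed)

    def w_k(data):
        return KMeans(n_clusters=k, n_init=10).fit(data).inertia_

    log_wk = np.log(w_k(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wk = [np.log(w_k(rng.uniform(lo, hi, size=X.shape)))
                  for _ in range(n_refs)]
    return np.mean(ref_log_wk) - log_wk
```

Tibshirani's rule then selects the smallest k whose gap is at least the gap at k+1 minus the standard error of the reference simulations at k+1.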
6

A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure

Rawashdeh, Mohammad Y. 23 October 2014 (has links)
No description available.
7

Reporting and analyzing alternative clustering solutions by employing multi-objective genetic algorithm and conducting experiments on cancer data

Peng, P., Addam, O., Elzohbi, M., Ozyer, S., Elhajj, Ahmad, Gao, S., Liu, Y., Ozyer, T., Kaya, M., Ridley, Mick J., Rokne, J., Alhajj, R. 14 November 2013 (has links)
Clustering is an essential research problem that has received considerable attention in the research community for decades. It is a challenge because no unique solution fits all problems and satisfies all applications, so we aim at the most appropriate clustering solution for a given application domain. Clustering algorithms in general require prior specification of the number of clusters, which is hard even for domain experts to estimate, especially in a dynamic environment where the data change and/or become available incrementally. In this paper, we describe and analyze the effectiveness of a robust clustering algorithm, the Multi-objective K-Means Genetic Algorithm (MOKGA), which integrates a multi-objective genetic algorithm into a framework capable of producing alternative clustering solutions. We investigate its application to clustering a variety of datasets, including microarray gene expression data, and the reported results are promising. Though we concentrate on gene expression and mostly cancer data, the proposed approach is general enough to cluster other datasets equally well, as demonstrated with the Iris and Ruspini datasets. After running MOKGA, a Pareto-optimal front is obtained, giving the candidate numbers of clusters as a solution set. The achieved clustering results are then analyzed and validated under several cluster validity techniques proposed in the literature; the candidate clusterings are ranked by each validity index. We apply majority voting to decide on the most appropriate set of validity indexes applicable to every tested dataset. The proposed clustering approach is tested by conducting experiments using seven well-cited benchmark data sets, and the obtained results are compared with those reported in the literature to demonstrate the applicability and effectiveness of the proposed approach.
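The abstract does not spell out the voting scheme; the snippet below is one plausible reading, in which each validity index casts a vote for the cluster count it ranks best and the majority wins. The helper name and the tie-break rule are assumptions, not the paper's definitions.

```python
from collections import Counter

def majority_vote_best_k(best_k_per_index):
    """Pick the cluster count most often ranked best by the validity indexes.

    best_k_per_index maps index name -> the number of clusters that index
    ranked best; ties are broken in favor of the smaller k (an assumed
    convention).
    """
    votes = Counter(best_k_per_index.values())
    k, _ = max(votes.items(), key=lambda kv: (kv[1], -kv[0]))
    return k

# Example: three indexes prefer k=4, one prefers k=5.
print(majority_vote_best_k({
    "silhouette": 4, "dunn": 4, "davies_bouldin": 4, "c_index": 5,
}))  # -> 4
```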
8

Projection separability: A new approach to evaluate embedding algorithms in the geometrical space

Acevedo Toledo, Aldo Marcelino 06 February 2024 (has links)
Evaluating separability is fundamental to pattern recognition. A plethora of embedding methods, such as dimension reduction and network embedding algorithms, have been developed to reveal the emergence of geometrical patterns in a low-dimensional space, where high-dimensional sample and node similarities are approximated by geometrical distances. However, statistical measures to evaluate the separability attained by the embedded representations are missing. Traditional cluster validity indices (CVIs) might be applied in this context, but they present multiple limitations because they are not specifically tailored for evaluating the separability of embedded results. This work introduces a new rationale called projection separability (PS), which provides a methodology expressly designed to assess the separability of data samples in a reduced (i.e., low-dimensional) geometrical space. In a first case study, using this rationale, a new class of indices named projection separability indices (PSIs) is implemented based on four statistical measures: Mann-Whitney U-test p-value, Area Under the ROC-Curve, Area Under the Precision-Recall Curve, and Matthews Correlation Coefficient. The PSIs are compared to six representative cluster validity indices and one geometrical separability index using seven nonlinear datasets and six different dimension reduction algorithms. In a second case study, the PS rationale is extended to define and measure the geometric separability (linear and nonlinear) of mesoscale patterns in complex data visualization by solving the traveling salesman problem, offering experimental evidence on the evaluation of community separability of network embedding results using eight real network datasets and three network embedding algorithms. The results of both studies provide evidence that the implemented statistical-based measures designed on the basis of the PS rationale are more accurate than the other indices and can be adopted not only for evaluating and comparing the separability of embedded results in the low-dimensional space but also for fine-tuning embedding algorithms’ hyperparameters. Besides these advantages, the PS rationale can be used to design new statistical-based separability measures other than the ones presented in this work, providing the community with a novel and flexible framework for assessing separability.
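A sketch of the projection-separability idea for a single pair of groups, scoring 1-D projections onto the line through the group centroids with three of the four statistics named above (Mann-Whitney U-test p-value, AUC, and MCC). The pairwise projection and the midpoint threshold used for the MCC are simplifying assumptions; the published PSIs aggregate such scores over all group pairs.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import matthews_corrcoef, roc_auc_score

def pairwise_separability(emb, labels, a, b):
    """Separability scores for groups a and b in a low-dimensional embedding.

    emb: (n, d) embedded coordinates; labels: (n,) group labels.
    """
    Xa, Xb = emb[labels == a], emb[labels == b]
    # Project both groups onto the line through their centroids.
    direction = Xb.mean(axis=0) - Xa.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    proj = np.vstack([Xa, Xb]) @ direction
    y = np.r_[np.zeros(len(Xa), dtype=int), np.ones(len(Xb), dtype=int)]

    auc = roc_auc_score(y, proj)
    _, pval = mannwhitneyu(proj[y == 0], proj[y == 1])
    # MCC needs a hard split; threshold at the midpoint between the
    # projected centroids (an assumed convention for this sketch).
    mid = (proj[y == 0].mean() + proj[y == 1].mean()) / 2
    mcc = matthews_corrcoef(y, (proj >= mid).astype(int))
    return {"auc": auc, "mannwhitney_p": pval, "mcc": mcc}
```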
