341 |
Robust clustering algorithms
Gupta, Pramod 05 April 2011
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because they are simple and their output is easy to interpret. However, many of these algorithms lack any performance guarantees when the data is noisy, incomplete or has outliers, which is the case for most real-world data. It is well known that standard linkage algorithms perform extremely poorly in the presence of noise. In this work we propose two new robust algorithms for bottom-up agglomerative clustering and give formal theoretical guarantees for their robustness. We show that our algorithms can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also extend our algorithms to an inductive setting with similar guarantees: we randomly choose a small subset of points from a much larger instance space, generate a hierarchy over this sample, and then insert the remaining points into it to obtain a hierarchy over the entire instance space. Finally, we carry out a systematic experimental analysis of various linkage algorithms, compare their performance on a variety of real-world data sets, and show that our algorithms handle various forms of noise much better than other hierarchical algorithms.
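The sensitivity of standard linkage rules to outliers can be seen on a toy example. The sketch below is not one of the robust algorithms proposed in the thesis; it is a naive bottom-up procedure with the classical single, complete and average linkage rules, applied to hypothetical one-dimensional data made of two tight groups and a single outlier. The function name `agglomerative` and the data values are assumptions made for illustration only.

```python
from itertools import combinations

def agglomerative(points, k, linkage="single"):
    """Naive bottom-up clustering of 1-D points: merge the two closest
    clusters until only k remain. Illustrative only, roughly O(n^3)."""
    clusters = [[p] for p in points]

    def dist(a, b):
        gaps = [abs(x - y) for x in a for y in b]
        if linkage == "single":       # distance of the closest pair
            return min(gaps)
        if linkage == "complete":     # distance of the farthest pair
            return max(gaps)
        return sum(gaps) / len(gaps)  # average linkage

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]    # i < j, so deleting j keeps i valid
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical data: two tight groups plus one outlier at 20.0.
data = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0, 20.0]
two = {name: agglomerative(data, 2, name)
       for name in ("single", "complete", "average")}
three = agglomerative(data, 3, "single")
```

On this data, asking any of the three linkage rules for two clusters isolates the outlier and lumps both groups together; the intended groups only appear when three clusters are requested. This is the kind of behaviour that motivates algorithms with explicit robustness guarantees.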
|
342 |
Functional Characterization of the NSF1 (YPL230W) Gene using Correlation Clustering and Genetic Analysis in Saccharomyces Cerevisiae
Bessonov, Kyrylo 09 January 2012
High-throughput technologies such as microarrays and modern genome sequencers produce enormous amounts of data that require novel data processing. This thesis proposes a method called Interdependent Correlation Cluster (ICC) to analyze the relations between genes represented by microarray data that are conditioned on a specific target gene. Based on Correlation Clustering, the proposed method analyzes a large set of correlation values related to the gene expression profiles extracted from given microarray datasets. The method works on microarray datasets of any size and can be applied to any target gene. In this study the selected target gene, NSF1 / USV1 / YPL230W, encodes a poorly characterized C2H2 zinc finger transcription factor (TF) involved in stress responses in yeast. The method successfully identifies novel functional roles of NSF1 during fermentation stress conditions in the M2 industrial yeast strain. The newly identified functions include regulation of energy and sulfur metabolism, protein synthesis, ribosomal assembly and protein trafficking, as well as other processes. NSF1 involvement in sulfur metabolism was confirmed experimentally using biological laboratory techniques. Importantly, the role of NSF1 in regulating sulfur metabolism is highly relevant to the wine and beer production industries, which are concerned with the production of compounds having sulfur-like off odour (SLO) and toxic properties. Correlation clustering also provides a means of understanding complex interactions between genes. / The pdf file contains numerous hyperlinks and bookmarks to facilitate navigation. This thesis will be of interest to those working on topics such as data mining of microarray data, novel gene function discovery and prediction, and genome-wide responses to fermentation stresses.
/ Ministry of Training, Colleges and Universities of Ontario (Ontario Graduate Scholarship and Ontario Graduate Scholarships in Science and Technology); The Natural Sciences and Engineering Research Council of Canada (NSERC)
|
343 |
Improving Search Results with Automated Summarization and Sentence Clustering
Cotter, Steven 23 March 2012
Have you ever searched for something on the web and been overloaded with irrelevant results? Many search engines tend to cast a very wide net and rely on ranking to show you the relevant results first. But, this doesn't always work. Perhaps the occurrence of irrelevant results could be reduced if we could eliminate the unimportant content from each webpage while indexing. Instead of casting a wide net, maybe we can make the net smarter. Here, I investigate the feasibility of using automated document summarization and clustering to do just that. The results indicate that such methods can make search engines more precise, more efficient, and faster, but not without costs. / McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
|
344 |
Analyse des différences dans le Big Data : Exploration, Explication, Évolution / Difference Analysis in Big Data: Exploration, Explanation, Evolution
Kleisarchaki, Sofia 28 November 2016
Variability in Big Data refers to data whose meaning changes continuously. For instance, data derived from social platforms and from monitoring applications exhibit great variability. This variability results essentially from changes in the underlying distributions of attributes of interest, such as user opinions and ratings or computer network measurements. Difference Analysis aims to study variability in Big Data. To achieve that goal, data scientists need (a) measures to compare data along various dimensions, such as age for users or topic for network traffic, and (b) efficient algorithms to detect changes in massive data. In this thesis, we identify and study three novel analytical tasks that capture data variability: Difference Exploration, Difference Explanation and Difference Evolution.
Difference Exploration is concerned with extracting the opinion of different user segments (e.g., on a movie rating website). We propose appropriate measures for comparing user opinions in the form of rating distributions, and efficient algorithms that, given an opinion of interest in the form of a rating histogram, discover agreeing and disagreeing populations. Difference Explanation tackles the question of providing a succinct explanation of the differences between two datasets of interest (e.g., the buying habits of two sets of customers). We propose scoring functions designed to rank explanations, and algorithms that guarantee explanations that are both concise and informative. Finally, Difference Evolution tracks change in an input dataset over time and summarizes that change at multiple time granularities. We propose a query-based approach that uses similarity measures to compare consecutive clusters over time. Our indexes and algorithms for Difference Evolution are designed to handle different data arrival rates (e.g., low, high) and different types of change (e.g., sudden, incremental). The utility and scalability of all our algorithms rely on hierarchies inherent in the data (e.g., temporal, demographic).
We run extensive experiments on real and synthetic datasets to validate the usefulness of the three analytical tasks and the scalability of our algorithms. We show that Difference Exploration guides end-users and data scientists in uncovering the opinion of different user segments in a scalable way. Difference Explanation reveals the need to summarize the differences between two datasets parsimoniously and shows that parsimony can be achieved by exploiting hierarchy in the data. Finally, our study of Difference Evolution provides strong evidence that a query-based approach is well suited to tracking change in datasets with varying arrival rates and at multiple time granularities. Similarly, we show that different clustering approaches can be used to capture different types of change.
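The abstract does not name the measures used to compare rating distributions. As a hedged illustration of Difference Exploration, the sketch below assumes a one-dimensional earth mover's distance between normalized rating histograms and a fixed agreement threshold; the function names, the segment names and the threshold value are all hypothetical and are not taken from the thesis.

```python
def normalize(hist):
    """Turn raw rating counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]

def emd(hist_a, hist_b):
    """1-D earth mover's distance between two rating histograms whose
    bins are consecutive rating values (e.g. 1 to 5 stars)."""
    carried, cost = 0.0, 0.0
    for a, b in zip(normalize(hist_a), normalize(hist_b)):
        carried += a - b      # mass that must still move to later bins
        cost += abs(carried)
    return cost

def split_segments(target, segments, threshold):
    """Partition segment names into (agreeing, disagreeing) populations
    relative to the target opinion. The threshold is an assumption."""
    agree = sorted(n for n, h in segments.items() if emd(target, h) <= threshold)
    disagree = sorted(n for n in segments if n not in agree)
    return agree, disagree

# Hypothetical opinion of interest and user segments (counts per star rating).
target = [0, 0, 0, 5, 5]
segments = {"under_25": [0, 0, 0, 4, 6], "over_60": [6, 4, 0, 0, 0]}
agree, disagree = split_segments(target, segments, threshold=0.5)
```

An earth mover's distance is a natural choice here because rating bins are ordered, so a histogram shifted by one star counts as closer than one shifted by four; other measures would fit the same interface.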
|
345 |
Agrupamento espectral através de grafos Laplacianos e uma aplicação no cultivo da soja / Spectral clustering through Laplacian graphs and an application in soybean cultivation
Moura, Larissa. January 2018
Advisor: Alice Kimie Miwa Libardi / Committee: Thiago de Melo / Committee: Washington Mio / Abstract: The main goal of this dissertation is to present a detailed account of the paper "A Tutorial on Spectral Clustering" by U. von Luxburg, which covers clustering through graph Laplacians and their properties, and to show some results from clustering theory. In addition, three clustering algorithms are presented, and one of them is illustrated with an application to soybean cultivation under different growing conditions. / Master's degree
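Two basic properties of the unnormalized graph Laplacian L = D - W, both stated in von Luxburg's tutorial, can be checked by hand on a small graph: the quadratic form f'Lf equals half the weighted sum of squared differences (f_i - f_j)^2, and the indicator vector of each connected component is mapped to zero by L. The sketch below uses a hypothetical five-node similarity graph with two components; the weights and helper names are assumptions for illustration and do not come from the dissertation.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    degree = [sum(row) for row in W]
    return [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def matvec(M, v):
    """Matrix-vector product with plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def quadratic_form(L, f):
    """Compute f' L f."""
    return sum(fi * y for fi, y in zip(f, matvec(L, f)))

# Hypothetical similarity graph with two connected components: {0, 1, 2} and {3, 4}.
W = [
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [0.5, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
L = laplacian(W)
indicator_a = [1.0, 1.0, 1.0, 0.0, 0.0]  # membership in component {0, 1, 2}
indicator_b = [0.0, 0.0, 0.0, 1.0, 1.0]  # membership in component {3, 4}
```

Because both indicator vectors lie in the null space of L, the eigenvalue 0 has multiplicity two here, matching the number of components. Spectral clustering algorithms build on this by clustering the rows of the matrix formed from the first few eigenvectors; computing those eigenvectors needs a numerical library and is left out of this sketch.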
|
346 |
Sistemáticas de agrupamento de países com base em indicadores de desempenho / Countries clustering systematics based on performance indexes
Mello, Paula Lunardi de January 2017
The world economy underwent major transformations in the last century, including periods of sustained growth followed by stagnation, governments alternating between market liberalization strategies and trade protectionism policies, and market instability, among others. As an aid to understanding economic and social problems systemically, the analysis of performance indicators can generate relevant information about behavior patterns and trends, and can guide policies and strategies for improving economic and social results. Indicators that describe the main economic dimensions of a country can be used as guides in designing and monitoring that country's development and growth policies. Accordingly, this dissertation uses World Bank data to apply and evaluate systematic procedures for grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), variable selection (through the "leave one variable out at a time" technique) and dimensionality reduction (through Principal Component Analysis) in order to form consistent groups of countries. The quality of the generated clusters is evaluated with the Silhouette, Calinski-Harabasz and Davies-Bouldin indexes. The results were satisfactory regarding both the representativeness of the highlighted indicators and the quality of the resulting clustering.
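Of the three validity indexes named above, the Silhouette index is the simplest to compute by hand. The sketch below is a minimal version for one-dimensional data with absolute difference as the distance, which is an assumption; the dissertation works with multi-indicator country data and its exact distance and implementation are not given in the abstract. It needs at least two clusters.

```python
def silhouette(points, labels):
    """Mean silhouette width for 1-D points. Values near 1 indicate tight,
    well-separated clusters; negative values indicate likely misassignment.
    Points in singleton clusters score 0 by convention."""
    members = {}
    for p, lab in zip(points, labels):
        members.setdefault(lab, []).append(p)

    def mean_dist(p, group):
        return sum(abs(p - q) for q in group) / len(group)

    scores = []
    for p, lab in zip(points, labels):
        own = members[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(p, grp) for other, grp in members.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical indicator values for four countries, with two candidate groupings.
values = [0.0, 1.0, 10.0, 11.0]
good = silhouette(values, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette(values, [0, 1, 0, 1])   # neighbours split apart
```

Comparing `good` and `bad` shows how such an index can rank alternative groupings of the same data, which is how validity indexes are typically used to choose among clustering configurations.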
|
347 |
Agrupamento espectral através de grafos Laplacianos e uma aplicação no cultivo da soja. / Spectral clustering through Laplacian graphs and an application in soybean cultivation.
Moura, Larissa 16 February 2018
Previous issue date: 2018-02-16 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / The main goal of this dissertation is to present a detailed account of the paper "A Tutorial on Spectral Clustering" by U. von Luxburg, which covers clustering through graph Laplacians and their properties, and to show some results from clustering theory. In addition, three clustering algorithms are presented, and one of them is illustrated with an application to soybean cultivation under different growing conditions.
|
348 |
Contribuições aos Processos de Clustering com Base em Métricas não-Euclidianas
Martins, Allan de Medeiros 08 March 2005
Previous issue date: 2005-03-08 / In this work we present a new clustering method that groups the points of a data set into classes. The method is based on an algorithm that links auxiliary clusters obtained with traditional vector quantization techniques. Several approaches developed during the work are described; they rely on measures of distance or dissimilarity (divergence) between the auxiliary clusters. The new method uses only two pieces of a priori information: the number of auxiliary clusters Na and a threshold distance dt that decides whether two auxiliary clusters are linked. The number of classes can be found automatically by the method, based on the chosen threshold distance dt, or it can be supplied as additional information to help choose the correct threshold. Some analyses are carried out and the results are compared with traditional clustering methods. Different dissimilarity metrics are analyzed and a new one, based on the concept of negentropy, is proposed. Besides grouping the points of a set into classes, a method is proposed for statistically modeling the classes in order to obtain an expression for the probability that a point belongs to one of the classes. Experiments with several values of Na and dt are run on test sets, and the results are analyzed to study the robustness of the method and to propose heuristics for choosing the correct threshold. The work also explores aspects of information theory applied to the calculation of the divergences, in particular the different measures of information and divergence based on the Rényi entropy. The results obtained with the different metrics are compared and discussed. The work also includes appendices presenting real applications of the proposed method.
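The linking step described above can be sketched as follows, under strong simplifying assumptions. Here the auxiliary clusters are represented only by their centers, and two of them are linked when the Euclidean distance between centers is below dt; the thesis itself studies several dissimilarity measures, including divergences based on negentropy and the Rényi entropy, which are not reproduced here. The function name and the example centers are hypothetical.

```python
from math import dist

def link_auxiliary_clusters(centers, dt):
    """Link every pair of auxiliary-cluster centers closer than dt and
    return one class label per center (classes = connected components).
    Euclidean distance is a stand-in for the thesis's divergence measures."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if dist(centers[i], centers[j]) < dt:
                parent[find(j)] = find(i)

    roots = {}  # maps each component root to a compact class label
    return [roots.setdefault(find(i), len(roots)) for i in range(len(centers))]

# Hypothetical centers of Na = 5 auxiliary clusters from a vector quantizer.
centers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels = link_auxiliary_clusters(centers, dt=1.5)
num_classes = len(set(labels))
```

The number of classes falls out of the threshold: a very small dt leaves every auxiliary cluster in its own class, while a very large dt merges them all. That trade-off is why the choice of dt calls for the heuristics mentioned in the abstract.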
|
350 |
Development of a hierarchical k-selecting clustering algorithm – application to allergy.
Malm, Patrik January 2007
The objective of this Master's thesis was to develop, implement and evaluate an iterative procedure for hierarchical clustering with good overall performance that also merges features of certain previously described algorithms into a single integrated package. The resulting tool was then applied to an allergen IgE-reactivity data set. The implemented algorithm uses a hierarchical approach that illustrates the emergence of patterns in the data. At each level of the hierarchical tree, a partitional clustering method divides the data into k groups, where k is chosen by applying cluster validation techniques. The cross-reactivity analysis performed with the new algorithm largely arrives at the anticipated cluster formations in the allergen data, which strengthens results obtained in previous studies on the subject. Notably, though, certain unexpected findings presented in the earlier analysis were aggregated differently by the novel clustering package, and more in line with phylogenetic and protein family relationships.
|