121 |
Approach to Evaluating Clustering Using Classification Labelled DataLuu, Tuong January 2010 (has links)
Cluster analysis has been identified as a core task in data mining for which many different algorithms have been proposed. The diversity, on one hand, provides us a wide collection of tools. On the other hand, the profusion of options easily causes confusion. Given a particular task, users do not know which algorithm is good since it is not clear how clustering algorithms should be evaluated. As a consequence, users often select clustering algorithm in a very adhoc manner.
A major challenge in evaluating clustering algorithms is the scarcity of real data
with a "correct" ground truth clustering. This is in stark contrast
to the situation for classification tasks, where there
are abundantly many data sets labeled with their correct classifications. As a result, clustering research often relies on labeled data to evaluate and compare the results of clustering algorithms.
We present a new perspective on how to use labeled data for evaluating clustering algorithms, and develop an approach for comparing clustering algorithms on the basis of classification labeled data. We then use this approach to support a novel technique for choosing among clustering algorithms when no labels are available.
We use these tools to demonstrate that the utility of an algorithm depends on the specific clustering task. Investigating a set of common clustering
algorithms, we demonstrate that there are cases where each one of them
outputs better clusterings. In contrast to the current trend of looking for a superior clustering algorithm, our findings demonstrate the need for a variety of different clustering algorithms.
|
122 |
Effect of pendant distribution on the dispersancy of maleated ethylene propyleneAraya, Andrea 27 September 2011 (has links)
This study describes how changes made to the modification of a polyolefin affect the solution properties of these modified polyolefins in apolar solvents. The modified polyolefins of interest are maleated ethylene-propylene random copolymers (EP-MAH) reacted with N-phenyl-p-phenylenediamine (NP3D) to yield NP3D-EP-MAH. NP3D-EP-MAH is used as a dispersant by the oil-additive industry and solution properties such as self-aggregation, rheological behaviour, and its efficiency at stabilizing carbon black particles (CBPs) were investigated. The maleation of the polyolefin was characterized in terms of succinic anhydride (SAH) content and level of SAH clustering along the polymer backbone by FT-IR and UV-Vis absorption and steady-state and time-resolved fluorescence. The self-aggregation of the modified polyolefins was characterized in hexane by replacing NP3D with 1-pyrenemethylamine and using fluorescence to probe excimer formation between an excited and a ground-state pyrene. The rheological behaviour exhibited by the solutions of modified polyolefins was characterized from the viscosity profiles of the solutions obtained as a function of polymer concentration. Finally, the adsorption of the modified polyolefins onto CBPs was characterized by analysis of Langmuir isotherms, which yields both the equilibrium constant and the maximum coverage for the binding of the modified polyolefins onto CBPs. The conclusions reached in this thesis are that clustering of the SAH pendants along the EP backbone enhances the ability of the modified polyolefin to self-aggregate in apolar solution. In turn, self-aggregation led to enhanced thickening of the NP3D-EP-MAH solutions and stronger adsorption onto CBPs. This thesis establishes how the level of SAH clustering affects self-association and establishes its consequence on the rheological properties and adsorption isotherms of NP3D-EP-MAH samples in apolar solvents.
|
123 |
Towards Theoretical Foundations of ClusteringAckerman, Margareta January 2012 (has links)
Clustering is a central unsupervised learning task with a wide variety of applications. Unlike in supervised learning, different clustering algorithms may yield dramatically different outputs for the same input sets. As such, the choice of algorithm is crucial. When selecting a clustering algorithm, users tend to focus on cost-related considerations, such as running times, software purchasing costs, etc. Yet differences concerning the output of the algorithms are a more primal consideration. We propose an approach for selecting clustering algorithms based on differences in their input-output behaviour. This approach relies on identifying significant properties of clustering algorithms and classifying algorithms based on the properties that they satisfy.
We begin with Kleinberg's impossibility result, which relies on concise abstract properties that are well-suited for our approach. Kleinberg showed that three specific properties cannot be satisfied by the same algorithm. We illustrate that the impossibility result is a consequence of the formalism used, proving that these properties can be formulated without leading to inconsistency in the context of clustering quality measures or algorithms whose input requires the number of clusters.
Combining Kleinberg's properties with newly proposed ones, we provide an extensive property-base classification of common clustering paradigms. We use some of these properties to provide a novel characterization of the class of linkage-based algorithms. That is, we distil a small set of properties that uniquely identify this family of algorithms.
Lastly, we investigate how the output of algorithms is affected by the addition of small, potentially adversarial, sets of points. We prove that given clusterable input, the output of $k$-means is robust to the addition of a small number of data points. On the other hand, clusterings produced by many well-known methods, including linkage-based techniques, can be changed radically by adding a small number of elements.
|
124 |
Bayesian cluster validationKoepke, Hoyt Adam 11 1900 (has links)
We propose a novel framework based on Bayesian principles for validating clusterings and present efficient algorithms for use with centroid or exemplar based clustering solutions. Our framework treats the data as fixed and introduces perturbations into the clustering procedure. In our algorithms, we scale the distances between points by a random variable whose distribution is tuned against a baseline null dataset. The random variable is integrated out, yielding a soft assignment matrix that gives the behavior under perturbation of the points relative to each of the clusters. From this soft assignment matrix, we are able to visualize inter-cluster behavior, rank clusters, and give a scalar index of the the clustering stability. In a large test on synthetic data, our method matches or outperforms other leading methods at predicting the correct number of clusters. We also present a theoretical analysis of our approach, which suggests that it is useful for high dimensional data.
|
125 |
Clustering biological data using a hybrid approach : Composition of clusterings from different featuresKeller, Jens January 2008 (has links)
<p>Clustering of data is a well-researched topic in computer sciences. Many approaches have been designed for different tasks. In biology many of these approaches are hierarchical and the result is usually represented in dendrograms, e.g. phylogenetic trees. However, many non-hierarchical clustering algorithms are also well-established in biology. The approach in this thesis is based on such common algorithms. The algorithm which was implemented as part of this thesis uses a non-hierarchical graph clustering algorithm to compute a hierarchical clustering in a top-down fashion. It performs the graph clustering iteratively, with a previously computed cluster as input set. The innovation is that it focuses on another feature of the data in each step and clusters the data according to this feature. Common hierarchical approaches cluster e.g. in biology, a set of genes according to the similarity of their sequences. The clustering then reflects a partitioning of the genes according to their sequence similarity. The approach introduced in this thesis uses many features of the same objects. These features can be various, in biology for instance similarities of the sequences, of gene expression or of motif occurences in the promoter region. As part of this thesis not only the algorithm itself was implemented and evaluated, but a whole software also providing a graphical user interface. The software was implemented as a framework providing the basic functionality with the algorithm as a plug-in extending the framework. The software is meant to be extended in the future, integrating a set of algorithms and analysis tools related to the process of clustering and analysing data not necessarily related to biology.</p><p>The thesis deals with topics in biology, data mining and software engineering and is divided into six chapters. The first chapter gives an introduction to the task and the biological background. It gives an overview of common clustering approaches and explains the differences between them. Chapter two shows the idea behind the new clustering approach and points out differences and similarities between it and common clustering approaches. The third chapter discusses the aspects concerning the software, including the algorithm. It illustrates the architecture and analyses the clustering algorithm. After the implementation the software was evaluated, which is described in the fourth chapter, pointing out observations made due to the use of the new algorithm. Furthermore this chapter discusses differences and similarities to related clustering algorithms and software. The thesis ends with the last two chapters, namely conclusions and suggestions for future work. Readers who are interested in repeating the experiments which were made as part of this thesis can contact the author via e-mail, to get the relevant data for the evaluation, scripts or source code.</p>
|
126 |
A Novel Cooperative Algorithm for Clustering Large Databases With Sampling.FABRIS, F. 30 July 2012 (has links)
Made available in DSpace on 2016-08-29T15:33:17Z (GMT). No. of bitstreams: 1
tese_5121_.pdf: 735975 bytes, checksum: aeffd7d6fc81e4f73c1f18fb633dc4e1 (MD5)
Previous issue date: 2012-07-30 / Agrupamento de dados é uma tarefa recorrente em mineração de dados. Com o passar do tempo, vem se tornando mais importante o agrupamento de bases cada vez maiores. Contudo, aplicar heurísticas de agrupamento tradicionais em grandes bases não é uma tarefa fácil. Essas técnicas geralmente possuem complexidades pelo menos quadráticas no número de pontos da base, tornando o seu uso inviável pelo alto tempo de resposta ou pela baixa qualidade da solução final. A solução mais comumente utilizada para resolver o problema de agrupamento em bases de dados grandes é usar algoritmos especiais, mais fracos no ponto de vista da qualidade. Este
trabalho propõe uma abordagem diferente para resolver esse problema: o uso de algoritmos tradicionais, mais fortes, em um sub-conjunto dos dados originais. Esse sub-conjunto dos dados
originais é obtido com uso de um algoritmo co-evolutivo que seleciona um sub-conjunto de pontos difícil de agrupar.
|
127 |
Vers un système interactif de structuration des index pour une recherche par le contenu dans des grandes bases d'images / Towards an interactive index structuring system for content-based image retrieval in large image databasesLai, Hien Phuong 02 October 2013 (has links)
Cette thèse s'inscrit dans la problématique de l'indexation et la recherche d'images par le contenu dans des bases d'images volumineuses. Les systèmes traditionnels de recherche d'images par le contenu se composent généralement de trois étapes : l'indexation, la structuration et la recherche. Dans le cadre de cette thèse, nous nous intéressons plus particulièrement à l'étape de structuration qui vise à organiser, dans une structure de données, les signatures visuelles des images extraites dans la phase d'indexation afin de faciliter, d'accélérer et d'améliorer les résultats de la recherche ultérieure. A la place des méthodes traditionnelles de structuration, nous étudions les méthodes de regroupement des données (clustering) qui ont pour but d'organiser les signatures en groupes d'objets homogènes (clusters), sans aucune contrainte sur la taille des clusters, en se basant sur la similarité entre eux. Afin de combler le fossé sémantique entre les concepts de haut niveau sémantique exprimés par l'utilisateur et les signatures de bas niveau sémantique extraites automatiquement dans la phase d'indexation, nous proposons d'impliquer l'utilisateur dans la phase de clustering pour qu'il puisse interagir avec le système afin d'améliorer les résultats du clustering, et donc améliorer les résultats de la recherche ultérieure. En vue d'impliquer l'utilisateur dans la phase de clustering, nous proposons un nouveau modèle de clustering semi-supervisé interactif en utilisant les contraintes par paires (must-link et cannot-link) entre les groupes d'images. Tout d'abord, les images sont regroupées par le clustering non supervisé BIRCH (Zhang et al., 1996). Ensuite, l'utilisateur est impliqué dans la boucle d'interaction afin d'aider le clustering. Pour chaque itération interactive, l'utilisateur visualise les résultats de clustering et fournit des retours au système via notre interface interactive. Par des simples cliques, l'utilisateur peut spécifier les images positives ainsi que les images négatives pour chaque cluster. Il peut aussi glisser les images entre les clusters pour demander de changer l'affectation aux clusters des images. Les contraintes par paires sont ensuite déduites en se basant sur les retours de l'utilisateur ainsi que les informations de voisinage. En tenant compte de ces contraintes, le système réorganise les clusters en utilisant la méthode de clustering semi-supervisé proposée dans cette thèse. La boucle d'interaction peut être répétée jusqu'à ce que le résultat du clustering satisfasse l'utilisateur. Différentes stratégies pour déduire les contraintes par paires entre les images sont proposées. Ces stratégies sont analysées théoriquement et expérimentalement. Afin d'éviter que les résultats expérimentaux dépendent subjectivement de l'utilisateur humain, un agent logiciel simulant le comportement de l'utilisateur humain pour donner des retours est utilisé pour nos expérimentations. En comparant notre méthode avec la méthode de clustering semi-supervisé la plus populaire HMRF-kmeans (Basu et al., 2004), notre méthode donne de meilleurs résultats. / This thesis deals with the problem of Content-Based Image Retrieval (CBIR) on large image databases. Traditional CBIR systems generally rely on three phases : feature extraction, feature space structuring and retrieval. In this thesis, we are particularly interested in the structuring phase, which aims at organizing the visual feature descriptors of all images into an efficient data structure in order to facilitate, accelerate and improve further retrieval. The visual feature descriptor of each image is extracted from the feature extraction phase. Instead of traditional structuring methods, clustering methods which aim at organizing image descriptors into groups of similar objects (clusters), without any constraint on the cluster size, are studied. In order to reduce the “semantic gap” between high-level semantic concepts expressed by the user and the low-level features automatically extracted from the images, we propose to involve the user in the clustering phase so that he/she can interact with the system so as to improve the clustering results, and thus improve the results of further retrieval. With the aim of involving the user in the clustering phase, we propose a new interactive semi-supervised clustering model based on pairwise constraints (must-link and cannot-link) between groups of images. Firstly, images are organized into clusters by using the unsupervised clustering method BIRCH (Zhang et al., 1996). Then the user is involved into the interaction loop in order to guide the clustering process. In each interactive iteration, the user visualizes the clustering results and provide feedback to the system via our interactive interface. With some simple clicks, the user can specify the positive and/or negative images for each cluster. The user can also drag and drop images between clusters in order to change the cluster assignment of some images. Pairwise constraints are then deduced based on the user feedback as well as the neighbourhood information. By taking into account these constraints, the system re-organizes the data set, using the semi-supervised clustering proposed in this thesis. The interaction loop can be iterated until the clustering result satisfies the user. Different strategies for deducing pairwise constraints are proposed. These strategies are theoretically and experimentally analyzed. In order to avoid the subjective dependence of the clustering results on the human user, a software agent simulating the behaviour of the human user for providing feedback to the system is used in our experiments. By comparing our method with the most popular semi-supervised clustering HMRF-kmeans (Basu et al., 2004), our method gives better results.
|
128 |
Segmentación de los Contribuyentes que Declaran IVA Utilizando Técnicas de Data MiningLückeheide Codjambassis, Sandra January 2007 (has links)
No description available.
|
129 |
Auto-organisation des réseaux sans-fil multi-sauts dans les villes intelligentes / Self-organisation of multi-hop wireless networks in smart citiesDucrocq, Tony 15 November 2013 (has links)
Les villes actuelles sont de plus en plus connectées. Les compteurs sont relevés sans-fil et à distance. Les luminaires des villes communiquent pour économiser l’énergie tandis que les engins de ramassage communiquent avec les poubelles pour planifier les tournées. Ces réseaux sont sans infrastructure et les nœuds capteurs puisent leur énergie dans une batterie limitée.Dans cette thèse j'analyse les problématiques des réseaux sans-fil muti-sauts dans les villes intelligentes. J’étudie dans un premier temps l’importance et l’impact de la topologie sur les performances réseau. Plus précisément, au travers de simulations et d’études expérimentales, je démontre que le placement des nœuds impacte les performances des algorithmes de routage géographique. Je propose ensuite une famille d’algorithmes de clustering pour réseaux de capteurs sans-fil reposant sur l’hypothèse qu’un chef de cluster consomme plus d’énergie que les autres nœuds. L’idée principale de ces algorithmes est que le rôle de cluster-head doit être attribué en fonction du niveau d’énergie des nœuds et de leur voisinage. Ces algorithmes ont été testés grâce à des simulations sur des topologies de villes réalistes avec des paramètres de simulation tirés du monde réel. Enfin, je propose un algorithme de routage pour les villes intelligentes ayant pour base deux techniques de routage. Il repose sur l'hypothèse que seulement une partie des nœuds dispose de l'information de sa position. Je montre qu’il est possible d’obtenir des performances proches des algorithmes de routage géographique, même sous cette hypothèse. / In recent years, cities have become more and more connected. Readings of the electricity usage are being performed wirelessly. In order to reduce the energy consumption, lampposts became intelligent, while the garbage trucks communicate with dust bins in order to plan the garbage collection tours. These networks often lack the infrastructure and sensor nodes rely on a battery with a limited capacity. In this thesis, I analyze the issues of multi-hop wireless networks in smart cities. First, I study the significance and impact of network topologies on network performance. More precisely, through simulations and an experimental study, I show that the placement of nodes impacts the performance of geographic routing algorithms. Then, I propose a set of clustering algorithms for wireless sensor networks that optimize the lifetime of the network. The key hypothesis is that a cluster-head is consuming more energy than a regular node. This set of algorithms, named BLAC, creates multi-hop clusters in which each cluster-head is the root of a tree. The main idea is that the role of each cluster-head should be assigned regarding the remaining energy of nodes and their neighborhood. These algorithms have been tested through simulations based on realistic city topologies with simulation parameters resembling the real world.In the end, I propose a routing algorithm for large scale smart cities that combines two geographic routing techniques. It relies on the hypothesis that only a fraction of nodes in the network are aware of their positions. I show that it is possible to achieve the performance close to classical routing algorithms, even under this hypothesis.
|
130 |
Socket Migration for OpenMosixBowker, Ethan January 1900 (has links)
Master of Science / Department of Electrical and Computer Engineering / Dwight D. Day / Process migration is a technique in clustering and distributed computing by which parallel applications can be dynamically moved between nodes in a cluster in response to differing phases of execution, which is of growing usefulness in the field of distributed computing.
A drawback to many recent implementations of process migration is that sockets for interprocess communication do not migrate with the process requiring communication to be rerouted through the process' starting, or home, node, resulting in reduced communications performance when the process is migrated away from its home node.
This thesis focuses on the implemention a solution to this problem at the kernel level
for the OpenMosix process migration system with efficient socket handoff and cluster-wide unique addressing by reimplemting TCP on top of the existing network code in the Linux kernel.
Although falling short of the initial goal of fully transparent operation, this thesis presents a working implementation of migratable sockets for the OpenMosix process migration system that demonstrates working socket migration and improved performance over non-migrating sockets in OpenMosix.
|
Page generated in 0.0804 seconds