Global ETD Search

41	Εξόρυξη γνώσης από δεδομένα [Knowledge discovery from data] Οικονομάκης, Εμμανουήλ Κ. 20 October 2009 This master's thesis analyzes the problem of finding groups in data sets (data clustering). It briefly reviews current data clustering methods, with particular attention to the growing use of Evolutionary Algorithms (EAs). EAs have proven effective on a wide range of optimization problems, and since data clustering can be formulated as an optimization problem, their use here is natural. The thesis also presents a method for reducing the (usually) large dimensionality of clustering problems, which hinders the performance and stability of EAs. The first part gives an overview of the clustering problem and of the categories of algorithms proposed for finding clusters. It also describes data structures that clustering algorithms use for acceleration, such as Range trees and BBD trees. EAs are then described in detail, together with ways of applying them to clustering problems: how a clustering problem can be represented so that an EA can be used, and which forms the objective function can take. A new approach to applying EAs to clustering is introduced, which aims to free the process entirely from estimates of the number of clusters by determining that number automatically. Finally, existing EA-based clustering algorithms that follow the established approach, and well-known clustering algorithms such as k-means and DBSCAN, are compared with the proposed approach. The comparison is made on artificial data sets, each with its own characteristics. 004.35 Data mining Computational intelligence Data clustering Evolutionary algorithms
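The abstract notes that clustering can be posed as an optimization problem that an EA can search. The sketch below is one minimal way to illustrate that idea and is not the method proposed in the thesis: it assumes a centroid-list encoding, a sum-of-squared-errors objective, Gaussian mutation, and a fixed `k`, whereas the thesis aims to determine the number of clusters automatically. All function names and parameter values are hypothetical.

```python
import random


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )


def evolve_centroids(points, k, generations=100, pop_size=20, sigma=0.3, seed=0):
    """Tiny elitist EA; each individual is a list of k candidate centroids."""
    rng = random.Random(seed)
    # Initial population: k distinct data points per individual.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Mutation: perturb every centroid coordinate with Gaussian noise.
        children = [
            [tuple(x + rng.gauss(0.0, sigma) for x in c) for c in parent]
            for parent in population
        ]
        # Survivor selection: keep the best pop_size of parents plus children,
        # so the best objective value never gets worse.
        population = sorted(
            population + children, key=lambda ind: sse(points, ind)
        )[:pop_size]
    return population[0]


# Two small, well-separated groups of 2-D points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
best = evolve_centroids(points, k=2)
print(best, sse(points, best))
```

Because selection is elitist, the returned individual is never worse than the best randomly initialized one; how close it gets to the optimum depends on the assumed mutation scale and number of generations.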
42	Greedy Representative Selection for Unsupervised Data Analysis Helwa, Ahmed Khairy Farahat January 2012 In recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need to develop fast and accurate algorithms to discover useful information hidden in the data. This need is even more acute for unsupervised data, which lacks information about the categories of different instances. This dissertation addresses a crucial problem in unsupervised data analysis, which is the selection of representative instances and/or features from the data. This problem can be generally defined as the selection of the most representative columns of a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be directly used for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as outlined below. First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the subset of selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time- and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms in comparison to the state-of-the-art methods for column subset selection. Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. 
The algorithm can also calculate a Nyström approximation for a large kernel matrix based on the subset of selected columns. In comparison to different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimum overhead in run time. Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance. Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed. The algorithm is an application of the greedy column subset selection method presented in this dissertation. Similarly, the features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task. Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and it presents a unified framework which allows the use of the greedy algorithms presented in this dissertation to solve different related problems. Data Mining Machine Learning Unsupervised Data Analysis Greedy Algorithms Representative Selection Feature Selection Data Clustering Electrical and Computer Engineering
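The greedy selection criterion described in this abstract (pick the column whose addition most reduces the reconstruction error) can be sketched naively as follows. This is an illustration under stated assumptions, not the dissertation's algorithm: it keeps and deflates an explicit residual matrix rather than using the recursive formula the abstract mentions, so it is neither time- nor memory-efficient. The function name and the example matrix are hypothetical.

```python
def greedy_column_subset(A, k):
    """Greedily pick up to k column indices of A (given as a list of rows).

    At each step, choose the column of the residual E whose projection
    removes the most squared Frobenius error, then deflate E.
    """
    n_rows, n_cols = len(A), len(A[0])
    E = [list(row) for row in A]  # residual matrix, starts as a copy of A
    selected = []
    for _ in range(k):
        best_j, best_score = None, 0.0
        for j in range(n_cols):
            col = [E[i][j] for i in range(n_rows)]
            norm_sq = sum(x * x for x in col)
            if norm_sq < 1e-12:
                continue  # column already fully explained (or all zeros)
            # Error reduction: ||E^T e_j||^2 / ||e_j||^2
            proj = [
                sum(E[i][c] * col[i] for i in range(n_rows))
                for c in range(n_cols)
            ]
            score = sum(x * x for x in proj) / norm_sq
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # residual is zero; nothing left to explain
        selected.append(best_j)
        # Deflate: remove the component of every column along the chosen one.
        col = [E[i][best_j] for i in range(n_rows)]
        norm_sq = sum(x * x for x in col)
        coeffs = [
            sum(E[i][c] * col[i] for i in range(n_rows)) / norm_sq
            for c in range(n_cols)
        ]
        for i in range(n_rows):
            for c in range(n_cols):
                E[i][c] -= col[i] * coeffs[c]
    return selected


# Columns 0 and 1 are identical, so only one of them should be chosen.
A = [[3.0, 3.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
print(greedy_column_subset(A, 2))
```

Ties are broken in favor of the lowest column index, which is an arbitrary choice made for this sketch.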
43	Agrupamento de dados superparamagnético [Superparamagnetic data clustering] ALMEIDA, Evert Elvis Batista de 26 February 2009 We apply an unsupervised data clustering technique to identify patterns in several data sets. The technique maps the clustering problem onto an inhomogeneous granular magnet, whose physical behavior is studied with the Monte Carlo methods commonly used in statistical physics. Each data item is described by a set of numerical attributes, interpreted as a point in a Euclidean space of appropriate dimension. The mapping associates a Potts spin with each data point. The physical system is described by a disordered many-state Potts Hamiltonian in which the interaction between spins decays exponentially with the distance between them: similar items, which lie close together, interact strongly, whereas items farther apart interact only weakly. At sufficiently high temperatures the magnet reaches a superparamagnetic state in which the spins within certain grains remain strongly correlated while the grains themselves are only loosely linked. Each grain then corresponds to a group, or cluster. We implemented the method in the microcanonical ensemble, where the conserved total energy is the control parameter. The temperature is calculated during the simulation and, besides thermodynamically stable states, unstable and metastable states can also be sampled. We worked with three artificial data sets, in two and three dimensions, and one real four-dimensional data set. We obtained good results in all cases and discuss some issues concerning the microcanonical implementation of superparamagnetic data clustering. Data clustering Pattern recognition Microcanonical ensemble simulation
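To make the mapping in this abstract concrete, the following sketch computes a Potts-style energy for a given assignment of spins to data points. The abstract states only that the interaction decays exponentially with distance, so the exact coupling `exp(-d / a)`, the length scale `a`, and the sign convention of the energy are assumptions made here for illustration. The Monte Carlo sampling and the microcanonical ensemble are not shown.

```python
from math import dist, exp


def coupling(p, q, a=1.0):
    """Assumed interaction strength: decays exponentially with distance."""
    return exp(-dist(p, q) / a)


def potts_energy(points, spins, a=1.0):
    """Assumed Potts energy: each aligned pair (equal spins) lowers the
    energy by its coupling; misaligned pairs contribute nothing."""
    energy = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if spins[i] == spins[j]:
                energy -= coupling(points[i], points[j], a)
    return energy


# Two nearby points and one distant point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]
print(potts_energy(points, [0, 0, 1]))  # neighbors aligned
print(potts_energy(points, [0, 1, 1]))  # distant pair aligned instead
```

Under these assumptions, aligning the spins of nearby points lowers the energy far more than aligning distant ones, which is why strongly coupled groups of points behave like the grains described above.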
44	Seleção de algoritmos para a tarefa de agrupamento de dados: uma abordagem via meta-aprendizagem [Algorithm selection for the data clustering task: a meta-learning approach] Ferrari, Daniel Gomes 27 March 2014 / Natcomp Informatica e Equipamentos Eletronicos LTDA / Data clustering is an important data mining task that aims to segment a database into groups of objects based on their similarity or dissimilarity. Because clustering is unsupervised, the search for a good-quality solution can become a complex process. The academic literature currently offers a wide range of clustering algorithms, and selecting the most suitable one for a given problem can be slow and costly. In 1976, Rice formulated the algorithm selection problem (PSA, from the Portuguese Problema de Seleção de Algoritmos), postulating that a well-performing algorithm can be chosen according to the structural characteristics of the problem to which it will be applied. Meta-learning brings the concept of learning about learning: the meta-knowledge obtained from the learning process of the algorithms makes it possible to improve the performance of that process. Meta-learning has a large intersection with data mining in classification problems, where it is used to build algorithm selection systems. This thesis proposes an approach to the algorithm selection problem that uses meta-learning techniques for data clustering. The 84 problems are characterized both by the classical approach, based on the problems themselves, and by a new proposal based on the similarity among the objects. Ten internal indices provide different assessments of the performance of seven algorithms, and the combination of these indices determines the ranking of the algorithms. Several analyses assess how well the obtained meta-knowledge supports the mapping between the features of a problem and the performance of the algorithms. The results show that the new characterization approach and the method for combining the indices provide a good-quality algorithm selection mechanism for data clustering problems. data clustering meta-learning meta-knowledge algorithm selection CNPQ::ENGENHARIAS::ENGENHARIA ELETRICA
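The abstract says that several internal indices are combined to rank the algorithms but does not say how. One simple possibility, shown here purely as an assumption, is to rank the algorithms under each index and average the ranks. The index names, algorithm names, and scores below are made up, and the sketch assumes that a higher value is better for every index, which does not hold for all internal indices in practice.

```python
def rank_algorithms(scores):
    """Order algorithms by their mean rank across indices.

    scores maps index name -> {algorithm name: value}; higher is assumed
    better. Ties within an index are broken arbitrarily by sort order.
    """
    mean_rank = {}
    for per_algorithm in scores.values():
        ordered = sorted(per_algorithm, key=per_algorithm.get, reverse=True)
        for position, algorithm in enumerate(ordered, start=1):
            mean_rank[algorithm] = (
                mean_rank.get(algorithm, 0.0) + position / len(scores)
            )
    # Lower mean rank is better; the name is a deterministic tie-breaker.
    return sorted(mean_rank, key=lambda name: (mean_rank[name], name))


# Hypothetical index values for three algorithms on one problem.
scores = {
    "index_a": {"kmeans": 0.6, "dbscan": 0.4, "single_link": 0.2},
    "index_b": {"kmeans": 0.5, "dbscan": 0.7, "single_link": 0.1},
    "index_c": {"kmeans": 0.9, "dbscan": 0.3, "single_link": 0.8},
}
print(rank_algorithms(scores))
```

In a meta-learning setting, a ranking like this would serve as the target that a meta-model learns to predict from the characterization of each problem.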
45	Um modelo dinâmico de clusterização de dados aplicado na detecção de intrusão [A dynamic data clustering model applied to intrusion detection] Rogério Akiyoshi Furukawa 25 April 2003 Computer security is becoming ever more necessary as the statistics on computer crime grow rapidly. One of the tools used to raise the level of security is the Intrusion Detection System (IDS). The flexibility and usability of these systems have contributed considerably to the protection of computing environments. Because a large share of intrusions follow well-defined behavior patterns in a computer network, data classification and clustering techniques tend to be well suited to solving this kind of problem effectively. This work presents a dynamic clustering model based on a data movement mechanism. Although the clustering technique is applicable to any type of data, in this work the model is applied to intrusion detection. The technique obtained clustering results comparable to those of traditional techniques. In addition, it has some advantages over the traditional techniques investigated, such as multi-scale clustering and no need to determine the initial number of clusters. Data clustering Intrusion detection systems Principal component analysis
46	Partitioning A Graph In Alliances And Its Application To Data Clustering Hassan-Shafique, Khurram 01 January 2004 Any reasonably large group of individuals, families, states, and parties exhibits the phenomenon of subgroup formation, such that the members of each subgroup have strong connections or bonds with each other. The reasons for the formation of these subgroups, which we call alliances, differ from situation to situation and include kinship and friendship (in the case of individuals), common economic interests (for both individuals and states), common political interests, and geographical proximity. This structure of alliances is not only prevalent in social networks, but is also an important characteristic of similarity networks of natural and unnatural objects. (A similarity network defines the links between two objects based on their similarities.) Discovery of such structure in a data set is called clustering or unsupervised learning, and the ability to do it automatically is desirable for many applications in the areas of pattern recognition, computer vision, artificial intelligence, behavioral and social sciences, life sciences, earth sciences, medicine, and information theory. In this dissertation, we study a graph theoretical model of alliances where an alliance of the vertices of a graph is a set of vertices such that every vertex in the set is adjacent to at least as many vertices inside the set as outside it. We study the problem of partitioning a graph into alliances and identify classes of graphs that have such a partition. We present results on the relationship between the existence of such a partition and other well-known graph parameters, such as connectivity, subgraph structure, and degrees of vertices. We also present results on the computational complexity of finding such a partition. 
An alliance cover set is a set of vertices in a graph that contains at least one vertex from every alliance of the graph. The complement of an alliance cover set is an alliance free set, that is, a set that does not contain any alliance as a subset. We study the properties of these sets and present tight bounds on their cardinalities. In addition, we characterize the graphs that can be partitioned into alliance free and alliance cover sets. Finally, we present an approximate algorithm to discover alliances in a given graph. At each step, the algorithm finds a partition of the vertices into two alliances such that the alliances are strongest among all such partitions. (The strength of an alliance is defined as a real number p such that every vertex in the alliance has at least p times as many neighbors in the set as its total number of neighbors in the graph.) We evaluate the performance of the proposed algorithm on standard data sets. vertex partitions data clustering alliances defensive alliances offensive alliances powerful alliances alliance free sets alliance cover sets Computer Sciences Engineering
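The alliance condition and the notion of strength described in this abstract can be checked directly on a small graph. The sketch below follows the wording of the abstract: a vertex needs at least as many neighbors inside the set as outside it, and strength is read as the smallest fraction of a member's neighbors that lie inside the set. That reading of strength, the function names, and the example graph are assumptions, and the dissertation's partitioning algorithm is not reproduced.

```python
def is_alliance(graph, members):
    """True if every member has at least as many neighbors inside the set
    as outside it. graph maps each vertex to a set of its neighbors."""
    members = set(members)
    if not members:
        return False
    for v in members:
        inside = sum(1 for u in graph[v] if u in members)
        outside = len(graph[v]) - inside
        if inside < outside:
            return False
    return True


def strength(graph, members):
    """Assumed reading of strength: the smallest fraction of any member's
    neighbors that lie inside the set (members must have neighbors)."""
    members = set(members)
    return min(
        sum(1 for u in graph[v] if u in members) / len(graph[v])
        for v in members
    )


# Two triangles, {0, 1, 2} and {3, 4, 5}, joined by the single edge 2-3.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(is_alliance(graph, {0, 1, 2}), strength(graph, {0, 1, 2}))
print(is_alliance(graph, {2, 3}))
```

Each triangle qualifies as an alliance, while the bridging pair `{2, 3}` does not, because each of its vertices has more neighbors outside the pair than inside it.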
47	Learning Techniques For Information Retrieval And Mining In High-dimensional Databases Cheng, Hao 01 January 2009 The main focus of my research is to design effective learning techniques for information retrieval and mining in high-dimensional databases. There are two main aspects in the retrieval and mining research: accuracy and efficiency. The accuracy problem is how to return results which can better match the ground truth, and the efficiency problem is how to evaluate users' requests and execute learning algorithms as fast as possible. However, these problems are non-trivial because of the complexity of the high-level semantic concepts, the heterogeneous nature of the feature space, the high dimensionality of data representations and the size of the databases. My dissertation is dedicated to addressing these issues. Specifically, my work has five main contributions as follows. The first contribution is a novel manifold learning algorithm, Local and Global Structures Preserving Projection (LGSPP), which defines salient low-dimensional representations for the high-dimensional data. A small number of projection directions are sought in order to properly preserve the local and global structures of the original data. Specifically, two groups of points are extracted for each individual point in the dataset: the first group contains the nearest neighbors of the point, and the second group consists of a few sampled points far away from the point. These two point sets respectively characterize the local and global structures with regard to the data point. The objective of the embedding is to minimize the distances of the points in each local neighborhood and also to disperse the points far away from their respective remote points in the original space. In this way, the relationships between the data in the original space are well preserved with little distortion. The second contribution is a new constrained clustering algorithm. 
Conventionally, clustering is an unsupervised learning problem, which systematically partitions a dataset into a small set of clusters such that data in each cluster appear similar to each other compared with those in other clusters. In the proposal, the partial human knowledge is exploited to find better clustering results. Two kinds of constraints are integrated into the clustering algorithm. One is the must-link constraint, indicating that the involved two points belong to the same cluster. On the other hand, the cannot-link constraint denotes that two points are not within the same cluster. Given the input constraints, data points are arranged into small groups and a graph is constructed to preserve the semantic relations between these groups. The assignment procedure makes a best effort to assign each group to a feasible cluster without violating the constraints. The theoretical analysis reveals that the probability of data points being assigned to the true clusters is much higher by the new proposal, compared to conventional methods. In general, the new scheme can produce clusters which can better match the ground truth and respect the semantic relations between points inferred from the constraints. The third contribution is a unified framework for partition-based dimension reduction techniques, which allows efficient similarity retrieval in the high-dimensional data space. Recent similarity search techniques, such as Piecewise Aggregate Approximation (PAA), Segmented Means (SMEAN) and Mean-Standard deviation (MS), prove to be very effective in reducing data dimensionality by partitioning dimensions into subsets and extracting aggregate values from each dimension subset. These partition-based techniques have many advantages including very efficient multi-phased pruning while being simple to implement. They, however, are not adaptive to different characteristics of data in diverse applications. 
In this study, a unified framework for these partition-based techniques is proposed and the issue of dimension partitions is examined in this framework. An investigation of the relationship between query selectivity and the dimension partition schemes reveals indicators which can predict the performance of a partitioning setting. Accordingly, a greedy algorithm is designed to effectively determine a good partitioning of data dimensions so that the performance of the reduction technique is robust with regard to different datasets. The fourth contribution is an effective similarity search technique in the database of point sets. In the conventional model, an object corresponds to a single vector. In the proposed study, an object is represented by a set of points. In general, this new representation can be used in many real-world applications and carries much more local information, but the retrieval and learning problems become very challenging. The Hausdorff distance is the common distance function to measure the similarity between two point sets; however, this metric is sensitive to outliers in the data. To address this issue, a novel similarity function is defined to better capture the proximity of two objects, in which a one-to-one mapping is established between vectors of the two objects. The optimal mapping minimizes the sum of distances between the paired points. The overall distance of the optimal matching is robust and has high retrieval accuracy. The computation of the new distance function is formulated as the classical assignment problem. Lower-bounding techniques and an early-stop mechanism are also proposed to significantly accelerate the expensive similarity search process. The classification problem over the point-set data is called Multiple Instance Learning (MIL) in the machine learning community, in which a vector is an instance and an object is a bag of instances. 
The fifth contribution is to convert the MIL problem into a standard supervised learning problem in the conventional vector space. Specifically, feature vectors of bags are grouped into clusters. Each object is then denoted as a bag of cluster labels, and common patterns of each category are discovered, each of which is further reconstructed into a bag of features. Accordingly, a bag is effectively mapped into a feature space defined by the distances from this bag to all the derived patterns. Standard supervised learning algorithms can then be applied to classify objects into pre-defined categories. The results demonstrate that the proposal has better classification accuracy compared to other state-of-the-art techniques. In the future, I will continue my research on large-scale data analysis algorithms, applications and system development. In particular, I am interested in applications that analyze the massive volume of online data. similarity search dimension reduction data clustering constrained clustering manifold learning query processing multiple instance learning Computer Sciences Engineering
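The fourth contribution in this abstract contrasts the Hausdorff distance with a distance based on an optimal one-to-one matching between two point sets. The sketch below illustrates both on tiny inputs. It is a rough illustration, not the dissertation's method: it assumes equal-size sets and finds the optimal matching by brute force over permutations, standing in for a proper assignment-problem solver, and it omits the lower-bounding and early-stop techniques. Function names and data are hypothetical.

```python
from itertools import permutations
from math import dist


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(X, Y):
        # Largest distance from a point of X to its nearest point in Y.
        return max(min(dist(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))


def matching_distance(A, B):
    """Smallest total distance over all one-to-one pairings of A and B.

    Brute force over permutations: only practical for very small sets.
    """
    if len(A) != len(B):
        raise ValueError("this sketch assumes equal-size point sets")
    return min(
        sum(dist(a, b) for a, b in zip(A, perm))
        for perm in permutations(B)
    )


# B is A shifted up by 0.5 and listed in a different order (made-up data).
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(2.0, 0.5), (0.0, 0.5), (1.0, 0.5)]
print(hausdorff(A, B), matching_distance(A, B))
```

The matching distance accumulates the cost of every pair, while the Hausdorff distance is determined by a single worst-case point, which is the source of the outlier sensitivity the abstract mentions.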
48	Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data Carraher, Lee A. 22 May 2018 No description available. Computer Engineering data clustering distributed data mining streaming data algorithms locality sensitive hashing count-min cut tree random projection
49	Complex network component unfolding using a particle competition technique / Desdobramento de componentes de redes complexas utilizando uma técnica de competição de partículas Urio, Paulo Roberto 12 June 2017 This work applies complex network theory to the problem of semi-supervised and unsupervised learning in networks, specifically those that represent multivariate datasets. Complex networks allow the use of nonlinear dynamical systems whose behavior depends on the connectivity patterns of the network. Inspired by behavior observed in nature, such as competition for limited resources, dynamical system models can be employed to uncover the organizational structure of a network. In this dissertation, we develop a technique for classifying data represented as interaction networks. As part of the technique, we model a dynamical system inspired by the biological dynamics of resource competition. So far, similar methods have focused on vertices as the resource of competition. We introduce edges as the resource of competition. In doing so, the connectivity pattern of a network can be used not only in the dynamical system simulation but in the learning task as well. Community detection Complex networks Data clustering Machine learning Semi-supervised learning
50	Análise de agrupamentos baseada na topologia dos dados e em mapas auto-organizáveis. / Data clustering based on data topology and self organizing-maps. Boscarioli, Clodis 16 May 2008 More than ever, in the context of major decision making, the analysis of massively stored data has become a necessity in a wide variety of fields. Data analysis involves different tasks, which can be carried out with different techniques and strategies, such as data clustering analysis. This research focuses on the data clustering task, using Self-Organizing Maps (SOM) as the main tool. A SOM is an artificial neural network based on competitive, unsupervised learning, which means that training is driven entirely by the data and that the neurons of the map compete with one another. This neural network is able to form mappings that quantize the data while preserving their topology. This work introduces a new clustering analysis methodology based on SOM, which takes into account both the topological map the SOM generates and the topology of the data in the clustering process. An experimental and comparative analysis is presented to demonstrate the potential of the proposal, and the main contributions of the work are highlighted at the end. Data clustering Data mining Exploratory data analysis Knowledge discovery Self-organizing Maps (SOM)
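The competitive, unsupervised learning that this abstract attributes to SOM can be sketched in a few lines. This is a generic one-dimensional SOM written under assumptions (a chain of units, a Gaussian neighborhood, linearly decaying learning rate and radius, made-up data); it is not the clustering methodology proposed in the thesis, which additionally exploits the topology of the data.

```python
import random
from math import exp


def best_matching_unit(weights, x):
    """Competition step: index of the unit whose weights are closest to x."""
    return min(
        range(len(weights)),
        key=lambda u: sum((w - xi) ** 2 for w, xi in zip(weights[u], x)),
    )


def train_som(data, n_units, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a 1-D chain of units on data given as a list of tuples."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # shrinks learning rate and radius
        sigma = radius * decay + 1e-3
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for u in range(n_units):
                # Units near the winner on the chain move more than far ones.
                h = exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    weights[u][d] += lr * decay * h * (x[d] - weights[u][d])
    return weights


# Two small groups of points in the unit square (made-up data).
data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.9, 0.9), (0.8, 0.9), (0.9, 0.8)]
weights = train_som(data, n_units=4)
print([best_matching_unit(weights, x) for x in data])
```

The neighborhood term is what ties the map to topology preservation: units that are adjacent on the chain are pulled toward similar inputs, so nearby data tend to be assigned to nearby units.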

Search results