61 |
Categorização de imagens médicas baseada em transformada wavelet e mapas auto-organizáveis. / Medical image categorization based on wavelet transform and self-organizing maps. Leandro Augusto da Silva, 25 March 2009 (has links)
Nos tempos atuais, as imagens médicas são fonte de dados fundamentais na medicina moderna. As imagens armazenadas em uma base de dados de acordo com as respectivas categorias são um importante passo para aplicações como mineração de dados e recuperação de imagens por conteúdo. Estas aplicações podem apoiar médicos e estudantes na decisão de diagnóstico, permitir pesquisas e ser usadas como material didático. O trabalho propõe o uso de Mapas Auto-Organizáveis (SOM) e Transformada Wavelet combinada com momentos de Hu para a categorização de imagens médicas. Para tanto, são realizados experimentos para definição do tamanho do mapa SOM, uso do mesmo na categorização, definição da melhor família wavelet e nível de decomposição, sumarização dos coeficientes wavelets descartados por momento de Hu e experimentos comparativos com outras abordagens de categorização. Além dos experimentos de classificação comparativos em termos de taxa de acerto, é apresentada uma proposta de contribuição para uso do Mapa SOM na classificação. Nesta proposta, os resultados de classificação e o tempo de recurso computacional despendido pelo Mapa SOM mostram-se eficientes, quando comparados aos resultados e tempo apresentados pelo tradicional classificador K vizinhos mais próximos. / Nowadays, images are a fundamental data source in modern medicine. Storing images in a database according to their categories is an important step for applications such as data mining and content-based image retrieval. These applications can support doctors and students in diagnostic decisions and provide research and didactic material. This work addresses the use of Self-Organizing Maps (SOM) and the discrete wavelet transform combined with Hu's moments for medical image categorization.
Furthermore, extensive experiments were carried out to define the map size, to employ the map in categorization, to select the best wavelet family and decomposition level, to summarize the discarded wavelet coefficients with Hu's moments, and to compare the approach with other successful categorization methods. Moreover, an approach to using the SOM map in classification is presented, in which the SOM achieves better classification results and computational time than the traditional K-nearest-neighbor classifier.
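The competitive, data-driven SOM training that both this and the following abstract rely on can be illustrated with a minimal sketch. The toy data, the 3x3 map size and the decay schedules below are illustrative assumptions, not the configuration used in the thesis:

```python
import math
import random

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a tiny Self-Organizing Map with competitive learning.

    Each map neuron holds a weight vector; for every sample the best
    matching unit (BMU) wins, and neurons near it on the grid are pulled
    toward the sample with a strength that decays over time.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialize neuron weights randomly in [0, 1].
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1.0 - frac)              # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.3  # neighborhood radius shrinks
        for x in data:
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Grid distance between this neuron and the BMU.
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

def best_matching_unit(weights, x):
    """Return the grid position of the neuron closest to sample x."""
    return min(weights, key=lambda pos: sum((wi - xi) ** 2
                                            for wi, xi in zip(weights[pos], x)))

# Two well-separated toy "categories" of 2-D feature vectors.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
som = train_som(data)
```

After training, samples from different categories land on different map units, which is what makes the map usable for categorization.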
|
62 |
Análise de agrupamentos baseada na topologia dos dados e em mapas auto-organizáveis. / Data clustering based on data topology and self-organizing maps. Clodis Boscarioli, 16 May 2008 (has links)
Cada vez mais, na conjuntura das grandes tomadas de decisões, a análise de dados massivamente armazenados se torna uma necessidade das mais variadas áreas de conhecimento. A análise de dados envolve a realização de diferentes tarefas, que podem ser realizadas por diferentes técnicas e estratégias como análise de agrupamento de dados. Esta pesquisa enfatiza a realização da tarefa de análise de agrupamento de dados (Data Clustering) usando SOM (Self-Organizing Maps) como principal artefato. SOM é uma rede neural artificial baseada em aprendizado competitivo e não-supervisionado, o que significa que o treinamento é inteiramente guiado pelos dados e que os neurônios do mapa competem entre si. Essa rede neural possui a habilidade de formar mapeamentos que quantizam os dados, preservando a sua topologia. / More than ever, in the context of large-scale decision making, the analysis of massively stored data has become a real need in almost all knowledge areas. Data analysis involves performing different tasks, which can be carried out by different techniques and strategies, such as data clustering analysis. This research focuses on the data clustering task, using Self-Organizing Maps (SOM) as the principal artifact. SOM is an artificial neural network based on competitive and unsupervised learning, which means that training is entirely driven by the data and that the map's neurons compete with one another. This neural network has the ability to form mappings that quantize the data while preserving their topology.
This work introduces a new SOM-based clustering analysis methodology, which considers both the topological map the SOM produces and the topology of the data in the clustering process. An experimental and comparative analysis is presented to demonstrate the potential of the proposal, highlighting, finally, the main contributions of the work.
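One common way to turn a trained SOM into explicit clusters (a simplified two-stage sketch, not the topology-aware methodology of this thesis) is to cluster the map's prototype vectors and let each data point inherit the cluster of its best-matching unit. The codebook values and initial centroids below are invented stand-ins:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init, iters=20):
    """Plain k-means with explicit initial centroids."""
    centroids = [list(c) for c in init]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for each point.
        for i, p in enumerate(points):
            assign[i] = min(range(len(centroids)),
                            key=lambda c: dist2(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign, centroids

# Stand-in codebook: prototype vectors of an already-trained SOM.
codebook = [[0.10, 0.10], [0.20, 0.10], [0.10, 0.20],
            [0.50, 0.50], [0.55, 0.45],
            [0.90, 0.90], [0.85, 0.95], [0.90, 0.80], [0.95, 0.85]]
labels, _ = kmeans(codebook, init=[codebook[0], codebook[4], codebook[8]])
# Neighboring prototypes fall into the same cluster; a data sample is then
# labeled with the cluster of its best-matching prototype.
```

Clustering the small codebook instead of the full dataset is what makes this two-stage scheme cheap; the thesis methodology additionally exploits the map's topology when grouping prototypes.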
|
63 |
Association Pattern Analysis for Pattern Pruning, Clustering and Summarization. Li, Chung Lam, 12 September 2008 (has links)
Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and widely used in various challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data. The patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce amounts of data so huge that they pose a new knowledge management problem: how to tackle the thousands or more patterns discovered and held in a data set. Unlike raw data, patterns often overlap, entangle and interrelate to each other in the databases. The relationships among them are usually complex, and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis, since the patterns to be analyzed must first be discovered.
Due to the overwhelmingly huge volume of discovered patterns in pattern mining, it is virtually impossible for a human user to analyze them manually. The valuable information trapped in the data is thus merely shifted into a large collection of patterns. Hence, automatic analysis of the discovered patterns with user-friendly presentation of the results, that is, pattern post-analysis, is badly needed. This thesis attempts to solve the problems listed below. It addresses 1) the important factors contributing to the interrelationships among patterns, and hence more accurate measurements of the distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining.
In this thesis, the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis of categorical or discrete-valued data is presented. It starts by presenting a natural dual relationship between patterns and data. The relationship furnishes an explicit one-to-one correspondence between a pattern and its associated data, and provides a basis for an effective analysis of patterns by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distances among patterns using a formal graphical approach. To accurately measure the distances between patterns and their associated data, both the samples and the attributes matched by the patterns are considered. To achieve this, the distance measure between patterns has to account for the differences of their associated data clusters at the attribute value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed; they attempt to quantify the uncertainty of the matched items. Such distances render an accurate and robust distance measurement between patterns and their associated data. To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is analytically derived.
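The entropy-based idea, comparing patterns through the value distributions their associated data induce at the item level, can be sketched generically. This is an illustrative Jensen-Shannon-style gap between value distributions, not the exact measure derived in the thesis, and the attribute names and rows are invented:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def pattern_distance(rows_a, rows_b, attributes):
    """Illustrative entropy-based dissimilarity between two patterns,
    each represented by the data rows it matches.

    For every attribute, compare the value distributions induced by the
    two patterns: the entropy of the pooled values minus the average
    entropy of each side is 0 when the distributions coincide and grows
    as the matched items diverge.
    """
    total = 0.0
    for a in attributes:
        va = [r[a] for r in rows_a]
        vb = [r[a] for r in rows_b]
        total += entropy(va + vb) - 0.5 * (entropy(va) + entropy(vb))
    return total / len(attributes)

# Toy matched rows for two patterns over one categorical attribute.
rows_red = [{"color": "red"}, {"color": "red"}]
rows_blue = [{"color": "blue"}, {"color": "blue"}]
```

Identical matched data gives distance 0, while disjoint value distributions give a strictly positive distance, which is the qualitative behavior the thesis' item-level measures are designed to capture.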
The new pattern distances based on the dual pattern-data relationship and their related concepts are used and adapted for pattern pruning, pattern clustering and pattern summarization, furnishing an integrated, flexible and generic framework for pattern post-analysis that is able to meet the challenges of today’s complex real-world problems. In pattern pruning, the system defines the amount of redundancy of a pattern with respect to another pattern at the item level. This definition generalizes classical closed itemset pruning and maximal itemset pruning, which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition. It includes the closed and maximal itemsets as two extreme special cases and provides a control parameter for the user to adjust the tradeoff between the number of patterns pruned and the amount of information lost after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets is also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering and k-means clustering. Hierarchical clustering generates the entire clustering hierarchy, but it is slow and not scalable. K-means clustering produces only a partition, so it is fast and scalable. Together they can handle most real-world situations (i.e. trade off speed and clustering quality). The new clustering method is able to simultaneously cluster patterns as well as their associated data while maintaining an explicit pattern-data relationship. This relationship enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis on a pattern cluster is pattern summarization.
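The two classical endpoints that the generalized pruning interpolates between, closed and maximal itemsets, can be made concrete in a few lines; the itemsets and support counts below are toy values:

```python
def closed_and_maximal(freq):
    """Identify closed and maximal itemsets among frequent itemsets.

    `freq` maps frozenset itemsets to their support counts. An itemset is
    closed if no proper superset has the same support, and maximal if no
    proper superset is frequent at all. The thesis' generalized pruning
    interpolates between these two extremes; only the classical endpoints
    are shown here.
    """
    closed, maximal = set(), set()
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]
        if all(freq[t] != sup for t in supersets):
            closed.add(s)       # no superset explains it with equal support
        if not supersets:
            maximal.add(s)      # no frequent superset at all
    return closed, maximal

# Toy frequent itemsets: {B} is subsumed by {A, B} at the same support.
freq = {
    frozenset("A"): 4,
    frozenset("B"): 3,
    frozenset("AB"): 3,
}
c, m = closed_and_maximal(freq)
```

Here {B} is pruned under closed-itemset pruning (its superset {A, B} has the same support), and everything but {A, B} is pruned under maximal-itemset pruning, showing why maximal pruning discards more patterns at the cost of losing support information.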
In pattern summarization, to summarize each pattern cluster, a subset of representative patterns is selected for the cluster. Again, the system measures how representative a pattern is at the item level and takes into account how the patterns overlap each other. The proposed method, called AreaCover, extends the well-known RuleCover algorithm, and the relationship between the two methods is given. AreaCover is less prone to yielding large, trivial patterns (large patterns may produce a summary that is too general and not informative enough), and the resulting summary is more concise (with fewer duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster with longer summary patterns).
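A RuleCover-style greedy selection can be sketched as follows; AreaCover additionally works at the item level and penalizes overlap differently, so this shows only the shared set-cover core, with invented pattern names:

```python
def greedy_cover(pattern_matches):
    """Greedily select representative patterns for a cluster summary.

    `pattern_matches` maps pattern names to the set of samples (or, for an
    AreaCover-like variant, (sample, attribute) cells) each pattern
    matches. Repeatedly pick the pattern covering the most uncovered
    elements, in the spirit of the RuleCover greedy set-cover heuristic.
    """
    covered, summary = set(), []
    while True:
        best = max(pattern_matches,
                   key=lambda p: len(pattern_matches[p] - covered))
        if not pattern_matches[best] - covered:
            break               # nothing new is covered; stop
        summary.append(best)
        covered |= pattern_matches[best]
    return summary

# Toy cluster: p4 is redundant (a subset of p1) and is never selected.
matches = {
    "p1": {1, 2, 3, 4},
    "p2": {3, 4, 5},
    "p3": {5, 6},
    "p4": {1, 2},
}
print(greedy_cover(matches))  # ['p1', 'p3']
```

The greedy pass keeps the summary small while still describing every matched sample, which is exactly the concise-yet-informative tradeoff the abstract attributes to AreaCover.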
The thesis also covers the implementation of the major ideas outlined in the pattern post-analysis framework in an integrated software system. It ends with a discussion on the experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with the existing systems, the new methodology that this thesis presents stands out, possessing significant and superior characteristics in pattern post-analysis and decision support.
|
65 |
Large Data Clustering And Classification Schemes For Data Mining. Babu, T Ravindra, 12 1900 (has links)
Data Mining deals with extracting valid, novel, easily human-understandable, potentially useful and general abstractions from large data. Data is large when the number of patterns, the number of features per pattern, or both are large. Largeness of data is characterized by a size beyond the capacity of a computer's main memory. Data Mining is an interdisciplinary field involving database systems, statistics, machine learning, visualization and computational aspects. The focus of data mining algorithms is scalability and efficiency. Clustering and classification of large data are important activities in Data Mining. Clustering algorithms are predominantly iterative, requiring multiple scans of the dataset, which is very expensive when the data is stored on disk.
In the current work we propose different schemes that have both theoretical validity and practical utility in dealing with such large data. The schemes broadly encompass data compaction, classification, prototype selection, use of domain knowledge and hybrid intelligent systems. The proposed approaches can be broadly classified as: (a) compressing the data by some means in a non-lossy manner, then clustering as well as classifying the patterns directly in their compressed form through a novel algorithm; (b) compressing the data in a lossy fashion such that a very high degree of compression and abstraction is obtained in terms of 'distinct subsequences', then classifying the data in this compressed form to improve prediction accuracy; (c) with the help of incremental clustering, a lossy compression scheme and a rough set approach, obtaining simultaneous prototype and feature selection; (d) demonstrating that prototype selection and data-dependent techniques can reduce the number of comparisons in a multiclass classification scenario using SVMs; and (e) by making use of domain knowledge of the problem and the data under consideration, showing that we obtain a very high classification accuracy with fewer iterations of AdaBoost.
The schemes have pragmatic utility. The prototype selection algorithm is incremental, requiring a single dataset scan, and has linear time and space requirements. We provide results obtained with a large, high-dimensional handwritten (hw) digit dataset. The compression algorithm is based on simple concepts; we demonstrate that classification of the compressed data reduces the computation time required by a factor of 5, with the prediction accuracy on compressed and original data being exactly the same, 92.47%. With the proposed lossy compression scheme and pruning methods, we demonstrate that even with a reduction of distinct subsequences by a factor of 6 (690 to 106), the prediction accuracy improves. Specifically, with the original data containing 690 distinct subsequences the classification accuracy is 92.47%, and with an appropriate choice of pruning parameters the number of distinct subsequences reduces to 106 with a corresponding classification accuracy of 92.92%. The best classification accuracy of 93.3% is obtained with 452 distinct subsequences. With the scheme of simultaneous feature and prototype selection, we improved classification accuracy beyond that obtained with kNNC, viz., 93.58%, while significantly reducing the number of features and prototypes, achieving a compaction of 45.1%. With hybrid schemes based on SVMs, prototypes and a domain-knowledge-based tree (KB-Tree), we demonstrated a reduction in SVM training time by 50% and testing time by about 30% as compared to the complete data, and an improvement of classification accuracy to 94.75%. With AdaBoost the classification accuracy is 94.48%, which is better than those obtained with NNC and kNNC on the entire data; the training time is reduced because prototypes are used instead of the complete data. Another important aspect of the work is devising a KB-Tree (with a maximum depth of 4) that classifies 10-category data in just 4 comparisons.
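An incremental, single-scan prototype selection with linear time and space, as the abstract describes, is in the spirit of the classical leader algorithm. The sketch below is a generic leader pass, not the thesis' exact algorithm, and the distance threshold is an illustrative assumption:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(a, b))

def leader_prototypes(stream, threshold):
    """Single-pass prototype (leader) selection.

    Scan the data once: a sample within `threshold` of an existing
    prototype is absorbed by it; otherwise it becomes a new prototype.
    Time and space are linear in the number of samples and prototypes,
    and the data never needs a second scan.
    """
    prototypes = []
    for x in stream:
        if not any(dist2(x, p) <= threshold ** 2 for p in prototypes):
            prototypes.append(x)
    return prototypes

# Three well-separated regions yield three prototypes in one pass.
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
protos = leader_prototypes(data, threshold=1.0)
```

Because each sample is seen exactly once, this style of algorithm suits disk-resident data, where multi-scan clustering is expensive.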
In addition to the hw data, we applied the schemes to Network Intrusion Detection data (the 10% dataset of KDDCUP99) and demonstrated that the proposed schemes incur a lower overall cost than the reported values.
|
66 |
Νέοι αλγόριθμοι υπολογιστικής νοημοσύνης και ομαδοποίησης για την εξόρυξη πληροφορίας / New computational intelligence and clustering algorithms for information mining. Τασουλής, Δημήτρης, 10 August 2007 (has links)
Αυτή η Διδακτορική Διατριβή πραγματεύεται το θέμα της ομαδοποίησης δεδομένων (clustering), καθώς και εφαρμογές των τεχνικών αυτών σε πραγματικά προβλήματα. Η παρουσίαση των επιμέρους θεμάτων και αποτελεσμάτων της διατριβής αυτής οργανώνεται ως εξής:
Στο Κεφάλαιο 1 παρέχουμε τον ορισμό της Υπολογιστικής Νοημοσύνης σαν τομέας ερευνάς, και αναλύουμε τα ξεχωριστά τμήματα που τον αποτελούν. Για κάθε ένα από αυτά παρουσιάζεται μια σύντομη περιγραφή.
Το Κεφάλαιο 2, ασχολείται με την ανάλυση του ερευνητικού πεδίου της ομαδοποίησης. Κάθε ένα από τα χαρακτηριστικά της αναλύεται ξεχωριστά και γίνεται μια επισκόπηση των σημαντικότερων αλγόριθμων ομαδοποίησης.
Το Κεφάλαιο 3, αφιερώνεται στη παρουσίαση του αλγορίθμου UKW, που κατά την εκτέλεση του έχει την ικανότητα να προσεγγίζει το πλήθος των ομάδων σε ένα σύνολο δεδομένων. Επίσης παρουσιάζονται πειραματικά αποτελέσματα με σκοπό τη μελέτη της απόδοσης του αλγορίθμου.
Στο Κεφάλαιο 4, προτείνεται μια επέκταση του αλγορίθμου UKW, σε μετρικούς χώρους. Η προτεινόμενη επέκταση διατηρεί όλα τα πλεονεκτήματα του αλγορίθμου UKW. Τα πειραματικά αποτελέσματα που παρουσιάζονται επίσης σε αυτό το κεφάλαιο, συγκρίνουν την προτεινόμενη επέκταση με άλλους αλγορίθμους.
Στο επόμενο κεφάλαιο παρουσιάζουμε τροποποιήσεις του αλγορίθμου με στόχο την βελτίωση των αποτελεσμάτων του. Οι προτεινόμενες τροποποιήσεις αξιοποιούν πληροφορία από τα τοπικά χαρακτηριστικά των δεδομένων, ώστε να κατευθύνουν όσο το δυνατόν καλύτερα την αλγοριθμική διαδικασία.
Το Κεφάλαιο 6, πραγματεύεται επεκτάσεις του αλγορίθμου σε κατανεμημένες Βάσεις δεδομένων. Για τις διάφορες υποθέσεις που μπορούν να γίνουν όσον αφορά τη φύση του περιβάλλοντος επικοινωνίας, παρουσιάζονται κατάλληλοι αλγόριθμοι.
Στο Κεφάλαιο 7, εξετάζουμε την περίπτωση δυναμικών βάσεων δεδομένων. Σε ένα τέτοιο μη στατικό περιβάλλον αναπτύσσεται μια επέκταση του αλγορίθμου UKW, που ενσωματώνει τη δυναμική δομή δεικτοδότησης Bkd-tree. Επιπλέον παρουσιάζονται θεωρητικά αποτελέσματα για την πολυπλοκότητα χειρότερης περίπτωσης του αλγορίθμου.
Το Κεφάλαιο 8, μελετά την εφαρμογή αλγορίθμων ομαδοποίησης σε δεδομένα γονιδιακών εκφράσεων. Επίσης προτείνεται και αξιολογείται ένα υβριδικό σχήμα που καταφέρνει να αυτοματοποιήσει την όλη διαδικασία επιλογής γονιδίων και ομαδοποίησης.
Τέλος, η παρουσίαση του ερευνητικού έργου αυτής της διατριβής ολοκληρώνεται στο Κεφάλαιο 9 που ασχολείται με την ανάπτυξη υβριδικών τεχνικών που συνδυάζουν την ομαδοποίηση και τα Τεχνητά Νευρωνικά Δίκτυα, και αναδεικνύει τις δυνατότητες τους σε δύο πραγματικά προβλήματα. / This Doctoral Dissertation addresses the issue of data clustering, as well as applications of such methods to real-world problems. The presentation of the individual results of this dissertation is organized as follows:
In Chapter 1, we define Computational Intelligence as a research area and analyze its constituent parts; a short description is supplied for each of them.
Chapter 2 deals with the analysis of the research area of clustering per se; each of its characteristics is analyzed separately. Moreover, we provide a review of the most representative clustering algorithms.
Chapter 3 is devoted to the presentation of the UKW algorithm, which is able to endogenously approximate the number of clusters in a dataset during its execution. Furthermore, the included experimental results demonstrate the algorithm's efficiency.
In Chapter 4, an extension of the UKW algorithm to metric spaces is proposed. This extension preserves all the advantages of the original algorithm. The included experimental results compare the proposed extension to other approaches.
In the next chapter we present modifications of the UKW algorithm that aim to improve its results. The proposed modifications exploit information from the local characteristics of the data so as to direct the clustering procedure as effectively as possible.
Chapter 6 deals with extensions of the algorithm to distributed databases. For the various assumptions that can be made about the nature of the communication environment, appropriate algorithms are presented.
In Chapter 7, we consider the case of dynamic databases. For such a non-static environment, an extension of the UKW algorithm is developed that embodies the dynamic Bkd-tree indexing structure. Moreover, theoretical results are presented regarding the worst-case complexity of the algorithm.
Chapter 8 studies the application of clustering algorithms to gene-expression data. In addition, a hybrid scheme that automates the whole procedure of gene selection and clustering is proposed and evaluated.
Finally, the presentation of the research work of this dissertation concludes in Chapter 9, which is devoted to the development of hybrid techniques that combine clustering methods and Artificial Neural Networks, and demonstrates their abilities on two real-world problems.
|
67 |
IT žinių portalo statistikos modulis pagrįstas grupavimu / Portal Statistics Module Based on Clustering. Ruzgys, Martynas, 16 August 2007 (has links)
Pristatomas duomenų gavybos ir grupavimo naudojimas paplitusiose sistemose bei sukurtas IT žinių portalo statistikos prototipas duomenų saugojimui, analizei ir peržiūrai atlikti. Siūlomas statistikos modulis duomenų saugykloje periodiškais laiko momentais vykdantis duomenų transformacijas. Portale prieinami statistiniai duomenys gali būti grupuoti. Sugrupuotą informaciją pateikus grafiškai, duomenys gali būti interpretuojami ir stebimi veiklos mastai. Panašių objektų grupėms išskirti pritaikytas vienas iš žinomiausių duomenų grupavimo metodų – lygiagretusis k-vidurkių metodas. / We present the use of data mining and clustering methods in current statistical systems, together with a statistics module prototype created for data storage, analysis and visualization in an IT knowledge portal. In the suggested statistics prototype, periodic data transformations are performed in the data warehouse. The statistical data accessible in the portal can be clustered. When the clustered information is presented graphically, the data can be interpreted and activity trends observed. One of the best-known data clustering methods, the parallel k-means method, is adapted to separate groups of similar objects.
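What makes k-means parallelizable is that one update step decomposes over data chunks: each worker computes per-centroid partial sums, and a reduce step merges them. The sketch below is a schematic of that decomposition, with sequential "workers" and invented toy data, not the portal's actual implementation:

```python
def partial_stats(chunk, centroids):
    """Map step: per-chunk sums and counts for each nearest centroid."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        c = min(range(k), key=lambda j: sum((pi - ci) ** 2
                                            for pi, ci in zip(p, centroids[j])))
        counts[c] += 1
        for i in range(dim):
            sums[c][i] += p[i]
    return sums, counts

def merge_update(parts, centroids):
    """Reduce step: merge partial sums from all chunks, recompute centroids."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for s, c in parts:
        for j in range(k):
            counts[j] += c[j]
            for i in range(dim):
                sums[j][i] += s[j][i]
    return [[sums[j][i] / counts[j] if counts[j] else centroids[j][i]
             for i in range(dim)] for j in range(k)]

chunks = [[[0.0, 0.0], [0.2, 0.0]], [[4.0, 4.0], [4.2, 4.0]]]  # two "workers"
centroids = [[0.0, 0.0], [4.0, 4.0]]
parts = [partial_stats(ch, centroids) for ch in chunks]
centroids = merge_update(parts, centroids)
# Centroids move to the chunk means, roughly [0.1, 0.0] and [4.1, 4.0].
```

In a real deployment, the map step would run concurrently on separate workers or database partitions; iterating map and reduce steps until the centroids stop moving completes the algorithm.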
|
68 |
Suivi visuel multi-cibles par partitionnement de détections : application à la construction d'albums de visages / Multi-target visual tracking by detection clustering: application to the construction of face albums. Schwab, Siméon, 08 July 2013 (has links)
Ce mémoire décrit mes travaux de thèse menés au sein de l'équipe ComSee (Computers that See) rattachée à l'axe ISPR (Image, Systèmes de Perception et Robotique) de l'Institut Pascal. Celle-ci a été financée par la société Vesalis par le biais d'une convention CIFRE avec l'Institut Pascal, subventionnée par l'ANRT (Association Nationale de la Recherche et de la Technologie). Les travaux de thèse s'inscrivent dans le cadre de l'automatisation de la fouille d'archives vidéo intervenant lors d'enquêtes policières. L'application rattachée à cette thèse concerne la création automatique d'un album photo des individus apparaissant sur une séquence de vidéosurveillance. En s'appuyant sur un détecteur de visages, l'objectif est de regrouper par identité les visages détectés sur l'ensemble d'une séquence vidéo. Comme la reconnaissance faciale en environnement non-contrôlé reste difficilement exploitable, les travaux se sont orientés vers le suivi visuel multi-cibles global basé détections. Ce type de suivi est relativement récent. Il fait intervenir un détecteur d'objets et traite la vidéo dans son ensemble (en opposition au traitement séquentiel couramment utilisé). Cette problématique a été représentée par un modèle probabiliste de type Maximum A Posteriori. La recherche de ce maximum fait intervenir un algorithme de circulation de flot sur un graphe, issu de travaux antérieurs. Ceci permet l'obtention d'une solution optimale au problème (défini par l'a posteriori) du regroupement des détections pour le suivi. L'accent a particulièrement été mis sur la représentation de la similarité entre les détections qui s'intègre dans le terme de vraisemblance du modèle. Plusieurs mesures de similarités s'appuyant sur différents indices (temps, position dans l'image, apparence et mouvement local) ont été testées. 
Une méthode originale d'estimation de ces similarités entre les visages détectés a été développée pour fusionner les différentes informations et s'adapter à la situation rencontrée. Plusieurs expérimentations ont été menées sur des situations complexes, mais réalistes, de scènes de vidéosurveillance. Même si les qualités des albums construits ne satisfont pas encore à une utilisation pratique, le système de regroupement de détections mis en œuvre au cours de cette thèse donne déjà une première solution. Grâce au point de vue partitionnement de données adopté au cours de cette thèse, le suivi multi-cibles développé permet une extension simple à du suivi autre que celui des visages. / This report describes my thesis work conducted within the ComSee (Computers That See) team related to the ISPR axis (ImageS, Perception Systems and Robotics) of Institut Pascal. It was financed by the Vesalis company via a CIFRE (Research Training in Industry Convention) agreement with Institut Pascal and publicly funded by ANRT (National Association of Research and Technology). The thesis was motivated by issues related to automation of video analysis encountered during police investigations. The theoretical research carried out in this thesis is applied to the automatic creation of a photo album summarizing people appearing in a CCTV sequence. Using a face detector, the aim is to group by identity all the faces detected throughout the whole video sequence. As the use of facial recognition techniques in unconstrained environments remains unreliable, we have focused instead on global multi-target tracking based on detections. This type of tracking is relatively recent. It involves an object detector and global processing of the video (as opposed to sequential processing commonly used). This issue has been represented by a Maximum A Posteriori probabilistic model. 
To find the optimal solution of the Maximum A Posteriori formulation, we use a graph-based network-flow algorithm drawn from previous work, which yields an optimal solution to the detection-clustering problem defined by the posterior. The study concentrates on the definition of inter-detection similarities, which enter the likelihood term of the model. Multiple similarity measures based on different cues (time, position in the image, appearance and local motion) were tested. An original method for estimating these similarities was developed to fuse these various cues and adapt to the situation at hand. Several experiments were conducted on challenging but realistic video-surveillance scenes. Although the quality of the generated albums does not yet meet the requirements of practical use, the detection-clustering system developed in this thesis already provides a first solution. Thanks to the data-clustering point of view adopted in this thesis, the proposed detection-based multi-target tracking extends easily to targets other than faces.
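The cue fusion and detection grouping described above can be illustrated with a small sketch. This is not the thesis's flow-network optimizer: instead of searching for the MAP solution over a graph, it greedily attaches each detection to the most similar existing track. The detection fields (`frame`, `pos`, `feat`), the exponential decay constants and the cue weights are all illustrative assumptions.

```python
import math

def similarity(d1, d2, w_time=1.0, w_pos=1.0, w_app=1.0):
    """Fuse time, position and appearance cues into one score in (0, 1].

    The decay constants (25 frames, 100 pixels) are arbitrary choices for
    this sketch; a real system would calibrate or learn them.
    """
    dt = abs(d1["frame"] - d2["frame"])     # time gap (frames)
    dp = math.dist(d1["pos"], d2["pos"])    # image-plane distance
    da = math.dist(d1["feat"], d2["feat"])  # appearance-feature distance
    return math.exp(-(w_time * dt / 25 + w_pos * dp / 100 + w_app * da))

def link_detections(dets, threshold=0.5):
    """Greedy grouping: each detection joins the track whose last detection
    is most similar to it, or starts a new track when no similarity
    exceeds the threshold."""
    tracks = []  # each track is a chronologically ordered list of detections
    for d in sorted(dets, key=lambda x: x["frame"]):
        best, best_sim = None, threshold
        for t in tracks:
            s = similarity(t[-1], d)
            if s > best_sim:
                best, best_sim = t, s
        if best is None:
            tracks.append([d])
        else:
            best.append(d)
    return tracks
```

On a toy sequence with two well-separated identities, `link_detections` returns two tracks; unlike the global flow formulation of the thesis, this greedy variant offers no optimality guarantee.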
|
69 |
Agrupamento de dados semissupervisionado na geração de regras fuzzy Lopes, Priscilla de Abreu 27 August 2010 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Inductive learning is traditionally categorized as supervised or unsupervised.
In supervised learning, the learning method is given a labeled data set (the
classes of the data are known). Such data sets are suitable for classification
and regression problems. In unsupervised learning, unlabeled data are analyzed
in order to identify structures embedded in the data set.
Typically, clustering methods make no use of prior knowledge, such as class
labels, to perform their task. The characteristics of current data sets,
namely large volume and mixed attribute structures, motivate research on
better solutions for machine learning tasks.
The proposed research fits into this context. It concerns semi-supervised
fuzzy clustering applied to the generation of sets of fuzzy rules.
Semi-supervised clustering incorporates some prior knowledge about the data
set into the clustering process. The clustering results are then used to
label the remaining unlabeled data in the set. After that, supervised
learning algorithms take over, aimed at generating fuzzy rules.
This document presents theoretical concepts that help in understanding the
research proposal, and a discussion of the context in which the proposal
fits. Experiments were set up to show that this may be an interesting
solution for machine learning tasks hindered by the lack of available
information about the data. / O aprendizado indutivo é, tradicionalmente, dividido em supervisionado e não
supervisionado. No aprendizado supervisionado é fornecido ao método de aprendizado
um conjunto de dados rotulados (dados que têm a classe conhecida). Estes
dados são adequados para problemas de classificação e regressão. No aprendizado
não supervisionado são analisados dados não rotulados, com o objetivo de
identificar estruturas embutidas no conjunto.
Tipicamente, métodos de agrupamento não se utilizam de conhecimento prévio,
como rótulos de classes, para desempenhar sua tarefa. As características dos conjuntos
de dados atuais, grande volume e estruturas de atributos mistas, contribuem
para a busca de melhores soluções para tarefas de aprendizado de máquina.
É neste contexto em que se encaixa esta proposta de pesquisa. Trata-se da
aplicação de métodos de agrupamento fuzzy semi-supervisionados na geração de
bases de regras fuzzy. Os métodos de agrupamento semi-supervisionados realizam
sua tarefa incorporando algum conhecimento prévio a respeito do conjunto de dados.
O resultado do agrupamento é, então, utilizado para rotulação do restante do
conjunto. Em seguida, entram em ação algoritmos de aprendizado supervisionado
que têm como objetivo gerar regras fuzzy.
Este documento contém conceitos teóricos para compreensão da proposta de
trabalho e uma discussão a respeito do contexto onde se encaixa a proposta. Alguns
experimentos foram realizados a fim de mostrar que esta pode ser uma solução
interessante para tarefas de aprendizado de máquina que encontram dificuldades
devido à falta de informação disponível sobre dados.
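The pipeline both abstracts describe can be sketched under simplifying assumptions: prior knowledge enters as a handful of labeled points that seed the clustering, the clustering labels the remaining points, and each resulting cluster is turned into one rule. Crisp per-attribute intervals stand in for the fuzzy sets a real system would fit, and the seeded k-means variant, function names and parameters are illustrative, not the dissertation's actual method.

```python
import math
from collections import defaultdict

def _mean(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def seeded_kmeans(points, labels, n_iter=20):
    """Semi-supervised clustering: labeled points (label != None) seed the
    centroids, and their assignments are never changed during iteration."""
    seeds = defaultdict(list)
    for p, y in zip(points, labels):
        if y is not None:
            seeds[y].append(p)
    classes = sorted(seeds)
    centroids = {c: _mean(seeds[c]) for c in classes}
    assign = list(labels)
    for _ in range(n_iter):
        for i, (p, y) in enumerate(zip(points, labels)):
            if y is None:  # only unlabeled points are re-assigned
                assign[i] = min(classes, key=lambda c: math.dist(p, centroids[c]))
        for c in classes:
            members = [p for p, a in zip(points, assign) if a == c]
            centroids[c] = _mean(members)
    return assign, centroids

def interval_rules(points, assign):
    """Turn each cluster into one rule: a per-attribute [min, max] interval
    (a crisp stand-in for the triangular fuzzy sets a real generator would fit)."""
    rules = {}
    for c in set(assign):
        members = [p for p, a in zip(points, assign) if a == c]
        rules[c] = [(min(col), max(col)) for col in zip(*members)]
    return rules
```

A reading of a resulting rule: `IF x1 in [lo1, hi1] AND x2 in [lo2, hi2] THEN class c`; fuzzifying the intervals would yield the rule bases the dissertation targets.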
|
70 |
MCAC - Monte Carlo Ant Colony: um novo algoritmo estocástico de agrupamento de dados AGUIAR, José Domingos Albuquerque 29 February 2008 (has links)
In this work we present a new data clustering algorithm, based on the social behavior of ants, that applies Monte Carlo simulation to select the maximum path length of the ants. We compare the performance of the new method with the popular k-means and with another algorithm also inspired by social ant behavior. For the comparative study we employed three real-world data sets, three deterministic artificial data sets and two randomly generated data sets, eight data sets in total. We find that the new algorithm outperforms the others in all studied cases but one. We also address the question of the right number of groups in a particular data set. Our results show that the proposed algorithm yields a good estimate of the number of groups present in the data set. / Esta dissertação apresenta um algoritmo inédito de agrupamento de dados que tem como fundamentos o método de Monte Carlo e uma heurística baseada no comportamento social das formigas, conhecida como Otimização por Colônias de Formigas. Neste trabalho realizou-se um estudo comparativo do novo algoritmo com outros dois algoritmos de agrupamento de dados. O primeiro é o K-Médias, muito conhecido entre os pesquisadores. O segundo é um algoritmo que utiliza a Otimização por Colônias de Formigas juntamente com um híbrido de outros métodos de otimização. Para a implementação desse estudo comparativo utilizaram-se oito conjuntos de dados, sendo três conjuntos de dados reais, dois artificiais gerados deterministicamente e três artificiais gerados aleatoriamente. Os resultados do estudo comparativo demonstram que o novo algoritmo identifica padrões nas massas de dados com desempenho igual ou superior ao dos outros dois algoritmos avaliados. Neste trabalho investigou-se também a capacidade do novo algoritmo de identificar o número de grupos existentes nos conjuntos de dados. 
Os resultados dessa investigação mostram que o novo algoritmo é capaz de identificar o número provável de grupos existentes dentro do conjunto de dados.
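MCAC itself is not reproduced here. As a generic illustration of the abstract's final claim, estimating the likely number of groups, the sketch below runs plain k-means for several candidate values of k and scores each partition with the mean silhouette width, keeping the best-scoring k. All function names, the restart scheme and the candidate range are illustrative assumptions, not the dissertation's algorithm.

```python
import math
import random

def kmeans(points, k, n_iter=30, seed=0):
    """Plain Lloyd's k-means; centroids are initialized at sampled data points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def silhouette(points, assign):
    """Mean silhouette width in [-1, 1]; higher means tighter, better-separated
    clusters. Assumes no duplicate points (avoids a 0/0 in the ratio)."""
    labels = set(assign)
    if len(labels) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if assign[j] == assign[i] and j != i]
        a = sum(own) / len(own) if own else 0.0
        b = min(
            math.fsum(math.dist(p, q) for j, q in enumerate(points) if assign[j] == c)
            / assign.count(c)
            for c in labels if c != assign[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

def estimate_k(points, k_range=range(2, 6), restarts=5):
    """Pick the k whose best k-means run (over several seeds) scores highest."""
    def best_score(k):
        return max(silhouette(points, kmeans(points, k, seed=s))
                   for s in range(restarts))
    return max(k_range, key=best_score)
```

The silhouette criterion is one of many cluster-validity indices; the dissertation's Monte Carlo ant-colony mechanism arrives at the group count by a different route.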
|