11. Data Mining in a Multidimensional Environment

Günzel, Holger; Albrecht, Jens; Lehner, Wolfgang. 12 January 2023.
Data mining and data warehousing are two hot topics in database research. Conventional data mining algorithms have so far been developed primarily for a relational environment, whereas a data warehouse is based on a multidimensional model. In this paper we use this model as the basis for a seamless integration of data mining into the multidimensional environment, taking the discovery of association rules as an example. Because of the clear structure of both the model and the data, we propose the method as a user-guided technique. We present the theoretical basis as well as efficient algorithms for data mining in the multidimensional data model. Our approach directly exploits the dimensions, classifications and sparsity of the cube. Additionally, we give heuristics for optimizing the search for rules.
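To make the setting concrete, here is a minimal Python sketch of support counting for association mining over cube cells, where the dimension values of each cell act as a transaction. The cells, items and threshold are hypothetical, and this is only an illustration of the general idea, not the paper's algorithms:

```python
from itertools import combinations

# Hypothetical cube cells: each cell's dimension values act as one transaction.
cells = [
    {"region=EU", "product=TV", "quarter=Q1"},
    {"region=EU", "product=TV", "quarter=Q2"},
    {"region=US", "product=TV", "quarter=Q1"},
    {"region=EU", "product=Radio", "quarter=Q1"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

min_sup = 0.5
items = sorted(set().union(*cells))
frequent = [set(c) for k in (1, 2) for c in combinations(items, k)
            if support(set(c), cells) >= min_sup]
print(frequent)  # e.g. {'region=EU'}, {'product=TV'}, {'quarter=Q1'}, ...
```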
12. Frequent Itemset Hiding Algorithm Using Frequent Pattern Tree Approach

Alnatsheh, Rami H. 01 January 2012.
A problem that has been the focus of much recent research in privacy-preserving data mining is the frequent itemset hiding (FIH) problem. Identifying itemsets that appear together frequently in customer transactions is a common task in association rule mining. Organizations that share data with business partners may consider some of the frequent itemsets sensitive and aim to hide them by removing items from certain transactions. Since such modifications adversely affect the utility of the database for data mining applications, the goal is to remove as few items as possible. Because the frequent itemset hiding problem is NP-hard and practical instances are too large to be solved optimally, heuristic methods that provide good solutions are needed. This dissertation develops a new method, called Min_Items_Removed, using the Frequent Pattern Tree (FP-Tree), that outperforms extant methods for the FIH problem. The FP-Tree compresses large databases into significantly smaller data structures, so searches can be performed with increased speed and efficiency. To evaluate the effectiveness and performance of the Min_Items_Removed algorithm, eight experiments were conducted. The results showed that the Min_Items_Removed algorithm yields better-quality solutions than extant methods in terms of minimizing the number of removed items. In addition, the results showed that the newly introduced metric (normalized number of leaves) is a very good indicator of the size or difficulty of a problem instance that is independent of the number of sensitive itemsets.
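The flavor of the underlying problem can be conveyed with a naive greedy sketch that hides a sensitive itemset by deleting its items from supporting transactions until support drops below a threshold. This is a hypothetical baseline for illustration, not the Min_Items_Removed algorithm:

```python
def hide_itemset(transactions, sensitive, min_sup_count):
    sensitive = set(sensitive)
    while True:
        supporting = [t for t in transactions if sensitive <= t]
        if len(supporting) < min_sup_count:
            return transactions          # the sensitive itemset is now hidden
        # Naive victim/item choice; real heuristics pick these to minimize
        # the damage to database utility.
        supporting[0].discard(next(iter(sensitive)))

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"b", "c"}]
hide_itemset(db, {"a", "b"}, min_sup_count=2)
print(db)  # at most one transaction still contains both "a" and "b"
```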
13. Mining Frequent Itemsets in Data Streams: An Efficient Implementation Using Bit Vectors

Amorim, Franklin Anderson de. 11 February 2016.
The mining of frequent itemsets in data streams has several practical applications, such as user behavior analysis, software testing and market research. Nevertheless, the massive amount of data generated may pose an obstacle to processing it in real time and, consequently, to analysis and decision making. Thus, improvements in the efficiency of the algorithms used for these purposes may bring great benefits to the systems that depend on them. This thesis presents the MFI-TransSWplus algorithm, an optimized version of the MFI-TransSW algorithm, which uses bit vectors to process data streams in real time. In addition, the thesis describes the implementation of a news article recommendation system, called ClickRec, based on MFI-TransSWplus, to demonstrate the use of the new version of the algorithm. Finally, the thesis describes experiments with real data and presents a comparison of the two algorithms in terms of performance, together with the hit rate of the ClickRec recommendation system.
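The bit-vector idea behind MFI-TransSW can be sketched as follows: each item keeps one bit per transaction in the current window, and the support of an itemset is the popcount of the AND of its items' vectors. A simplified Python illustration with made-up data, not the optimized MFI-TransSWplus implementation:

```python
# Each item keeps one bit per transaction in the window (hypothetical data).
window = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a"}]

bits = {}
for pos, transaction in enumerate(window):
    for item in transaction:
        bits[item] = bits.get(item, 0) | (1 << pos)

def itemset_support(itemset):
    # Support = number of set bits in the AND of the items' bit vectors.
    v = (1 << len(window)) - 1          # start from an all-ones window mask
    for item in itemset:
        v &= bits.get(item, 0)
    return bin(v).count("1")

print(itemset_support({"a", "b"}))  # 2: transactions at positions 0 and 2
```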
14. Knowledge Extraction: Bringing Together Large Data Volumes and Significant Patterns

Masseglia, Florent. 27 November 2009.
The analysis and mining of usage data are inseparable from the notion of dynamic evolution. Consider the case of Web sites, for example. The dynamism of usage is tied to the dynamism of the pages it concerns. If a page is created and is of interest to users, it will be visited. If the page becomes outdated, visits will decline or disappear. This is the case, for instance, with the Web pages of scientific conferences, which see successive peaks of visits when the call for papers is issued, then on the abstract submission deadline, then on the paper submission deadline. In this habilitation (HDR) thesis, I present a synthesis of the work I have supervised or co-supervised, based on excerpts from the resulting publications. The first contribution concerns the difficulties of a data mining process based on minimum support, which stem in particular from the very low support values at which useful knowledge begins to appear. I then propose three variations on this notion of evolution in usage analysis: evolution as knowledge (patterns that express evolution); evolution of the data (in particular in the processing of data streams); and evolution of malicious behaviors and of defense techniques.
15. Applying Process Mining to the Academic Administration Domain

Guillot Jiménez, Haydée. 12 December 2017.
Higher education institutions keep a sizable amount of data, including student records and the structure of degree curricula. This work, adopting a process mining approach, focuses on the problem of identifying how closely students follow the recommended order of the courses in a degree curriculum, and to what extent their performance is affected by the order they actually adopt. It addresses this problem by applying two existing techniques to student records: process discovery with conformance checking, and frequent itemsets. Finally, the dissertation covers experiments performed by applying these techniques to a case study involving over 60,000 student records from PUC-Rio. The experiments show that the frequent itemsets technique performs better than the process discovery and conformance checking techniques. They also confirm the relevance of analyses based on the process mining approach for helping academic coordinators in their quest for better degree curricula.
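As a toy illustration of the conformance question (how closely students follow the recommended course order), one can count out-of-order course pairs, as in the Python sketch below. The course names are hypothetical, and the thesis itself relies on full process-discovery and frequent-itemset techniques rather than this simple score:

```python
# Hypothetical curricula: score how closely the actual course order follows
# the recommended one by counting out-of-order (inverted) course pairs.
recommended = ["Calculus I", "Calculus II", "Linear Algebra", "Probability"]
taken       = ["Calculus I", "Linear Algebra", "Calculus II", "Probability"]

rank = {course: i for i, course in enumerate(recommended)}
pairs = [(a, b) for i, a in enumerate(taken) for b in taken[i + 1:]]
inversions = sum(rank[a] > rank[b] for a, b in pairs)
print(f"{inversions} inversion(s) out of {len(pairs)} ordered pairs")  # 1 of 6
```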
16. Algorithms for Data Mining and Bio-informatics

Mondal, Kartick Chandra. 12 July 2013.
Knowledge pattern extraction is one of the major topics in data mining and background knowledge integration. Among data mining techniques, association rule mining and bi-clustering are two major complementary tasks that have gained much importance in many domains in recent years. However, no approach has been proposed to perform them in one process. Performing the extractions independently poses problems of required resources (memory, execution time and data accesses) and of unifying the different results. We propose an original approach for extracting different categories of knowledge patterns while using minimum resources. This approach is based on the frequent closed patterns framework and uses a novel suffix-tree-based data structure to extract conceptual minimal representations of association rules, bi-clusters and classification rules. These patterns extend the classical frameworks of association rules, classification rules and bi-clusters, as the data objects supporting each pattern and the hierarchical relationships between patterns are also extracted. The approach was applied to the analysis of HIV-1 and human protein-protein interaction data. Analyzing such inter-species protein interactions is a recent major challenge in computational biology, and databases integrating heterogeneous interaction information and biological background knowledge on proteins have been constructed. Experimental results show that the proposed approach can efficiently process these databases and that the extracted conceptual patterns can help in understanding and analyzing the nature of relationships between interacting proteins.
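The frequent closed pattern notion underlying the approach can be illustrated briefly: the closure of an itemset is the intersection of all transactions containing it, and an itemset is closed when it equals its own closure. A minimal Python sketch on toy data, far short of the thesis's suffix-tree machinery:

```python
# Toy transaction database; the closure of an itemset is the intersection of
# all transactions that contain it (assumes the itemset occurs at least once).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]

def closure(itemset, transactions):
    supporting = [t for t in transactions if itemset <= t]
    result = set(supporting[0])
    for t in supporting[1:]:
        result &= t
    return result

print(closure({"a", "b"}, db))  # {'a', 'b'}: equal to itself, so closed
print(closure({"b"}, db))       # {'a', 'b'}: {'b'} is not closed
```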
17. Association Pattern Analysis for Pattern Pruning, Clustering and Summarization

Li, Chung Lam. 12 September 2008.
Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and widely used in various challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data. The patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce such huge amounts of output that it poses a new knowledge management problem: tackling the thousands or more patterns discovered and held in a data set. Unlike raw data, patterns often overlap, entangle and interrelate with each other in the databases. The relationships among them are usually complex, and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis, since the patterns to be analyzed are first discovered. Due to the overwhelmingly huge volume of discovered patterns, it is virtually impossible for a human user to analyze them manually; the valuable information trapped in the data is merely shifted to a large collection of patterns. Hence, pattern post-analysis, which automatically analyzes the discovered patterns and presents the results in a user-friendly manner, is badly needed. This thesis addresses 1) the important factors contributing to the interrelating relationships among patterns, and hence more accurate measurements of distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining. The thesis presents the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis for categorical or discrete-valued data. It starts by presenting a natural dual relationship between patterns and data. The relationship furnishes an explicit one-to-one correspondence between a pattern and its associated data and provides a basis for effective analysis of patterns by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distances among patterns using a formal graphical approach. To accurately measure the distances between patterns and their associated data, both the samples and the attributes matched by the patterns are considered; the distance measure between patterns has to account for the differences of their associated data clusters at the attribute value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed. These measures attempt to quantify the uncertainty of the matched items. Such distances render an accurate and robust distance measurement between patterns and their associated data.
To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is analytically derived. The new pattern distances based on the dual pattern-data relationship and their related concepts are then adapted to pattern pruning, pattern clustering and pattern summarization, furnishing an integrated, flexible and generic framework for pattern post-analysis that is able to meet the challenges of today's complex real-world problems. In pattern pruning, the system defines the amount of redundancy of a pattern with respect to another pattern at the item level. This definition generalizes classical closed itemset pruning and maximal itemset pruning, which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition. It includes the closed and maximal itemsets as two extreme special cases and provides a control parameter for the user to adjust the tradeoff between the number of patterns pruned and the amount of information lost after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets is also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering and k-means clustering. Hierarchical clustering generates the entire clustering hierarchy but is slow and not scalable; k-means clustering produces only a partition, so it is fast and scalable. Together they cover most real-world requirements in terms of speed and clustering quality. The new clustering method is able to simultaneously cluster patterns as well as their associated data while maintaining an explicit pattern-data relationship. This relationship enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis on a pattern cluster is pattern summarization. To summarize each pattern cluster, a subset of representative patterns is selected for the cluster. Again, the system measures how representative a pattern is at the item level and takes into account how the patterns overlap each other. The proposed method, called AreaCover, extends the well-known RuleCover algorithm, and the relationship between the two methods is given. AreaCover is less prone to yielding large, trivial patterns (which may produce a summary that is too general and not informative enough), and the resulting summary is more concise (with fewer duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster with longer summary patterns). The thesis also covers the implementation of the major ideas outlined in the pattern post-analysis framework in an integrated software system. It ends with a discussion of the experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with existing systems, the new methodology presented in this thesis stands out, possessing significant and superior characteristics in pattern post-analysis and decision support.
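As a small illustration of the entropy idea behind the item-level distances, the sketch below scores how much the attribute values matched by a pattern vary. The counts are hypothetical, and the thesis's actual distance measures are considerably richer:

```python
import math

def entropy(value_counts):
    # Shannon entropy (in bits) of a distribution given by raw counts.
    total = sum(value_counts)
    return -sum((c / total) * math.log2(c / total) for c in value_counts if c)

# Hypothetical counts of attribute values among records matched by a pattern.
print(round(entropy([8, 2]), 2))  # 0.72 bits: fairly homogeneous matches
print(round(entropy([5, 5]), 2))  # 1.0 bit: maximally varied matches
```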
19. Gradual Pattern Mining and Application to Medical Data

Di Jorio, Lisa. 05 October 2010.
With the development of new analysis technologies (such as DNA microarrays) and of information management (growing storage capacities), the health domain has evolved considerably in recent years. Increasingly advanced and effective techniques are available to researchers, enabling in-depth study of the genomic parameters involved in various health problems (cancer, Alzheimer's disease, ...) and their correlation with clinical parameters. In parallel, the evolution of storage capacities now makes it possible to accumulate the mass of information generated by the various experiments. Advances in medicine and prevention thus depend on a complete and relevant analysis of this quantity of data. The work in this thesis is set in this medical context. We are particularly interested in the automatic extraction of gradual patterns, which highlight covariations between attributes of the form "the older a patient, the less precise his memories". We describe various types of gradual patterns, such as gradual itemsets, gradual multidimensional itemsets and gradual sequential patterns, together with the semantics associated with these patterns. Each of our approaches is tested on a synthetic and/or real data set.
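The support of a gradual pattern such as "the older a patient, the less precise his memories" can be sketched as the fraction of concordant object pairs, as below. The records are hypothetical, and the thesis covers far richer gradual and sequential forms:

```python
from itertools import combinations

# Hypothetical (age, memory-precision) records; the support of the gradual
# pattern "the greater the age, the lower the precision" is the fraction of
# object pairs that vary concordantly with it.
patients = [(25, 0.9), (40, 0.8), (55, 0.6), (70, 0.5)]

pairs = list(combinations(patients, 2))
concordant = sum((a1 - a2) * (p1 - p2) < 0      # age up while precision down
                 for (a1, p1), (a2, p2) in pairs)
print(f"support = {concordant}/{len(pairs)}")   # 6/6 for this toy data
```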
20. Business Process Mining

Skácel, Jan. January 2015.
This thesis explains business process mining and its principles. A substantial part is devoted to the problem of process discovery. Further, based on the analysis of a specific manufacturing process, three methods are proposed that try to identify shortcomings in the process. The first discovers the manufacturing process and renders it as a graph. The second method uses a simulator of production history to identify products that may have caused delays in the process; the acquired data are then used to mine frequent itemsets. The third method tries to predict the processing time at a selected workplace using association rules. The last two methods employ the Frequent Pattern Growth algorithm. The knowledge obtained in this thesis improves the efficiency of the manufacturing process and enables better production planning.
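The Frequent Pattern Growth step can be sketched with the mlxtend library on hypothetical production-history transactions (each list holds products processed together); the thesis used its own implementation over real manufacturing data:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Hypothetical production-history transactions: products processed together.
transactions = [["p1", "p2"], ["p1", "p3"], ["p1", "p2", "p3"], ["p2"]]

# One-hot encode the transactions, then run FP-Growth.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(itemsets)  # frequent itemsets such as {p1}, {p2}, {p1, p2}, ...
```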
