11

Automated Biological Data Acquisition And Integration Using Machine Learning Techniques

Carkacioglu, Levent 01 February 2009 (has links) (PDF)
Since the initial genome sequencing projects, advances in technology, molecular biology, and large-scale transcriptome analysis have led to data accumulation on a massive scale. These data are produced on different platforms and come from different laboratories, so there is a need for compilation and comprehensive analysis. In this thesis, we address the automation of biological data acquisition and integration from such non-uniform data using machine learning techniques. We focus on two mining studies. In the first, we characterize the expression patterns of housekeeping genes, describing methodologies for comparing measures of housekeeping genes against those of non-housekeeping genes. In the second, we propose a novel framework, bi-k-bi clustering, for finding association rules of gene pairs that can easily operate on large-scale and multiple heterogeneous data sets. The results of both studies are consistent with and related to the available literature. Furthermore, they provide some novel insights awaiting experimental validation by biologists.
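As a loose illustration of the first study's comparison, a per-gene stability score such as the coefficient of variation can separate stably expressed genes from variable ones. The measure, function name, and toy data below are illustrative assumptions, not the thesis's actual methodology:

```python
import numpy as np

def stability_profile(expr):
    """Coefficient of variation per gene across samples (rows = genes).

    Lower values indicate more stable expression, the behaviour
    expected of housekeeping genes."""
    expr = np.asarray(expr, dtype=float)
    return expr.std(axis=1) / expr.mean(axis=1)

# Toy expression matrices (genes x samples); values are illustrative only.
housekeeping = np.array([[9.8, 10.1, 10.0, 9.9],
                         [5.2,  5.0,  5.1, 5.3]])
other_genes  = np.array([[1.0,  8.0,  2.5, 7.0],
                         [4.0,  0.5,  6.0, 1.5]])

print("housekeeping CV:", stability_profile(housekeeping).round(3))
print("other genes CV: ", stability_profile(other_genes).round(3))
```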
12

From sequential patterns to concurrent branch patterns : a new post sequential patterns mining approach

Lu, Jing January 2006 (has links)
Sequential pattern mining is an important pattern discovery technique used to identify frequently observed sequential occurrences of items across ordered transactions over time. It has been intensively studied and a great diversity of algorithms exists. However, conventional sequential pattern mining has a major problem: the derived patterns are often numerous and not easy to understand or use, and more complex relations among events often remain hidden behind the sequences. A novel model for sequential patterns, called the Sequential Patterns Graph (SPG), is proposed. The construction algorithm of SPG is presented, with experimental results to substantiate the concept. The thesis then defines new structural patterns, such as concurrent branch patterns, exclusive patterns and iterative patterns, which are generally hidden behind sequential patterns. Finally, an integrative framework named Post Sequential Patterns Mining (PSPM), built on sequential pattern mining, is proposed for the discovery and visualisation of structural patterns. The thesis shows that discrete patterns derived from traditional sequential pattern mining can be modelled graphically using SPG. It is concluded from experiments and theoretical studies that SPG is not only a minimal representation of the mined sequential patterns, but also represents the interrelations among patterns and lays the foundation for mining structural knowledge (i.e. concurrent branch patterns, exclusive patterns and iterative patterns). From experiments conducted on both synthetic and real datasets, it is shown that Concurrent Branch Pattern (CBP) mining is an effective and efficient algorithm for discovering concurrent branch patterns.
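One plausible reading of the SPG idea is a directed graph that merges the mined sequential patterns on their shared items, so that branching nodes hint at concurrent branch patterns. The sketch below follows that reading under stated assumptions; it is not the thesis's construction algorithm, and all names are hypothetical:

```python
from collections import defaultdict

def build_spg(patterns):
    """Merge sequential patterns into a digraph: an edge a -> b means
    that item b directly follows item a in at least one mined pattern."""
    edges = defaultdict(set)
    for pattern in patterns:
        for a, b in zip(pattern, pattern[1:]):
            edges[a].add(b)
    return edges

# Toy mined patterns; "b" fans out to both "c" and "d", which is the kind
# of branching a concurrent branch pattern would capture.
patterns = [("a", "b", "c"), ("a", "b", "d"), ("e", "b", "c")]
for src, dests in sorted(build_spg(patterns).items()):
    print(src, "->", sorted(dests))
```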
13

Unsupervised discovery of activity primitives from multivariate sensor data

Minnen, David 08 July 2008 (has links)
This research addresses the problem of temporal pattern discovery in real-valued, multivariate sensor data. Several algorithms were developed, and subsequent evaluation demonstrates that they can efficiently and accurately discover unknown recurring patterns in time series data taken from many different domains. Different data representations and motif models were investigated in order to design an algorithm with an improved balance between run-time and detection accuracy. The different data representations are used to quickly filter large data sets in order to detect potential patterns that form the basis of a more detailed analysis. The representations include global discretization, which can be efficiently analyzed using a suffix tree, local discretization with a corresponding random projection algorithm for locating similar pairs of subsequences, and a density-based detection method that operates on the original, real-valued data. In addition, a new variation of the multivariate motif discovery problem is proposed in which each pattern may span only a subset of the input features. An algorithm that can efficiently discover such "subdimensional" patterns was developed and evaluated. The discovery algorithms are evaluated by measuring the detection accuracy of discovered patterns relative to a set of expected patterns for each data set. The data sets used for evaluation are drawn from a variety of domains including speech, on-body inertial sensors, music, American Sign Language video, and GPS tracks.
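The local-discretization variant echoes the well-known random-projection approach to motif discovery: symbolized subsequences that keep colliding under random masks become candidate motif pairs for detailed, real-valued verification. A minimal sketch of that general idea, assuming equal-length symbol strings; this is not the thesis's exact procedure:

```python
import random
from collections import defaultdict

def random_projection_pairs(words, mask_size, iterations=10, seed=0):
    """Count hash collisions between symbol strings under random masks.

    words: equal-length symbolic subsequences (one string each).
    In each iteration, mask_size positions are dropped at random; strings
    identical on the remaining positions land in the same bucket.
    Frequently colliding pairs are motif candidates."""
    rng = random.Random(seed)
    length = len(words[0])
    collisions = defaultdict(int)
    for _ in range(iterations):
        keep = sorted(rng.sample(range(length), length - mask_size))
        buckets = defaultdict(list)
        for idx, w in enumerate(words):
            buckets["".join(w[i] for i in keep)].append(idx)
        for group in buckets.values():
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    collisions[(group[i], group[j])] += 1
    return collisions

words = ["abba", "abca", "abba", "cddc"]
hits = random_projection_pairs(words, mask_size=1, iterations=20)
print(max(hits, key=hits.get))  # the identical pair (0, 2) collides most
```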
14

Spatio-temporal pattern discovery and hypothesis exploration using a delay reconstruction approach

Campbell, Alexander B. January 2008 (has links)
This thesis investigates the computer-based modelling and simulation of complex geospatial phenomena. Geospatial systems are real world processes which extend over some meaningful extent of the Earth's surface, such as cities and fisheries. There are many problems that require urgent attention in this domain (for example relating to sustainability) but despite increasing amounts of data and computational power there is a significant gap between the potential for model-based analyses and their actual impact on real world policy and planning. Analytical methods are confounded by the high dimensionality and nonlinearity of spatio-temporal systems and/or are hard to relate to meaningful policy decisions. Simulation-based approaches on the other hand are more heuristic and policy oriented in nature, but they are difficult to validate and almost always over-fit the data: although a given model can be calibrated on a given set of data, it usually performs very poorly on new unseen data sets. The central contribution of this thesis is a framework which is formally grounded and able to be rigorously validated, yet at the same time is interpretable in terms of real world phenomena and thus has a strong connection to domain knowledge. The scope of the thesis spans both theory and practice, and three specific contributions range along this span. Starting at the theoretical end, the first contribution concerns the conceptual and theoretical basis of the framework, which is a technique known as delay reconstruction. The underlying theory is rooted in the rather technical field of dynamical systems (itself largely based on differential topology), which has hindered its wider application and the formation of strong links with other areas. Therefore, the first contribution is an exposition of delay reconstruction in non-technical language, with a focus on explaining how some recent extensions to this theory make the concept far more widely applicable than is often assumed. The second contribution uses this theoretical foundation to develop a practical, unified framework for pattern discovery and hypothesis exploration in geospatial data. The central aspect of this framework is the linking of delay reconstruction with domain knowledge. This is done via the notion that determinism is not an on-off quantity, but rather that a given data set may be ascribed a particular 'degree' of determinism, and that that degree may be increased through manipulation of the data set using domain knowledge. This leads to a framework which can handle spatio-temporally complex (including multi-scale) data sets, is sensitive to the amount of data that is available, and is naturally geared to be used interactively with qualitative feedback conveyed to the user via geometry. The framework is complementary to other techniques in that it forms a scaffold within which almost all modelling approaches - including agent-based modelling - can be cast as particular kinds of 'manipulations' of the data, and as such are easily integrated. The third contribution examines the practical efficacy of the framework in a real world case study. This involves a high resolution spatio-temporal record of fish-catch data from trawlers operating in a large fishery.
The study is used to test two fundamental capabilities of the framework: (i) whether real world spatio-temporal phenomena can be identified in the degree-of-determinism signature of the data set, and (ii) whether the determinism level can then be increased by manipulating the data in response to these phenomena. One of the main outcomes of this study is a clear identification of the influence of the lunar cycle on the behaviour of Tiger and Endeavour prawns. The framework allows this influence to be 'non-destructively subtracted', increasing the detectability of further phenomena.
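Delay reconstruction itself is a standard construction: a scalar observable x(t) is unfolded into state vectors [x(t), x(t+τ), ..., x(t+(m-1)τ)]. A minimal sketch of that embedding follows; the dimension and delay values are illustrative choices, and nothing below comes from the thesis itself:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Delay-reconstruction matrix of a scalar series.

    Row i is [x[i], x[i+tau], ..., x[i+(dim-1)*tau]], approximating a
    state of the underlying dynamical system (Takens-style embedding)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 8 * np.pi, 400)
series = np.sin(t)                      # toy deterministic signal
states = delay_embed(series, dim=2, tau=25)
print(states.shape)                     # (375, 2): the reconstructed cycle
```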
15

Détection automatique de déviations chirurgicales et identification de comportements chirurgicaux par modélisation et analyse des processus chirurgicaux / Automatic detection of surgical deviations and identification of surgical behaviors through surgical process modeling and analysis

Huaulme, Arnaud 25 January 2017 (has links)
Adverse events (AEs) have become a genuine concern in the medical world, and reducing them is sought in order to ensure the best possible safety for patients. According to the French Haute Autorité de Santé (HAS), AEs are "situations that deviate from expected procedures or results in a usual situation and that are, or would potentially be, sources of harm". While postoperative AEs have been studied for many years, those occurring during operations have been studied only recently, as shown by the classification of intraoperative AEs by Kaafarani et al., published in 2014. However, classifying intraoperative AEs is only the first step toward understanding the surgical behaviors that cause them. In this thesis, we present methods for detecting the appearance of deviations due to intraoperative AEs and for identifying surgical behaviors from surgical process models. This work required the design and development of a formal model of rectopexy and of the adverse events associated with this surgical procedure, through the construction of ontologies. This formal model allowed us to grasp the principles of the operation and provided a vocabulary for the detailed annotation of endoscopic videos of rectopexies, in order to create models of the surgical processes involved. Based on the annotation of surgical videos with this model, we developed a method for the automatic detection of deviations due to the occurrence of adverse events. This method relies on a multi-dimensional non-linear temporal alignment, which we developed, followed by a hidden semi-Markov model trained to determine whether there are deviations from a reference surgery and whether they are due to adverse events. Detecting deviations due to adverse events is the first step toward understanding why they occur. We hypothesize that their occurrence can be explained by a succession of activities, that is, a pattern. To test this hypothesis, we developed a pattern discovery method for identifying surgical behaviors specific to different criteria. This identification of surgical behaviors is performed by agglomerative hierarchical clustering with a new metric based on the patterns shared between surgeries. To validate our method, we compared it against two studies highlighting differences in surgical behavior, for example between surgical sites or between two procedural variants of the same operation. Once the method was validated, we used it to show whether surgical behaviors exist that are specific to preoperative data and to the occurrence of adverse events. Finally, we review the most important contributions of this work in a general discussion and propose several directions for improving our results.
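The abstract's detection pipeline pairs a multi-dimensional non-linear temporal alignment with a hidden semi-Markov model. As a hedged stand-in for the alignment component only, the sketch below shows plain multivariate dynamic time warping, the closest standard building block; it is not the author's algorithm, and the toy activity sequences are assumptions:

```python
import numpy as np

def dtw_distance(a, b):
    """Multivariate dynamic time warping distance.

    a, b: arrays of shape (length, n_features); per-step cost is the
    Euclidean distance between feature vectors."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy feature tracks for a reference surgery and an observed one.
reference = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
observed  = np.array([[0, 0], [1, 1], [1, 1], [2, 2], [3, 3]], dtype=float)
print(dtw_distance(reference, observed))  # 0.0: same trajectory, different pacing
```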
16

CHSPAM: a multi-domain model for pattern monitoring in context histories

Dupont, Daniel Ambrosi 21 March 2017 (has links)
Ubiquitous computing studies techniques that seamlessly integrate information technology into people's daily lives, so that technological resources assist them proactively in the real world as they carry out everyday activities. One of the key enablers of this type of application is context awareness, which allows an application to adapt its behavior to the context in which the user is situated. As systems that use previously stored context information have developed, databases have emerged that store the context histories captured over time, and many researchers have studied ways to analyze these data. This work addresses a specific type of analysis of context histories: the discovery and monitoring of sequential patterns. A model called CHSPAM (Context History Pattern Monitoring) is proposed, which discovers sequential patterns in context history databases using existing data mining techniques and then tracks those patterns to monitor their evolution over time. The main contributions of this work are the use of a generic representation for storing contexts, which allows the model to be applied in multiple domains, and the monitoring of discovered patterns over time, storing a history of each pattern's evolution. A prototype was implemented and used in three experiments. The first, based on synthetic data, evaluated the functionalities and services offered by CHSPAM. In the second, the model was used in a prediction application, where the use of monitored patterns improved prediction accuracy compared to the use of unmonitored patterns. Finally, in the third experiment, CHSPAM served as a component of a learning object recommendation application, which was able to recommend objects related to students' interests based on monitored sequential patterns extracted from users' session histories.
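A minimal sketch of the monitoring idea: count each pattern's support per time window and keep a per-window history from the window in which the pattern first becomes frequent. The window structure, threshold, and toy context patterns are assumptions for illustration, not CHSPAM's actual interface:

```python
from collections import Counter, defaultdict

def monitor_patterns(windows, min_support=2):
    """Track how often each sequential pattern recurs across time windows.

    windows: list of lists of patterns (tuples) observed per period.
    Returns pattern -> per-window support counts since discovery, a
    simple stand-in for a stored pattern-evolution history."""
    history = defaultdict(list)
    seen = set()
    for window in windows:
        counts = Counter(window)
        seen |= {p for p, c in counts.items() if c >= min_support}
        for pattern in seen:
            history[pattern].append(counts.get(pattern, 0))
    return history

# Toy context transitions captured in two consecutive periods.
windows = [
    [("home", "work"), ("home", "work"), ("gym", "home")],
    [("home", "work"), ("gym", "home"), ("gym", "home")],
]
for pattern, counts in monitor_patterns(windows).items():
    print(pattern, counts)
```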
17

Association Pattern Analysis for Pattern Pruning, Clustering and Summarization

Li, Chung Lam 12 September 2008 (has links)
Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and are widely used in challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data, and the patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce such huge amounts of data that it poses a new knowledge management problem: tackling the thousands or more patterns discovered from a data set. Unlike raw data, patterns often overlap, entangle and interrelate with each other in the databases. The relationships among them are usually complex, and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis, since the patterns to be analyzed have already been discovered. Because of the overwhelming volume of discovered patterns, it is virtually impossible for a human user to analyze them manually; the valuable information trapped in the data is merely shifted to a large collection of patterns. Hence, a means to automatically analyze the discovered patterns and present the results in a user-friendly manner - that is, pattern post-analysis - is badly needed. This thesis addresses 1) the important factors contributing to the interrelationships among patterns, and hence more accurate measurements of the distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining. The thesis presents the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis for categorical or discrete-valued data. It starts from a natural dual relationship between patterns and data, which furnishes an explicit one-to-one correspondence between a pattern and its associated data and provides a basis for analyzing patterns effectively by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distance among patterns using a formal graphical approach. To measure the distances between patterns and their associated data accurately, both the samples and the attributes matched by the patterns are considered, so the distance measure must account for the differences between the associated data clusters at the attribute value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed; these quantify the uncertainty of the matched items and render an accurate and robust distance measurement between patterns and their associated data.
To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is derived analytically. The new pattern distances based on the dual pattern-data relationship, and their related concepts, are adapted to pattern pruning, pattern clustering and pattern summarization to furnish an integrated, flexible and generic framework for pattern post-analysis that can meet the challenges of today's complex real-world problems. In pattern pruning, the system defines the amount of redundancy of a pattern with respect to another pattern at the item level. This definition generalizes classical closed itemset pruning and maximal itemset pruning, which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition; it includes the closed and maximal itemsets as two extreme special cases and provides a control parameter for the user to adjust the tradeoff between the number of patterns pruned and the amount of information lost after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets is also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering, which generates the entire clustering hierarchy but is slow and not scalable, and k-means clustering, which produces only a partition but is fast and scalable. Together they can handle most real-world situations in terms of speed and clustering quality. The new clustering method simultaneously clusters patterns as well as their associated data while maintaining an explicit pattern-data relationship, which enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis of a pattern cluster is pattern summarization: to summarize each pattern cluster, a subset of representative patterns is selected for the cluster. Again, the system measures how representative a pattern is at the item level and takes into account how the patterns overlap each other. The proposed method, called AreaCover, extends the well-known RuleCover algorithm, and the relationship between the two methods is given. AreaCover is less prone to yielding large, trivial patterns (which may make a summary too general to be informative), and the resulting summary is more concise (with fewer duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster with longer summary patterns). The thesis also covers the implementation of the major ideas of the pattern post-analysis framework in an integrated software system. It ends with a discussion of experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with existing systems, the new methodology stands out, possessing significant and superior characteristics in pattern post-analysis and decision support.
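A loose sketch of the entropy-based idea: relate each pattern back to the rows it matches, and let attributes whose matched-value distributions disagree contribute an entropy-weighted amount to the distance. The measure, helper names, and toy data below are illustrative assumptions, not the thesis's formal definitions:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def matched_rows(pattern, data):
    """Rows whose attribute values agree with the pattern (a partial row)."""
    return [r for r in data if all(r[k] == v for k, v in pattern.items())]

def pattern_distance(p1, p2, data, attributes):
    """Toy item-level distance: an attribute contributes nothing when both
    patterns' matched-data distributions agree on it, and contributes the
    pooled entropy of its values when they differ."""
    r1, r2 = matched_rows(p1, data), matched_rows(p2, data)
    d = 0.0
    for a in attributes:
        v1 = sorted(Counter(r[a] for r in r1).items())
        v2 = sorted(Counter(r[a] for r in r2).items())
        if v1 != v2:
            d += entropy([r[a] for r in r1 + r2])
    return d

data = [{"x": 1, "y": "p"}, {"x": 1, "y": "q"}, {"x": 2, "y": "p"}]
print(round(pattern_distance({"x": 1}, {"y": "p"}, data, ["x", "y"]), 3))
```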
19

Effective Characterization of Sequence Data through Frequent Episodes

Ibrahim, A January 2015 (has links) (PDF)
Pattern discovery is an important area of data mining referring to a class of techniques designed for the extraction of interesting patterns from data. A pattern is some kind of local structure that captures correlations and dependencies present in the elements of the data. In general, pattern discovery is about finding all patterns of 'interest' in the data, and a popular measure of interestingness for a pattern is its frequency of occurrence. Thus the problem of frequent pattern discovery is to find all patterns in the data whose frequency of occurrence exceeds some user-defined threshold. Frequency is not the only measure for finding patterns of interest, however, and other measures and techniques exist. This thesis is concerned with the efficient discovery of inherent patterns from long sequence (temporally ordered) data. Mining of such sequentially ordered data is called temporal data mining, and the temporal patterns discovered from large sequential data are called episodes. More specifically, this thesis explores efficient methods for finding small and relevant subsets of episodes from sequence data that best characterize the data. The thesis also discusses methods for comparing datasets, based on comparing the sets of patterns representing them. The data in a frequent episode discovery framework is abstractly viewed as a single long sequence of events. Here, an event is a tuple (Ei, ti), where Ei is referred to as an event-type (taking values from a finite alphabet) and ti is the time of occurrence; events are ordered in non-decreasing order of time of occurrence. The pattern of interest in such a sequence is called an episode, which is a collection of event-types with a partial order defined over it. In this thesis, the focus is on a special type of episode called a serial episode, where there is a total order defined among the collection of event-types representing the episode. An occurrence of an episode is a subset of events from the data whose event-types match the set of event-types associated with the episode and whose order conforms to the underlying partial order of the episode. The frequency of an episode is some measure of how often it occurs in the event stream; many different notions of frequency have been defined in the literature. Given a frequency definition, the goal of frequent episode discovery is to unearth all episodes with frequency greater than a user-defined threshold. The size of an episode is the number of event-types in it. An episode β is called a subepisode of another episode α if the collection of event-types of β is a subset of the corresponding collection of α and the event-types of β satisfy the same partial-order relationships present among the corresponding event-types of α. The set of all episodes can be arranged in a partial-order lattice, where each level i contains episodes of size i and the partial order is the subepisode relationship. In general, there are two approaches for mining frequent episodes, based on the way one traverses this lattice: the Apriori approach traverses it in a breadth-first manner, while the Pattern-growth approach traverses it depth-first.
Different frequency notions exist for episodes, and many Apriori-based algorithms have been proposed for mining frequent episodes under them; however, Pattern-growth methods do not exist for many of the frequency notions. The first part of the thesis proposes new Pattern-growth methods for discovering frequent serial episodes under two frequency notions, called the non-overlapped frequency and the total frequency. Special cases, where additional conditions called span and gap constraints are imposed on the occurrences of the episodes, are also considered. The proposed methods, in general, consist of two steps: candidate generation and counting. The candidate generation step finds potential frequent episodes by following the general Pattern-growth approach, that is, a depth-first traversal of the lattice of all episodes. The counting step counts the frequencies of the episodes. The thesis presents efficient methods for counting the occurrences of serial episodes using occurrence windows of subepisodes, for both the non-overlapped and total frequency, and discusses the relative advantages of Pattern-growth approaches over Apriori approaches. Through detailed simulation results, the effectiveness of this approach on a host of synthetic and real data sets is shown, and the proposed methods are shown to be highly scalable and efficient in runtime compared to the existing Apriori approaches. One of the main issues in frequent pattern mining is the huge number of frequent patterns returned by discovery methods, irrespective of the approach taken. The second part of this thesis addresses this issue and discusses methods for selecting a small subset of relevant episodes from event sequences. A few approaches for finding a small subset of patterns have been discussed in the literature. One set of methods is information-theory based, where patterns that provide maximum information are sought. Another approach is summarization schemes based on the Minimum Description Length (MDL) principle: the data is encoded using a subset of patterns (which forms the model for the data) and its occurrences, and the subset of patterns with the maximum efficiency in encoding the data is the best representative model for the data. The MDL principle takes into account both the encoding efficiency of the model and the model complexity. A method called Constrained Serial episode Coding (CSC) is proposed based on the MDL principle; it returns a highly relevant, non-redundant and small subset of serial episodes, together with an encoding scheme in which the model representation and the encoding of the data are efficient. An interesting feature of this algorithm is that it does not need a user-specified frequency threshold. The effectiveness of the method is shown on two types of data. The first is data obtained from a detailed simulator for a reconfigurable coupled conveyor system, which consists of different intersecting paths through which packages flow. Mining such data can unearth the main paths of package flows, which can be useful in remote monitoring and visualization of the system.
On this data, it is shown that the proposed method returns highly consistent sub-paths, in the form of serial episodes, with great encoding efficiency compared to other known sequence summarization schemes such as SQS and GoKrimp. The second type of data consists of a collection of multi-class sequence datasets, where it is shown that the episodes selected by the proposed method form good features for classification, achieving better classification results than SQS and GoKrimp. The third and final part of the thesis discusses methods for comparing sets of patterns representing different datasets. There are many instances where one is interested in comparing datasets: in streaming data, for example, one wants to know whether the characteristics of the data have changed significantly, and in other cases one may simply wish to quantify the degree of similarity between two datasets. Since data are often characterized by a set of patterns as described above, comparing the sets of patterns representing two datasets gives information about the similarity or dissimilarity between them; however, not many measures exist for comparing sets of patterns. This thesis proposes such a similarity measure. First, a kernel for comparing two patterns, called the Pattern Kernel, is proposed for three types of patterns: serial episodes, sequential patterns and itemsets. Using this kernel, a Pattern Set Kernel is proposed for comparing different sets of patterns. The effectiveness of this kernel is shown in classification and change detection. The thesis concludes with a summary of the main contributions and some suggestions for extending the work presented here.
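The non-overlapped frequency mentioned above admits a particularly simple single-pass counting scheme for one serial episode: track one partial occurrence and restart after each completion. The sketch below handles a single episode, whereas the thesis's algorithms count many candidates simultaneously; the stream and episode are toy examples:

```python
def non_overlapped_count(event_stream, episode):
    """Greedy count of non-overlapped occurrences of a serial episode.

    event_stream: list of (event_type, time) pairs in time order.
    episode: tuple of event-types that must occur in this order.
    A fresh occurrence may only start after the previous one completes."""
    count, pos = 0, 0
    for event_type, _ in event_stream:
        if event_type == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1
                pos = 0
    return count

stream = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("C", 6)]
print(non_overlapped_count(stream, ("A", "B", "C")))  # 1 non-overlapped occurrence
```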
20

Discovering Frequent Episodes With General Partial Orders

Achar, Avinash 12 1900 (has links) (PDF)
Pattern discovery, a popular paradigm in data mining, refers to a class of techniques that try to extract unknown or interesting patterns from data. The work carried out in this thesis concerns frequent episode mining, a popular framework within pattern discovery, with applications in alarm management, fault analysis, network reconstruction, etc. The data here is in the form of a single long, time-ordered stream of events. The pattern of interest, the episode, is basically a set of event-types with a partial order on it. The task is to unearth all patterns (here, episodes) whose frequency exceeds a user-defined threshold, irrespective of pattern size. Most current discovery algorithms employ a level-wise Apriori-based method for mining, which adopts a breadth-first search of the space of all episodes. The episode literature has seen multiple ways of defining frequency, each with its own merits and demerits. The main reason different frequency definitions have been proposed is that, in general, counting all occurrences of a set of episodes is computationally very expensive. The first part of the thesis gives a unified view of all the Apriori-based discovery algorithms for serial episodes (those associated with a total order) under these various frequencies. Specifically, the various existing counting algorithms can be viewed as minor modifications of each other. We also provide some novel proofs of correctness for some of the serial episode counting schemes, which can in turn be generalized to episodes with general partial orders. Our unified view helps us derive quantitative relationships between the different frequencies. We also discuss the anti-monotonicity properties satisfied by the various frequencies, crucial information for the candidate generation step. The second part of the thesis proposes discovery algorithms for episodes with general partial orders, for which no algorithms previously existed in the literature. The proposed discovery algorithm is Apriori-based and generalizes the existing serial and parallel (associated with a trivial order) episode algorithms. It is a level-wise procedure involving candidate generation and counting at each level. In the context of general partial orders, a major problem in Apriori-based discovery is having an efficient candidate generation scheme, and we present a novel candidate generation algorithm for mining episodes with general partial orders. The counting algorithm for general partial-order episodes draws ideas from the unified view of counting for serial episodes presented in the first part of the work. We formally show the correctness of the proposed candidate generation and counting steps for general partial orders. The proposed candidate generation algorithm is flexible enough to mine certain specialized classes of partial orders (those satisfying what we call the maximal subepisode property), of which the serial and parallel classes of episodes are two specific instances. Our algorithm design initially restricts itself to the class of general partial-order episodes called injective episodes, in which repeated event-types are not allowed. We then generalize this to a larger class called chain episodes, where episodes can have some repeated event-types.
The class of chain episodes contains all (including non-injective) serial and parallel episodes, and thus our method properly generalizes the existing methods for serial and parallel episode discovery. We also discuss some problems in extending our algorithms beyond the class of chain episodes. In addition, we demonstrate that frequency alone is not a sufficient interestingness measure for episodes with unrestricted partial orders. To address this issue, we propose an additional measure called bidirectional evidence, which, together with frequency, is found to be extremely effective in unearthing interesting patterns. In the frequent episode framework, the choice of thresholds is most often user-defined and arbitrary. To address this, the last part of the work assesses the significance of partial-order episodes in a statistical sense, based on ideas from classical hypothesis testing. We declare an episode significant if its observed frequency in the data stream is large enough to be very unlikely under a random i.i.d. model. The key step in the significance analysis is computing the mean and variance of the time between successive occurrences of the pattern. This computation can be reformulated as solving for the mean and variance of the first visit time to a particular state in an associated Markov chain, which we do using a generating function approach. Using this, together with a Gaussian approximation to the frequency random variable, we can calculate a frequency threshold for any partial-order episode beyond which we infer it to be significant. Our significance analysis for general partial-order episodes generalizes the existing significance analysis for serial episode patterns. We demonstrate the effectiveness of our significance thresholds on synthetic data.
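The final significance test can be sketched with standard renewal-theory asymptotics: if μ and σ² are the mean and variance of the time between successive occurrences (which the thesis obtains from a Markov chain via generating functions), the occurrence count over n events is approximately Normal with mean n/μ and variance nσ²/μ³. The sketch below assumes μ and σ² are already known, and the numbers are illustrative only:

```python
import math
from statistics import NormalDist

def significance_threshold(n_events, mean_gap, var_gap, eps=0.05):
    """Frequency threshold beyond which an episode count is significant.

    Under the null model, occurrences form a renewal process, so over
    n_events the count is approximately Normal with mean n/mu and
    variance n*sigma^2/mu^3 (mu, var = gap mean and variance)."""
    mu, var = float(mean_gap), float(var_gap)
    mean_count = n_events / mu
    sd_count = math.sqrt(n_events * var / mu ** 3)
    z = NormalDist().inv_cdf(1 - eps)   # upper-tail critical value
    return mean_count + z * sd_count

# Illustrative gap statistics; in the thesis they come from the chain analysis.
print(round(significance_threshold(n_events=10000, mean_gap=125.0,
                                   var_gap=9000.0), 1))
```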
