Global ETD Search

1	Discovering Frequent Episodes With General Partial Orders Achar, Avinash 12 1900 (has links) (PDF) Pattern Discovery, a popular paradigm in data mining refers to a class of techniques that try and extract some unknown or interesting patterns from data. The work carried out in this thesis concerns frequent episode mining, a popular framework within pattern discovery, with applications in alarm management, fault analysis, network reconstruction etc. The data here is in the form of a single longtime-ordered stream of events. The pattern of interest here, namely episode, is basically a set of event-types with a partial order on it. The task here is to unearth all patterns( episodes here) which have a frequency above a user-defined threshold irrespective of pattern size. Most current discovery algorithms employ a level-wise a priori-based method for mining, which basically adopts a breadth-first search strategy of the space of all episodes. The episode literature has seen multiple ways of defining frequency with each definition having its own set of merits and demerits. The main reason for different frequencies definitions being proposed is that, in general, counting all occurrences of a set of episodes is computationally very expensive. The first part of the thesis gives a unified view of all the apriori-based discovery algorithms for serial episodes(associated with a total order)under these various frequencies. Specifically, the various existing counting algorithms can be viewed as minor modifications of each other. We also provide some novel proofs of correctness for some of the serial episode counting schemes, which in turn can be generalized to episodes with general partial orders. Our unified view helps us derive quantitative relationships between different frequencies. We also discuss all the anti-monotonicity properties satisfied by the various frequencies, a crucial information needed for the candidate generation step. The second part of the thesis proposes discovery algorithms for episodes with general partial orders, for which no algorithms currently exist in literature. The discovery algorithm proposed is apriori-based and generalizes the existing serial and parallel (associated with a trivial order) episode algorithms. The discovery algorithm is a level-wise procedure involving the steps of candidate generation and counting a teach level. In the context of general partial orders, a major problem in a priori-based discovery is to have an efficient candidate generation scheme. We present a novel candidate generation algorithm for mining episodes with general partial orders. The counting algorithm design for general partial order episodes draws ideas from the unified view of counting for serial episodes, presented in the first part of the work. We formally show the correctness of the proposed candidate generation and counting steps for general partial orders. The proposed candidate generation algorithm is flexible enough to be able to mine in certain specialized classes of partial orders (satisfying what we call maximal sub episode property), of which, the serial and parallel class of episodes are two specific instances. Our algorithm design initially restricts itself to the class of general partial order episodes called injective episodes wherein repeated event-types are not allowed. We then generalize this to a larger class of episodes called chain episodes, where episodes can have some repeated event types. The class of chain episodes contains all (including non-injective) serial and parallel episodes and thus our method properly generalizes the existing methods for serial and parallel episode discovery. We also discuss some problems in extending our algorithms to episodes beyond the class of chain episodes. Also, we demonstrate that frequency alone is not a sufficient enough interestingness measure for episodes with unrestricted partial orders. To address this issue, we propose an additional measure called bidirectional evidence to assess interestingness which, along with frequency is found to be extremely effective in unearthing interesting patterns. In the frequent episode framework, the choice of thresholds are most often user-defined and arbitrary. To address this issue, the last part of the work deals with assessing significance of partial order episodes in a statistical sense based on ideas from classical hypothesis testing. We declare an episode to be significant if its observed frequency in the data stream is large enough to be very unlikely, under a random i.i.d model .The key step in the significance analysis involves the mean and variance computation of the the time between successive occurrences of the pattern. This computation can be reformulated as, solving for the mean and variance of the first visit time to a particular stat e in an associated Markov chain. We use a generating function approach to solve for this mean and variance. Using this and a Gaussian approximation to the frequency random variable, we can now calculate a frequency threshold for any partial order episode, beyond which we infer it to be significant. Our significance analysis for general partial order episodes generalizes the existing significance analysis of serial episode patterns. We demonstrate on synthetic data the effectiveness of our significance thresholds. Datamining With Partial Orders Datamining Pattern Discovery Partial Order Episodes Episode Discovery General Partial Order Episodes Apriori-Based Episode Discovery Episode Mining Frequent Episode Discovery Computer Science
2	Mining Statistically Significant Temporal Associations In Multiple Event Sequences Liang, Han Unknown Date No description available. event sequences
3	Effective Characterization of Sequence Data through Frequent Episodes Ibrahim, A January 2015 (has links) (PDF) Pattern discovery is an important area of data mining referring to a class of techniques designed for the extraction of interesting patterns from the data. A pattern is some kind of a local structure that captures correlations and dependencies present in the elements of the data. In general, pattern discovery is about finding all patterns of `interest' in the data and a popular measure of interestingness for a pattern is its frequency of occurrence in the data. Thus the problem of frequent pattern discovery is to find all patterns in the data whose frequency of occurrence exceeds some user defined threshold. However, frequency of a pattern is not the only measure for finding patterns of interest and there also exist other measures and techniques for finding interesting patterns. This thesis is concerned with efficient discovery of inherent patterns from long sequence (or temporally ordered) data. Mining of such sequentially ordered data is called temporal data mining and the temporal patterns that are discovered from large sequential data are called episodes. More specifically, this thesis explores efficient methods for finding small and relevant subsets of episodes from sequence data that best characterize the data. The thesis also discusses methods for comparing datasets, based on comparing the sets of patterns representing the datasets. The data in a frequent episode discovery framework is abstractly viewed as a single long sequence of events. Here, the event is a tuple, (Ei; ti), where Ei is referred to as an event-type (taking values from a finite alphabet set) and ti is the time of occurrence. The events are ordered in the non-decreasing order of the time of occurrence. The pattern of interest in such a sequence is called an episode, which is a collection of event-types with a partial order defined over it. In this thesis, the focus is on a special type of episode called serial episode, where there is a total order defined among the collection of event-types representing the episode. The occurrence of an episode is essentially a subset of events from the data whose event-types match the set of eventtypes associated with the episode and the order in which they occur conforms to the underlying partial order of the episode. The frequency of an episode is some measure of how often it occurs in the event stream. Many different notions of frequency have been defined in literature. Given a frequency definition, the goal of frequent episode discovery is to unearth all episodes which have a frequency greater than a user-defined threshold. The size of an episode is the number of event-types in the episode. An episode β is called a subepisode of another episode β, if the collection of event-types of β is a subset of the corresponding collection of α and the event-types of β satisfy the same partial order relationships present among the corresponding event-types of α. The set of all episodes can be arranged in a partial order lattice, where each level i contains episodes of size i and the partial order is the subepisode relationship. In general, there are two approaches for mining frequent episodes, based on the way one traverses this lattice. The first approach is to traverse this lattice in a breadth-first manner, and is called the Apriori approach. The other approach is the Pattern growth approach, where the lattice is traversed in a depth-first manner. There exist different frequency notions for episodes, and many Apriori based algorithms have been proposed for mining frequent episodes under the different frequencies. However there do not exist Pattern-growth based methods for many of the frequency notions. The first part of the thesis proposes new Pattern-growth methods for discovering frequent serial episodes under two frequency notions called the non-overlapped frequency and the total frequency. Special cases, where certain additional conditions, called the span and gap constraints, are imposed on the occurrences of the episodes are also considered. The proposed methods, in general, consist of two steps: the candidate generation step and the counting step. The candidate generation step involves finding potential frequent episodes. This is done by following the general Pattern growth approach for finding the candidates, which is the depth-first traversal of the lattice of all episodes. The second step, which is the counting step, involves counting the frequencies of the episodes. The thesis presents efficient methods for counting the occurrences of serial episodes using occurrence windows of subepisodes for both the non-overlapped and total frequency. The relative advantages of Pattern-growth approaches over Apriori approaches are also discussed. Through detailed simulation results, the effectiveness of this approach on a host of synthetic and real data sets is shown. It is shown that the proposed methods are highly scalable and efficient in runtime as compared to the existing Apriori approaches. One of the main issues in frequent pattern mining is the huge number of frequent patterns, returned by the discovery methods, irrespective of the approach taken to solve the problems. The second part of this thesis, addresses this issue and discusses methods of selecting a small subset of relevant episodes from event sequences. There have been a few approaches, discussed in the literature, for finding a small subset of patterns. One set of methods are information theory based methods, where patterns that provide maximum information are searched for. Another approach is the Minimum Description Length (MDL) principle based summarization schemes. Here the data is encoded using a subset of patterns (which forms the model for the data) and its occurrences. The subset of patterns that has the maximum efficiency in encoding the data is the best representative model for the data. The MDL principle takes into account both the encoding efficiency of the model as well as model complexity. A method, called Constrained Serial episode Coding(CSC), is proposed based on the MDL principle, which returns a highly relevant, non-redundant and small subset of serial episodes. This also includes an encoding scheme, where the model representation and the encoding of the data are efficient. An interesting feature of this algorithm for isolating a small set of relevant episodes is that it does not need a user-specified threshold on frequency. The effectiveness of this method is shown on two types of data. The first is data obtained from a detailed simulator for a reconfigurable coupled conveyor system. The conveyor system consists of different intersecting paths and packages flow through such a network. Mining of such data can allow one to unearth the main paths of package ows which can be useful in remote monitoring and visualization of the system. On this data, it is shown that the proposed method is able to return highly consistent sub paths, in the form of serial episodes, with great encoding efficiency as compared to other known related sequence summarization schemes, like SQS and GoKrimp. The second type of data consists of a collection of multi-class sequence datasets. It is shown that the selected episodes from the proposed method form good features in classi cation. The proposed method is compared with SQS and GoKrimp, and it is shown that the episodes selected by this method help in achieving better classification results as compared to other methods. The third and nal part of the thesis discusses methods for comparing sets of patterns representing different datasets. There are many instances when one is interested in comparing datasets. For example, in streaming data, one is interested in knowing whether the characteristics of the data are the same or have changed significantly. In other cases, one may simply like to compare two datasets and quantify the degree of similarity between them. Often, data are characterized by a set of patterns as described above. Comparing sets of patterns representing datasets gives information about the similarity/dissimilarity between the datasets. However not many measures exist for comparing sets of patterns. This thesis proposes a similarity measure for comparing sets of patterns which in turn aids in comparison of di erent datasets. First, a kernel for comparing two patterns, called the Pattern Kernel, is proposed. This kernel is proposed for three types of patterns: serial episodes, sequential patterns and itemsets. Using this kernel, a Pattern Set Kernel is proposed for comparing different sets of patterns. The effectiveness of this kernel is shown in classification and change detection. The thesis concludes with a summary of the main contributions and some suggestions for extending the work presented here. Data Mining Pattern Discovery Pattern Mining Sequencial Pattern Episode Formalism Episode Discovery Pattern Set Kernel Episodes Pattern Kernel Frequent Episode Mining Electrical Engineering
4	Discovering Frequent Episodes : Fast Algorithms, Connections With HMMs And Generalizations Laxman, Srivatsan 03 1900 (has links) Temporal data mining is concerned with the exploration of large sequential (or temporally ordered) data sets to discover some nontrivial information that was previously unknown to the data owner. Sequential data sets come up naturally in a wide range of application domains, ranging from bioinformatics to manufacturing processes. Pattern discovery refers to a broad class of data mining techniques in which the objective is to unearth hidden patterns or unexpected trends in the data. In general, pattern discovery is about finding all patterns of 'interest' in the data and one popular measure of interestingness for a pattern is its frequency in the data. The problem of frequent pattern discovery is to find all patterns in the data whose frequency exceeds some user-defined threshold. Discovery of temporal patterns that occur frequently in sequential data has received a lot of attention in recent times. Different approaches consider different classes of temporal patterns and propose different algorithms for their efficient discovery from the data. This thesis is concerned with a specific class of temporal patterns called episodes and their discovery in large sequential data sets. In the framework of frequent episode discovery, data (referred to as an event sequence or an event stream) is available as a single long sequence of events. The ith event in the sequence is an ordered pair, (Et,tt), where Et takes values from a finite alphabet (of event types), and U is the time of occurrence of the event. The events in the sequence are ordered according to these times of occurrence. An episode (which is the temporal pattern considered in this framework) is a (typically) short partially ordered sequence of event types. Formally, an episode is a triple, (V,<,9), where V is a collection of nodes, < is a partial order on V and 9 is a map that assigns an event type to each node of the episode. When < is total, the episode is referred to as a serial episode, and when < is trivial (or empty), the episode is referred to as a parallel episode. An episode is said to occur in an event sequence if there are events in the sequence, with event types same as those constituting the episode, and with times of occurrence respecting the partial order in the episode. The frequency of an episode is some measure of how often it occurs in the event sequence. Given a frequency definition for episodes, the task is to discover all episodes whose frequencies exceed some threshold. This is done using a level-wise procedure. In each level, a candidate generation step is used to combine frequent episodes from the previous level to build candidates of the next larger size, and then a frequency counting step makes one pass over the event stream to determine frequencies of all the candidates and thus identify the frequent episodes. Frequency counting is the main computationally intensive step in frequent episode discovery. Choice of frequency definition for episodes has a direct bearing on the efficiency of the counting procedure. In the original framework of frequent episode discovery, episode frequency is defined as the number of fixed-width sliding windows over the data in which the episode occurs at least once. Under this frequency definition, frequency counting of a set of \|C\| candidate serial episodes of size N has space complexity O(N\|C\|) and time complexity O(ΔTN\|C\|) (where ΔT is the difference between the times of occurrence of the last and the first event in the data stream). The other main frequency definition available in the literature, defines episode frequency as the number of minimal occurrences of the episode (where, a minimal occurrence is a window on the time axis containing an occurrence of the episode, such that, no proper sub-window of it contains another occurrence of the episode). The algorithm for obtaining frequencies for a set of \|C\| episodes needs O(n\|C\|) time (where n denotes the number of events in the data stream). While this is time-wise better than the the windows-based algorithm, the space needed to locate minimal occurrences of an episode can be very high (and is in fact of the order of length, n, of the event stream). This thesis proposes a new definition for episode frequency, based on the notion of, what is called, non-overlapped occurrences of episodes in the event stream. Two occurrences are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other. Frequency of an episode is defined as the maximum possible number of non-overlapped occurrences of the episode in the data. The thesis also presents algorithms for efficient frequent episode discovery under this frequency definition. The space and time complexities for frequency counting of serial episodes are O(\|C\|) and O(n\|C\|) respectively (where n denotes the total number of events in the given event sequence and \|C\| denotes the num-ber of candidate episodes). These are arguably the best possible space and time complexities for the frequency counting step that can be achieved. Also, the fact that the time needed by the non-overlapped occurrences-based algorithm is linear in the number of events, n, in the event sequence (rather than the difference, ΔT, between occurrence times of the first and last events in the data stream, as is the case with the windows-based algorithm), can result in considerable time advantage when the number of time ticks far exceeds the number of events in the event stream. The thesis also presents efficient algorithms for frequent episode discovery under expiry time constraints (according to which, an occurrence of an episode can be counted for its frequency only if the total time span of the occurrence is less than a user-defined threshold). It is shown through simulation experiments that, in terms of actual run-times, frequent episode discovery under the non-overlapped occurrences-based frequency (using the algorithms developed here) is much faster than existing methods. There is also a second frequency measure that is proposed in this thesis, which is based on, what is termed as, non-interleaved occurrences of episodes in the data. This definition counts certain kinds of overlapping occurrences of the episode. The time needed is linear in the number of events, n, in the data sequence, the size, N, of episodes and the number of candidates, \|C\|. Simulation experiments show that run-time performance under this frequency definition is slightly inferior compared to the non-overlapped occurrences-based frequency, but is still better than the run-times under the windows-based frequency. This thesis also establishes the following interesting property that connects the non-overlapped, the non-interleaved and the minimal occurrences-based frequencies of an episode in the data: the number of minimal occurrences of an episode is bounded below by the maximum number of non-overlapped occurrences of the episode, and is bounded above by the maximum number of non-interleaved occurrences of the episode in the data. Hence, non-interleaved occurrences-based frequency is an efficient alternative to that based on minimal occurrences. In addition to being superior in terms of both time and space complexities compared to all other existing algorithms for frequent episode discovery, the non-overlapped occurrences-based frequency has another very important property. It facilitates a formal connection between discovering frequent serial episodes in data streams and learning or estimating a model for the data generation process in terms of certain kinds of Hidden Markov Models (HMMs). In order to establish this connection, a special class of HMMs, called Episode Generating HMMs (EGHs) are defined. The symbol set for the HMM is chosen to be the alphabet of event types, so that, the output of EGHs can be regarded as event streams in the frequent episode discovery framework. Given a serial episode, α, that occurs in the event stream, a method is proposed to uniquely associate it with an EGH, Λα. Consider two N-node serial episodes, α and β, whose (non-overlapped occurrences-based) frequencies in the given event stream, o, are fα and fβ respectively. Let Λα and Λβ be the EGHs associated with α and β. The main result connecting episodes and EGHs states that, the joint probability of o and the most likely state sequence for Λα is more than the corresponding probability for Λβ, if and only if, fα is greater than fβ. This theoretical connection has some interesting consequences. First of all, since the most frequent serial episode is associated with the EGH having the highest data likelihood, frequent episode discovery can now be interpreted as a generative model learning exercise. More importantly, it is now possible to derive a formal test of significance for serial episodes in the data, that prescribes, for a given size of the test, a minimum frequency for the episode needed in order to declare it as statistically significant. Note that this significance test for serial episodes does not require any separate model estimation (or training). The only quantity required to assess significance of an episode is its non-overlapped occurrences-based frequency (and this is obtained through the usual counting procedure). The significance test also helps to automatically fix the frequency threshold for the frequent episode discovery process, so that it can lead to what may be termed parameterless data mining. In the framework considered so far, the input to frequent episode discovery process is a sequence of instantaneous events. However, in many applications events tend to persist for different periods of time and the durations may carry important information from a data mining perspective. This thesis extends the framework of frequent episodes to incorporate such duration information directly into the definition of episodes, so that, the patterns discovered will now carry this duration information as well. Each event in this generalized framework looks like a triple, (Ei, ti, τi), where Ei, as earlier, is the event type (from some finite alphabet) corresponding to the ith event, and ti and τi denote the start and end times of this event. The new temporal pattern, called the generalized episode, is a quadruple, (V, <, g, d), where V, < and g, as earlier, respectively denote a collection of nodes, a partial order over this collection and a map assigning event types to nodes. The new feature in the generalized episode is d, which is a map from V to 2I, where, I denotes a collection of time interval possibilities for event durations, which is defined by the user. An occurrence of a generalized episode in the event sequence consists of events with both 'correct' event types and 'correct' time durations, appearing in the event sequence in 'correct' time order. All frequency definitions for episodes over instantaneous event streams are applicable for generalized episodes as well. The algorithms for frequent episode discovery also easily extend to the case of generalized episodes. The extra design choice that the user has in this generalized framework, is the set, I, of time interval possibilities. This can be used to orient and focus the frequent episode discovery process to come up with temporal correlations involving only time durations that are of interest. Through extensive simulations the utility and effectiveness of the generalized framework are demonstrated. The new algorithms for frequent episode discovery presented in this thesis are used to develop an application for temporal data mining of some data from car engine manufacturing plants. Engine manufacturing is a heavily automated and complex distributed controlled process with large amounts of faults data logged each day. The goal of temporal data mining here is to unearth some strong time-ordered correlations in the data which can facilitate quick diagnosis of root causes for persistent problems and predict major breakdowns well in advance. This thesis presents an application of the algorithms developed here for such analysis of the faults data. The data consists of time-stamped faults logged in car engine manufacturing plants of General Motors. Each fault is logged using an extensive list of codes (which constitutes the alphabet of event types for frequent episode discovery). Frequent episodes in fault logs represent temporal correlations among faults and these can be used for fault diagnosis in the plant. This thesis describes how the outputs from the frequent episode discovery framework, can be used to help plant engineers interpret the large volumes of faults logged, in an efficient and convenient manner. Such a system, based on the algorithms developed in this thesis, is currently being used in one of the engine manufacturing plants of General Motors. Some examples of the results obtained that were regarded as useful by the plant engineers are also presented. Data Mining Databases - Data Mining Hidden Markov Models Temporal Data Mining Episodes (Temporal Patterns) Event Sequences Frequent Episodes Episode Discovery Fast Algorithms Pattern Discovery Computer Science

1

Page generated in 0.0581 seconds