Global ETD Search

1	Data Mining On Architecture Simulation Maden, Engin 01 March 2010 Data mining is the process of extracting patterns from large volumes of data. One of its branches is sequence data mining, in which the data is viewed as a sequence of events, each with an associated time of occurrence. Sequence data is modelled using episodes, with events grouped into episodes. The aim of this thesis is to analyse architecture simulation output data by applying episode mining techniques, to show the previously known relationships between events in the architecture, and to provide an environment for predicting the performance of a program on an architecture before the code is executed. An important point is the application area: architecture simulation data is a new domain for episode mining techniques, and using their results to predict the performance of programs on an architecture before execution can be considered a new approach. For this purpose, a data mining tool has been developed that implements three episode mining techniques: the WINEPI approach, the non-overlapping occurrence based approach and the MINEPI approach. The tool has three main components: a data pre-processor, an episode miner and an output analyser. QA Computer Software 76.75-76.765
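As an illustration of the window-based idea behind WINEPI mentioned in this abstract, the following is a minimal sketch, not the tool developed in the thesis. It assumes integer timestamps, events already sorted by time, and a serial episode given as a tuple of event types. The names `Event`, `occurs_in` and `window_frequency` are hypothetical, and the window convention (every window of the given width that overlaps the sequence) is one common choice.

```python
from typing import Sequence, Tuple

Event = Tuple[str, int]  # (event type, integer time of occurrence); assumed layout


def occurs_in(episode: Sequence[str], events: Sequence[Event]) -> bool:
    """Return True if the serial episode appears as an ordered subsequence."""
    pos = 0
    for etype, _time in events:
        if pos < len(episode) and etype == episode[pos]:
            pos += 1
    return pos == len(episode)


def window_frequency(episode: Sequence[str], events: Sequence[Event], width: int) -> float:
    """Fraction of sliding windows of the given width that contain the episode."""
    if not events:
        return 0.0
    t_start = events[0][1]
    t_end = events[-1][1] + 1  # exclusive end of the sequence
    hits = 0
    total = 0
    # Slide over every window [w_start, w_start + width) that overlaps the sequence.
    for w_start in range(t_start - width + 1, t_end):
        window = [e for e in events if w_start <= e[1] < w_start + width]
        total += 1
        hits += occurs_in(episode, window)
    return hits / total
```

An episode would then be reported as frequent when `window_frequency` meets a user-chosen threshold. The sketch rescans each window for clarity; a practical miner would update the windows incrementally.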
2	NEURAL NETWORK ON VIRTUALIZATION SYSTEM, AS A WAY TO MANAGE FAILURE EVENTS OCCURRENCE ON CLOUD COMPUTING Pham, Khoi Minh 01 June 2018 Cloud computing is an important direction in current technology trends and dominates the industry in many respects. It has become an intense battlefield for many large technology companies, and whoever wins has a high potential to shape the next generation of technologies. From a technical point of view, cloud computing is classified into three categories, each providing different crucial services to users: Infrastructure (Hardware) as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS). The standard measurement of cloud computing reliability is normally based on two approaches: Service Level Agreements (SLAs) and Quality of Service (QoS). This thesis focuses on the error event logs of IaaS cloud systems as an aspect of QoS in IaaS cloud reliability. IaaS is essentially a derivation of the traditional virtualization system, in which multiple virtual machines (VMs) with different operating system (OS) platforms run on a single physical machine (PM) that has enough computational power. The PM plays the role of the host machine in cloud computing, and the VMs play the role of the guest machines. Because full access to a complete real cloud system was not available, this thesis investigates the technical reliability of an IaaS cloud through a simulated virtualization system. By collecting and analyzing the event logs generated by the virtualization system, we obtain a general overview of the system’s technical reliability based on the number of error events that occur in the system.
These events are then fed to a neural network time series model to detect patterns in the system’s failure events and to predict the next error event that will occur in the virtualization system. neural network event sequence time series visualization cloud computing event log reliability episode mining back-propagation Computer and Systems Architecture Computer Engineering
3	Effective Characterization of Sequence Data through Frequent Episodes Ibrahim, A January 2015 Pattern discovery is an important area of data mining referring to a class of techniques designed for the extraction of interesting patterns from data. A pattern is a local structure that captures correlations and dependencies present in the elements of the data. In general, pattern discovery is about finding all patterns of 'interest' in the data, and a popular measure of interestingness for a pattern is its frequency of occurrence in the data. Thus the problem of frequent pattern discovery is to find all patterns in the data whose frequency of occurrence exceeds some user-defined threshold. However, frequency is not the only measure for finding patterns of interest, and other measures and techniques for finding interesting patterns also exist. This thesis is concerned with the efficient discovery of inherent patterns from long sequence (or temporally ordered) data. Mining of such sequentially ordered data is called temporal data mining, and the temporal patterns discovered from large sequential data are called episodes. More specifically, this thesis explores efficient methods for finding small and relevant subsets of episodes from sequence data that best characterize the data. The thesis also discusses methods for comparing datasets, based on comparing the sets of patterns representing the datasets. The data in a frequent episode discovery framework is abstractly viewed as a single long sequence of events. Here, an event is a tuple (E_i, t_i), where E_i is referred to as an event-type (taking values from a finite alphabet) and t_i is the time of occurrence. The events are ordered in non-decreasing order of the time of occurrence. The pattern of interest in such a sequence is called an episode, which is a collection of event-types with a partial order defined over it.
In this thesis, the focus is on a special type of episode called a serial episode, where a total order is defined among the collection of event-types representing the episode. An occurrence of an episode is essentially a subset of events from the data whose event-types match the set of event-types associated with the episode and whose order of occurrence conforms to the underlying partial order of the episode. The frequency of an episode is some measure of how often it occurs in the event stream. Many different notions of frequency have been defined in the literature. Given a frequency definition, the goal of frequent episode discovery is to unearth all episodes that have a frequency greater than a user-defined threshold. The size of an episode is the number of event-types in the episode. An episode β is called a subepisode of another episode α if the collection of event-types of β is a subset of the corresponding collection of α and the event-types of β satisfy the same partial order relationships present among the corresponding event-types of α. The set of all episodes can be arranged in a partial order lattice, where each level i contains episodes of size i and the partial order is the subepisode relationship. In general, there are two approaches for mining frequent episodes, based on the way one traverses this lattice. The first approach traverses the lattice in a breadth-first manner and is called the Apriori approach. The other is the Pattern-growth approach, where the lattice is traversed in a depth-first manner. There exist different frequency notions for episodes, and many Apriori-based algorithms have been proposed for mining frequent episodes under the different frequencies. However, Pattern-growth based methods do not exist for many of the frequency notions.
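For serial episodes, the subepisode relation described above reduces to an ordered-subsequence test, and the episodes one level down in the lattice are obtained by dropping a single event-type. The following is a minimal sketch under that assumption, with a serial episode represented as a tuple of event-type names. The function names are hypothetical and are not taken from the thesis.

```python
from typing import Sequence, Set, Tuple


def is_serial_subepisode(beta: Sequence[str], alpha: Sequence[str]) -> bool:
    """True if serial episode beta is a subepisode (ordered subsequence) of alpha."""
    remaining = iter(alpha)
    # Each event-type of beta must be found in alpha after the previous match.
    return all(any(b == a for a in remaining) for b in beta)


def immediate_subepisodes(alpha: Sequence[str]) -> Set[Tuple[str, ...]]:
    """Subepisodes of size len(alpha) - 1, i.e. the lattice level just below alpha."""
    alpha = tuple(alpha)
    return {alpha[:i] + alpha[i + 1:] for i in range(len(alpha))}
```

A breadth-first (Apriori-style) miner would use `immediate_subepisodes` to prune a candidate whose smaller subepisodes are not all frequent, whereas a depth-first (Pattern-growth) miner extends one frequent episode at a time.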
The first part of the thesis proposes new Pattern-growth methods for discovering frequent serial episodes under two frequency notions, called the non-overlapped frequency and the total frequency. Special cases, where additional conditions called span and gap constraints are imposed on the occurrences of the episodes, are also considered. The proposed methods, in general, consist of two steps: the candidate generation step and the counting step. The candidate generation step finds potential frequent episodes by following the general Pattern-growth approach, which is a depth-first traversal of the lattice of all episodes. The counting step counts the frequencies of the episodes. The thesis presents efficient methods for counting the occurrences of serial episodes using occurrence windows of subepisodes, for both the non-overlapped and the total frequency. The relative advantages of Pattern-growth approaches over Apriori approaches are also discussed. Detailed simulation results show the effectiveness of this approach on a host of synthetic and real data sets. It is shown that the proposed methods are highly scalable and efficient in runtime compared to the existing Apriori approaches. One of the main issues in frequent pattern mining is the huge number of frequent patterns returned by the discovery methods, irrespective of the approach taken to solve the problem. The second part of this thesis addresses this issue and discusses methods of selecting a small subset of relevant episodes from event sequences. A few approaches for finding a small subset of patterns have been discussed in the literature. One set of methods is based on information theory, where patterns that provide maximum information are searched for. Another approach is the set of summarization schemes based on the Minimum Description Length (MDL) principle.
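To make the non-overlapped frequency concrete, here is a minimal sketch of the simple single-automaton scan commonly used for counting non-overlapped occurrences of a serial episode. It is not the occurrence-window method proposed in the thesis, and it ignores span and gap constraints. It assumes events are sorted by time with distinct timestamps; the function name is hypothetical.

```python
from typing import Sequence, Tuple

Event = Tuple[str, float]  # (event type, time of occurrence); assumed layout


def count_non_overlapped(episode: Sequence[str], events: Sequence[Event]) -> int:
    """Count non-overlapped occurrences of a serial episode with one greedy pass."""
    if not episode:
        return 0
    count = 0
    state = 0  # index of the next event-type the automaton is waiting for
    for etype, _time in events:
        if etype == episode[state]:
            state += 1
            if state == len(episode):
                # A full occurrence just ended; restart so the next one cannot overlap it.
                count += 1
                state = 0
    return count
```

Because the automaton restarts only after an occurrence completes, every counted occurrence starts after the previous one ends, which is the non-overlap condition.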
Here the data is encoded using a subset of patterns (which forms the model for the data) and its occurrences. The subset of patterns that encodes the data most efficiently is the best representative model for the data. The MDL principle takes into account both the encoding efficiency of the model and the model complexity. A method called Constrained Serial episode Coding (CSC) is proposed based on the MDL principle; it returns a highly relevant, non-redundant and small subset of serial episodes. The method includes an encoding scheme in which the model representation and the encoding of the data are efficient. An interesting feature of this algorithm for isolating a small set of relevant episodes is that it does not need a user-specified threshold on frequency. The effectiveness of this method is shown on two types of data. The first is data obtained from a detailed simulator for a reconfigurable coupled conveyor system. The conveyor system consists of different intersecting paths, and packages flow through this network. Mining such data can unearth the main paths of package flows, which can be useful in remote monitoring and visualization of the system. On this data, it is shown that the proposed method returns highly consistent sub-paths, in the form of serial episodes, with great encoding efficiency compared to other known related sequence summarization schemes such as SQS and GoKrimp. The second type of data consists of a collection of multi-class sequence datasets. It is shown that the episodes selected by the proposed method form good features for classification. The proposed method is compared with SQS and GoKrimp, and it is shown that the episodes selected by this method help achieve better classification results than the other methods. The third and final part of the thesis discusses methods for comparing sets of patterns representing different datasets.
There are many instances when one is interested in comparing datasets. For example, in streaming data, one is interested in knowing whether the characteristics of the data have stayed the same or have changed significantly. In other cases, one may simply want to compare two datasets and quantify the degree of similarity between them. Often, data are characterized by a set of patterns, as described above. Comparing the sets of patterns representing datasets gives information about the similarity or dissimilarity between the datasets. However, not many measures exist for comparing sets of patterns. This thesis proposes a similarity measure for comparing sets of patterns, which in turn aids in the comparison of different datasets. First, a kernel for comparing two patterns, called the Pattern Kernel, is proposed. This kernel is proposed for three types of patterns: serial episodes, sequential patterns and itemsets. Using this kernel, a Pattern Set Kernel is proposed for comparing different sets of patterns. The effectiveness of this kernel is shown in classification and change detection. The thesis concludes with a summary of the main contributions and some suggestions for extending the work presented here. Data Mining Pattern Discovery Pattern Mining Sequencial Pattern Episode Formalism Episode Discovery Pattern Set Kernel Episodes Pattern Kernel Frequent Episode Mining Electrical Engineering
4	Discovering Frequent Episodes With General Partial Orders Achar, Avinash 12 1900 Pattern discovery, a popular paradigm in data mining, refers to a class of techniques that try to extract unknown or interesting patterns from data. The work carried out in this thesis concerns frequent episode mining, a popular framework within pattern discovery, with applications in alarm management, fault analysis, network reconstruction etc. The data here is in the form of a single long time-ordered stream of events. The pattern of interest, namely the episode, is basically a set of event-types with a partial order on it. The task is to unearth all patterns (episodes here) that have a frequency above a user-defined threshold, irrespective of pattern size. Most current discovery algorithms employ a level-wise Apriori-based method for mining, which adopts a breadth-first search strategy over the space of all episodes. The episode literature has seen multiple ways of defining frequency, with each definition having its own merits and demerits. The main reason for different frequency definitions being proposed is that, in general, counting all occurrences of a set of episodes is computationally very expensive. The first part of the thesis gives a unified view of all the Apriori-based discovery algorithms for serial episodes (associated with a total order) under these various frequencies. Specifically, the various existing counting algorithms can be viewed as minor modifications of each other. We also provide novel proofs of correctness for some of the serial episode counting schemes, which in turn can be generalized to episodes with general partial orders. Our unified view helps us derive quantitative relationships between different frequencies. We also discuss the anti-monotonicity properties satisfied by the various frequencies, which are crucial information for the candidate generation step.
The second part of the thesis proposes discovery algorithms for episodes with general partial orders, for which no algorithms currently exist in the literature. The proposed discovery algorithm is Apriori-based and generalizes the existing serial and parallel (associated with a trivial order) episode algorithms. It is a level-wise procedure involving the steps of candidate generation and counting at each level. In the context of general partial orders, a major problem in Apriori-based discovery is to have an efficient candidate generation scheme. We present a novel candidate generation algorithm for mining episodes with general partial orders. The design of the counting algorithm for general partial order episodes draws on the unified view of counting for serial episodes presented in the first part of the work. We formally show the correctness of the proposed candidate generation and counting steps for general partial orders. The proposed candidate generation algorithm is flexible enough to mine in certain specialized classes of partial orders (satisfying what we call the maximal subepisode property), of which the serial and parallel classes of episodes are two specific instances. Our algorithm design initially restricts itself to the class of general partial order episodes called injective episodes, in which repeated event-types are not allowed. We then generalize this to a larger class called chain episodes, where episodes can have some repeated event-types. The class of chain episodes contains all (including non-injective) serial and parallel episodes, and thus our method properly generalizes the existing methods for serial and parallel episode discovery. We also discuss some problems in extending our algorithms to episodes beyond the class of chain episodes. In addition, we demonstrate that frequency alone is not a sufficient interestingness measure for episodes with unrestricted partial orders.
To address this issue, we propose an additional measure of interestingness called bidirectional evidence which, along with frequency, is found to be extremely effective in unearthing interesting patterns. In the frequent episode framework, the choice of thresholds is most often user-defined and arbitrary. To address this issue, the last part of the work deals with assessing the significance of partial order episodes in a statistical sense, based on ideas from classical hypothesis testing. We declare an episode to be significant if its observed frequency in the data stream is large enough to be very unlikely under a random i.i.d. model. The key step in the significance analysis is computing the mean and variance of the time between successive occurrences of the pattern. This computation can be reformulated as solving for the mean and variance of the first visit time to a particular state in an associated Markov chain. We use a generating function approach to solve for this mean and variance. Using this and a Gaussian approximation to the frequency random variable, we can calculate a frequency threshold for any partial order episode, beyond which we infer it to be significant. Our significance analysis for general partial order episodes generalizes the existing significance analysis of serial episode patterns. We demonstrate the effectiveness of our significance thresholds on synthetic data. Datamining With Partial Orders Datamining Pattern Discovery Partial Order Episodes Episode Discovery General Partial Order Episodes Apriori-Based Episode Discovery Episode Mining Frequent Episode Discovery Computer Science
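The last step of the significance analysis, turning a mean and variance of the time between occurrences into a frequency threshold, can be sketched with the standard renewal-theory Gaussian approximation. This is an assumption-laden illustration and may differ from the exact formula used in the thesis: the mean and variance are taken as given inputs (the thesis derives them from a Markov chain), the count over a stream of length T is approximated as normal with mean T/μ and variance Tσ²/μ³, and the function name and the default significance level are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def significance_threshold(stream_length: float, mean_gap: float,
                           var_gap: float, alpha: float = 0.05) -> float:
    """Frequency above which an episode is deemed significant at level alpha.

    mean_gap and var_gap are the mean and variance of the time between
    successive occurrences under the null (random i.i.d.) model.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)             # one-sided normal quantile
    expected = stream_length / mean_gap               # approximate mean count
    variance = stream_length * var_gap / mean_gap ** 3  # approximate count variance
    return expected + z * sqrt(variance)
```

An observed frequency larger than the returned value would be unlikely under the null model at the chosen level, so the episode would be flagged as significant.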
