Global ETD Search

11	Apprentissage de règles associatives temporelles pour les séquences temporelles de symboles / Learning temporal association rules on Symbolic time sequences Guillame-Bert, Mathieu 23 November 2012 (has links) L'apprentissage de modèles temporels constitue l'une des grandes problématiques de l'Exploration de Données (Data Mining). Dans cette thèse, nous avons développé un nouveau modèle temporel appelé TITA Rules (Règle associative temporelle basé sur des arbres d'intervalles). Ce modèle permet de décrire des phénomènes ayant un certain degré d'incertitude et/ou d'imprécision. Ce modèle permet entre autres d'exprimer la synchronicité entre évènements, les contraintes temporelles disjonctives et la négation temporelle. De par leur nature, les TITA Rules peuvent êtes utilisées pour effectuer des prédictions avec une grande précision temporel. Nous avons aussi développé un algorithme capable de découvrir et d'extraire de manière efficace des TITA Rules dans de grandes bases de données temporelles. Le cœur de l'algorithme est basé sur des techniques de minimisation d'entropie, de filtrage par Apriori et par des analyses de co-dépendance. Note modèle temporelle et notre algorithme ont été appliqués et évalués sur plusieurs jeux de données issues de phénomènes réels et de phénomènes simulés. La seconde partie de cette thèse à consisté à étudier l'utilisation de notre modèle temporel sur la problématique de la Planification Automatique. Ces travaux ont mené au développement d'un algorithme de planification automatique. L'algorithme prend en entrée un ensemble de TITA Rules décrivant le fonctionnement d'un système quelconque, une description de l'état initial du système, et un but à atteindre. En retour, l'algorithme calcule un plan décrivant la meilleure façon d'atteindre le but donné. Par la nature même des TITA Rules, cet algorithme est capable de gérer l'incertain (probabilités), l'imprécision temporelle, les contraintes temporelles disjonctives, ainsi que les événements exogènes prédictibles mais imprécis. / The learning of temporal patterns is a major challenge of Data mining. We introduce a temporal pattern model called Temporal Interval Tree Association Rules (Tita rules or Titar). This pattern model can be used to express both uncertainty and temporal inaccuracy of temporal events. Among other things, Tita rules can express the usual time point operators, synchronicity, order, and chaining,disjunctive time constraints, as well as temporal negation. Tita rules are designed to allow predictions with optimum temporal precision. Using this representation, we present the Titar learner algorithm that can be used to extract Tita rules from large datasets expressed as Symbolic Time Sequences. This algorithm based on entropy minimization, apriori pruning and statistical dependence analysis. We evaluate our technique on simulated and real world datasets. The problem of temporal planning with Tita rules is studied. We use Tita rules as world description models for a Planning and Scheduling task. We present an efficient temporal planning algorithm able to deal with uncertainty, temporal inaccuracy, discontinuous (or disjunctive) time constraints and predictable but imprecisely time located exogenous events. We evaluate our technique by joining a learning algorithm and our planning algorithm into a simple reactive cognitive architecture that we apply to control a robot in a virtual world. Apprentissage automatique Exploration de données temporelles Incertitude temporelle Planification automatique Contraintes temporelles disjonctives Machine Learning Temporal Data Mining Uncertainty Automated planning and scheduling Disjunctive Temporal Constraints
12	Discovering Frequent Episodes : Fast Algorithms, Connections With HMMs And Generalizations Laxman, Srivatsan 03 1900 (has links) Temporal data mining is concerned with the exploration of large sequential (or temporally ordered) data sets to discover some nontrivial information that was previously unknown to the data owner. Sequential data sets come up naturally in a wide range of application domains, ranging from bioinformatics to manufacturing processes. Pattern discovery refers to a broad class of data mining techniques in which the objective is to unearth hidden patterns or unexpected trends in the data. In general, pattern discovery is about finding all patterns of 'interest' in the data and one popular measure of interestingness for a pattern is its frequency in the data. The problem of frequent pattern discovery is to find all patterns in the data whose frequency exceeds some user-defined threshold. Discovery of temporal patterns that occur frequently in sequential data has received a lot of attention in recent times. Different approaches consider different classes of temporal patterns and propose different algorithms for their efficient discovery from the data. This thesis is concerned with a specific class of temporal patterns called episodes and their discovery in large sequential data sets. In the framework of frequent episode discovery, data (referred to as an event sequence or an event stream) is available as a single long sequence of events. The ith event in the sequence is an ordered pair, (Et,tt), where Et takes values from a finite alphabet (of event types), and U is the time of occurrence of the event. The events in the sequence are ordered according to these times of occurrence. An episode (which is the temporal pattern considered in this framework) is a (typically) short partially ordered sequence of event types. Formally, an episode is a triple, (V,<,9), where V is a collection of nodes, < is a partial order on V and 9 is a map that assigns an event type to each node of the episode. When < is total, the episode is referred to as a serial episode, and when < is trivial (or empty), the episode is referred to as a parallel episode. An episode is said to occur in an event sequence if there are events in the sequence, with event types same as those constituting the episode, and with times of occurrence respecting the partial order in the episode. The frequency of an episode is some measure of how often it occurs in the event sequence. Given a frequency definition for episodes, the task is to discover all episodes whose frequencies exceed some threshold. This is done using a level-wise procedure. In each level, a candidate generation step is used to combine frequent episodes from the previous level to build candidates of the next larger size, and then a frequency counting step makes one pass over the event stream to determine frequencies of all the candidates and thus identify the frequent episodes. Frequency counting is the main computationally intensive step in frequent episode discovery. Choice of frequency definition for episodes has a direct bearing on the efficiency of the counting procedure. In the original framework of frequent episode discovery, episode frequency is defined as the number of fixed-width sliding windows over the data in which the episode occurs at least once. Under this frequency definition, frequency counting of a set of \|C\| candidate serial episodes of size N has space complexity O(N\|C\|) and time complexity O(ΔTN\|C\|) (where ΔT is the difference between the times of occurrence of the last and the first event in the data stream). The other main frequency definition available in the literature, defines episode frequency as the number of minimal occurrences of the episode (where, a minimal occurrence is a window on the time axis containing an occurrence of the episode, such that, no proper sub-window of it contains another occurrence of the episode). The algorithm for obtaining frequencies for a set of \|C\| episodes needs O(n\|C\|) time (where n denotes the number of events in the data stream). While this is time-wise better than the the windows-based algorithm, the space needed to locate minimal occurrences of an episode can be very high (and is in fact of the order of length, n, of the event stream). This thesis proposes a new definition for episode frequency, based on the notion of, what is called, non-overlapped occurrences of episodes in the event stream. Two occurrences are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other. Frequency of an episode is defined as the maximum possible number of non-overlapped occurrences of the episode in the data. The thesis also presents algorithms for efficient frequent episode discovery under this frequency definition. The space and time complexities for frequency counting of serial episodes are O(\|C\|) and O(n\|C\|) respectively (where n denotes the total number of events in the given event sequence and \|C\| denotes the num-ber of candidate episodes). These are arguably the best possible space and time complexities for the frequency counting step that can be achieved. Also, the fact that the time needed by the non-overlapped occurrences-based algorithm is linear in the number of events, n, in the event sequence (rather than the difference, ΔT, between occurrence times of the first and last events in the data stream, as is the case with the windows-based algorithm), can result in considerable time advantage when the number of time ticks far exceeds the number of events in the event stream. The thesis also presents efficient algorithms for frequent episode discovery under expiry time constraints (according to which, an occurrence of an episode can be counted for its frequency only if the total time span of the occurrence is less than a user-defined threshold). It is shown through simulation experiments that, in terms of actual run-times, frequent episode discovery under the non-overlapped occurrences-based frequency (using the algorithms developed here) is much faster than existing methods. There is also a second frequency measure that is proposed in this thesis, which is based on, what is termed as, non-interleaved occurrences of episodes in the data. This definition counts certain kinds of overlapping occurrences of the episode. The time needed is linear in the number of events, n, in the data sequence, the size, N, of episodes and the number of candidates, \|C\|. Simulation experiments show that run-time performance under this frequency definition is slightly inferior compared to the non-overlapped occurrences-based frequency, but is still better than the run-times under the windows-based frequency. This thesis also establishes the following interesting property that connects the non-overlapped, the non-interleaved and the minimal occurrences-based frequencies of an episode in the data: the number of minimal occurrences of an episode is bounded below by the maximum number of non-overlapped occurrences of the episode, and is bounded above by the maximum number of non-interleaved occurrences of the episode in the data. Hence, non-interleaved occurrences-based frequency is an efficient alternative to that based on minimal occurrences. In addition to being superior in terms of both time and space complexities compared to all other existing algorithms for frequent episode discovery, the non-overlapped occurrences-based frequency has another very important property. It facilitates a formal connection between discovering frequent serial episodes in data streams and learning or estimating a model for the data generation process in terms of certain kinds of Hidden Markov Models (HMMs). In order to establish this connection, a special class of HMMs, called Episode Generating HMMs (EGHs) are defined. The symbol set for the HMM is chosen to be the alphabet of event types, so that, the output of EGHs can be regarded as event streams in the frequent episode discovery framework. Given a serial episode, α, that occurs in the event stream, a method is proposed to uniquely associate it with an EGH, Λα. Consider two N-node serial episodes, α and β, whose (non-overlapped occurrences-based) frequencies in the given event stream, o, are fα and fβ respectively. Let Λα and Λβ be the EGHs associated with α and β. The main result connecting episodes and EGHs states that, the joint probability of o and the most likely state sequence for Λα is more than the corresponding probability for Λβ, if and only if, fα is greater than fβ. This theoretical connection has some interesting consequences. First of all, since the most frequent serial episode is associated with the EGH having the highest data likelihood, frequent episode discovery can now be interpreted as a generative model learning exercise. More importantly, it is now possible to derive a formal test of significance for serial episodes in the data, that prescribes, for a given size of the test, a minimum frequency for the episode needed in order to declare it as statistically significant. Note that this significance test for serial episodes does not require any separate model estimation (or training). The only quantity required to assess significance of an episode is its non-overlapped occurrences-based frequency (and this is obtained through the usual counting procedure). The significance test also helps to automatically fix the frequency threshold for the frequent episode discovery process, so that it can lead to what may be termed parameterless data mining. In the framework considered so far, the input to frequent episode discovery process is a sequence of instantaneous events. However, in many applications events tend to persist for different periods of time and the durations may carry important information from a data mining perspective. This thesis extends the framework of frequent episodes to incorporate such duration information directly into the definition of episodes, so that, the patterns discovered will now carry this duration information as well. Each event in this generalized framework looks like a triple, (Ei, ti, τi), where Ei, as earlier, is the event type (from some finite alphabet) corresponding to the ith event, and ti and τi denote the start and end times of this event. The new temporal pattern, called the generalized episode, is a quadruple, (V, <, g, d), where V, < and g, as earlier, respectively denote a collection of nodes, a partial order over this collection and a map assigning event types to nodes. The new feature in the generalized episode is d, which is a map from V to 2I, where, I denotes a collection of time interval possibilities for event durations, which is defined by the user. An occurrence of a generalized episode in the event sequence consists of events with both 'correct' event types and 'correct' time durations, appearing in the event sequence in 'correct' time order. All frequency definitions for episodes over instantaneous event streams are applicable for generalized episodes as well. The algorithms for frequent episode discovery also easily extend to the case of generalized episodes. The extra design choice that the user has in this generalized framework, is the set, I, of time interval possibilities. This can be used to orient and focus the frequent episode discovery process to come up with temporal correlations involving only time durations that are of interest. Through extensive simulations the utility and effectiveness of the generalized framework are demonstrated. The new algorithms for frequent episode discovery presented in this thesis are used to develop an application for temporal data mining of some data from car engine manufacturing plants. Engine manufacturing is a heavily automated and complex distributed controlled process with large amounts of faults data logged each day. The goal of temporal data mining here is to unearth some strong time-ordered correlations in the data which can facilitate quick diagnosis of root causes for persistent problems and predict major breakdowns well in advance. This thesis presents an application of the algorithms developed here for such analysis of the faults data. The data consists of time-stamped faults logged in car engine manufacturing plants of General Motors. Each fault is logged using an extensive list of codes (which constitutes the alphabet of event types for frequent episode discovery). Frequent episodes in fault logs represent temporal correlations among faults and these can be used for fault diagnosis in the plant. This thesis describes how the outputs from the frequent episode discovery framework, can be used to help plant engineers interpret the large volumes of faults logged, in an efficient and convenient manner. Such a system, based on the algorithms developed in this thesis, is currently being used in one of the engine manufacturing plants of General Motors. Some examples of the results obtained that were regarded as useful by the plant engineers are also presented. Data Mining Databases - Data Mining Hidden Markov Models Temporal Data Mining Episodes (Temporal Patterns) Event Sequences Frequent Episodes Episode Discovery Fast Algorithms Pattern Discovery Computer Science
13	應用在空間認知發展的學習歷程分析之高效率空間探勘演算法 / Efficient Mining of Spatial Co-orientation Patterns for Analyzing Portfolios of Spatial Cognitive Development 魏綾音, WEI, LING-YIN Unknown Date (has links) 空間認知(Spatial Cognition)指出人所理解的空間複雜度，也就是人與環境互動的過程中，經由記憶與感官經驗，透過內化與重建產生物體在空間的關係認知。認知圖(Cognitive Map)是最常被使用在評估空間認知。分析學生所畫的認知圖有助於老師們瞭解學生的空間認知能力，進而擬定適當的地理教學設計。我們視空間認知發展的學習歷程檔案是由這些認知圖所構成。隨著數位學習科技的進步，我們可以透過探勘認知圖的方式，探討空間認知發展的學習歷程檔案。因此，我們藉由透過圖像的空間資料探勘，分析學生空間認知發展的學習歷程。空間資料探勘(Spatial Data Mining)主要是從空間資料庫或圖像資料庫中找出有趣且有意義的樣式。在論文中，我們介紹一種空間樣式(Spatial Co-orientation Pattern)探勘以提供空間認知發展學習歷程的分析。Spatial Co-orientation Pattern是指圖像資料庫中，具有共同相對方向關係的物體(Object)常一起出現。例如，我們可以從圖像資料庫中發現物體P常出現在物體Q的左邊，我們利用二維字串(2D String)來表示物體分佈在圖像中的空間方向關係。我們透過Pattern-growth的方法探勘此種空間樣式，藉由實驗結果呈現Pattern-growth的方法與過去Apriori-based的方法[14]之優缺點。我們延伸Spatial Co-orientation Pattern的概念至時空資料庫(Spatio-temporal Database)，提出從時空資料庫中，探勘Temporal Co-orientation Pattern。Temporal Co-orientation Pattern是指Spatial Co-orientation Pattern隨著時間的變化。論文中，我們提出兩種此類樣式，即是Coarse Temporal Co-orientation Pattern與Fine Temporal Co-orientation Pattern。針對此兩種樣式，我們提出三階段(three-stage)演算法，透過實驗分析演算法的效率。 / Spatial cognition means how human interpret spatial complexity. Cognitive maps are mostly used to test the spatial cognition. Analyzing cognitive maps drawn by students is helpful for teachers to understand students’ spatial cognitive ability and to draft geography teaching plans. Cognitive maps constitute the portfolios of spatial cognitive development. With the advance of e-learning technology, we can analyze portfolios of spatial cognitive development by spatial data mining of cognitive images. Therefore, we can analyze portfolios of spatial cognitive development by spatial data mining of images. Spatial data mining is an important task to discover interesting and meaningful patterns from spatial or image databases. In this thesis, we investigate the spatial co-orientation patterns for analyzing portfolios of spatial cognitive development. Spatial co-orientation patterns refer to objects that frequently occur with the same spatial orientation, e.g. left, right, below, etc., among images. For example, an object P is frequently left to an object Q among images. We utilize the data structure, 2D string, to represent the spatial orientation of objects. We propose the pattern-growth approach for mining co-orientation patterns. An experimental evaluation with synthetic datasets shows the advantages and disadvantages between pattern-growth approach and Apriori-based approach proposed by Huang [14]. Moreover, we extend the concept of spatial co-orientation pattern to that of temporal patterns. Temporal co-orientation patterns refer to the change of spatial co-orientation patterns over time. Two temporal patterns, the coarse temporal co-orientation patterns and fine temporal co-orientation patterns are introduced to be extracted from spatio-temporal databases. We propose the three-stage algorithms, CTPMiner and FTPMiner, for mining coarse and fine temporal co-orientation patterns, respectively. An experimental evaluation with synthetic datasets shows the performance of these algorithms. 空間認知認知圖空間資料探勘時空資料探勘 Spatial Cognition Cognitive Map Spatial Data Mining Spatio-temporal Data Mining

Page generated in 0.0945 seconds