21

Signatures : detecting and characterizing complex recurrent behavior in sequential data / Détection et caractérisation de comportements complexes récurrents dans des données séquentielles

Gautrais, Clément 16 October 2018 (has links)
This thesis introduces a new type of pattern called the signature. The signature segments a sequence of itemsets so as to maximize the size of the set of items that appears in every segment. The signature was originally introduced to identify a supermarket customer's favorite products from their purchase history. Its originality comes from identifying recurrent items that 1) can appear at different temporal scales, 2) can have irregular occurrences, and 3) can be quickly understood by analysts. Since existing pattern mining approaches do not have these three properties, we introduced the signature. By comparing the signature with state-of-the-art methods, we showed that it can identify new regularities in the data while also capturing the regularities detected by existing methods. Although initially tied to the field of pattern mining, we also linked the signature mining problem to the field of sequence segmentation. We then defined several algorithms, using methods related to both pattern mining and sequence segmentation. Signatures were used to analyze a large dataset from a French supermarket. A qualitative analysis of the signatures computed for these real customers showed that signatures are able to identify a customer's favorite products. Signatures were also able to detect and characterize customer churn. This thesis also defines two extensions of the signature. The first extension is called the sky-signature. The sky-signature presents the recurrent items of a sequence at different time scales; it can be seen as an efficient way to summarize the signatures computed at all possible time scales. Sky-signatures were used to analyze the campaign speeches of the candidates in the 2016 US presidential election. They identified each candidate's main campaign themes as well as their campaign rhythm. This analysis also showed that signatures can be used on other kinds of datasets. Finally, this thesis introduces a second extension of the signature, which computes the signature that best fits the data. This extension uses a model selection technique based on the minimum description length principle, commonly used in pattern mining. It was also used to analyze supermarket customers.
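To make the segmentation objective concrete, here is a minimal Python sketch, not the thesis's actual algorithm: it brute-forces every split of a toy basket sequence into k contiguous segments and keeps the split whose segments share the most items. The `signature` function and the example baskets are assumptions added purely for illustration.

```python
from itertools import combinations

def signature(sequence, k):
    """Brute-force a k-segmentation of a sequence of itemsets that maximizes
    the number of items present in every segment (the 'signature' items).

    sequence : list of sets of items (e.g. one set per shopping basket)
    k        : number of segments
    Returns (segment_boundaries, common_items).
    """
    n = len(sequence)
    best_bounds, best_items = None, set()
    # Choose k-1 cut positions; each segment is a contiguous block of itemsets.
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        segments = [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
        # Items present in a segment = union of its itemsets;
        # signature items = items present in every segment.
        common = set.intersection(*(set().union(*seg) for seg in segments))
        if len(common) > len(best_items):
            best_bounds, best_items = bounds, common
    return best_bounds, best_items

baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"beer"},
           {"milk", "bread", "coffee"}, {"milk"}]
print(signature(baskets, 2))   # best split and its shared items, e.g. ([0, 1, 5], {'milk', 'bread'})
```

With two segments, items that recur even at irregular intervals surface as the shared set, which is the intuition behind the favorite-product use case described above.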
22

Feature extraction and similarity-based analysis for proteome and genome databases

Öztürk, Özgür. January 2007 (has links)
Thesis (Ph. D.)--Ohio State University, 2007. / Title from first page of PDF file. Includes bibliographical references (p. 108-119).
23

Mining Mobile Group Patterns Using Trajectory Approximation

Huang, Chin-Ming 29 July 2004 (has links)
In this paper, we present a novel approach to mining moving-object group patterns from an object movement database. Our approach first summarizes the raw data in the source object movement database into trajectories, and then discovers valid 2-groups mainly from the trajectory-based object movement database. We propose two trajectory conversion methods, namely linear regression and vector conversion. We further propose a trajectory-based mobile group mining algorithm intended to reduce the overhead of mining 2-group patterns. The use of trajectories allows valid 2-groups to be mined from a smaller number of summarized records (in the trajectory model) and by examining a smaller number of candidate 2-groups. Finally, we conduct a series of comprehensive experiments to evaluate and compare the performance of the proposed methods against existing approaches that use the source object movement database or other summarization techniques. The experimental results demonstrate the superior performance of our proposed approach.
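As a hedged illustration of the linear-regression conversion mentioned above (the function names and toy coordinates are assumptions, not the thesis's implementation), a raw movement record can be reduced to a few regression coefficients and two summaries compared with a simple distance:

```python
import numpy as np

def summarize_trajectory(times, xs, ys):
    """Summarize a sequence of timestamped positions with one linear fit per axis.
    Returns 4 regression coefficients, a compact stand-in for the raw records."""
    ax, bx = np.polyfit(times, xs, deg=1)   # x(t) ~ ax * t + bx
    ay, by = np.polyfit(times, ys, deg=1)   # y(t) ~ ay * t + by
    return np.array([ax, bx, ay, by])

def trajectory_distance(t1, t2):
    """Euclidean distance between two summaries; objects whose summaries stay
    close are candidate members of a 2-group."""
    return float(np.linalg.norm(t1 - t2))

obj_a = summarize_trajectory([0, 1, 2, 3], [0.0, 1.1, 2.0, 3.1], [0.0, 0.9, 2.1, 3.0])
obj_b = summarize_trajectory([0, 1, 2, 3], [0.1, 1.0, 2.2, 2.9], [0.0, 1.0, 2.0, 3.1])
print(trajectory_distance(obj_a, obj_b) < 0.5)   # True: the two objects move together
```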
24

A Classification System For The Problem Of Protein Subcellular Localization

Alay, Gokcen 01 September 2007 (has links) (PDF)
The focus of this study is on predicting the subcellular localization of a protein. Subcellular localization information is important for protein function annotation, a fundamental problem in computational biology. For this problem, a classification system is built that has two main parts: a predictor based on a feature mapping technique that extracts biologically meaningful information from protein sequences, and a client/server architecture for searching and predicting subcellular localizations. In the first part of the thesis, we describe a feature mapping technique based on frequent patterns. Frequent patterns in a protein sequence dataset are identified using a search technique based on the Apriori property, and the distribution of these patterns over a new sample is used as its feature vector for classification. The effect of a number of feature selection methods on classification performance is investigated and the best one is applied. The method is assessed on the subcellular localization prediction problem with 4 compartments (endoplasmic reticulum (ER) targeted, cytosolic, mitochondrial, and nuclear), using the same dataset as P2SL. Our method improved the overall accuracy to 91.71%, up from the 81.96% originally achieved by P2SL. In the second part of the thesis, a client/server architecture is designed and implemented based on Simple Object Access Protocol (SOAP) technology, providing a user-friendly interface for accessing the protein subcellular localization predictions. The client part is a Cytoscape plug-in used for functional enrichment of biological networks. Rather than using subcellular localization information for individual proteins, this plug-in lets biologists analyze a set of genes/proteins from a system-level view.
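The sketch below illustrates the general idea of the feature mapping described above, using invented motifs and a toy sequence rather than the thesis's Apriori-derived frequent patterns: a sequence is mapped to the normalized occurrence counts of a fixed pattern list, which can then feed any classifier.

```python
def pattern_feature_vector(sequence, patterns):
    """Map a protein sequence to a feature vector: one entry per pattern,
    holding the length-normalized number of (possibly overlapping) occurrences."""
    counts = []
    for p in patterns:
        c = sum(1 for i in range(len(sequence) - len(p) + 1)
                if sequence[i:i + len(p)] == p)
        counts.append(c / max(len(sequence), 1))
    return counts

frequent_patterns = ["KDEL", "LL", "RR"]       # hypothetical frequent motifs
print(pattern_feature_vector("MKLLRRKDELLL", frequent_patterns))
```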
25

Data Mining On Architecture Simulation

Maden, Engin 01 March 2010 (has links) (PDF)
Data mining is the process of extracting patterns from huge amounts of data. One branch of data mining is sequence mining, where the data can be viewed as a sequence of events, each with an associated time of occurrence. Sequence data is modelled using episodes, which group together related events. The aim of this thesis is to analyse architecture simulation output data by applying episode mining techniques, to expose the previously known relationships between events in the architecture, and to provide an environment for predicting the performance of a program on an architecture before its code is executed. An important point here is the application area: architecture simulation data is a new domain for episode mining techniques, and using their results to predict the performance of programs on an architecture before execution can be considered a new approach. For this purpose, a data mining tool has been developed that implements three episode mining techniques: the WINEPI approach, the non-overlapping-occurrence-based approach, and the MINEPI approach. The tool has three main components: a data pre-processor, an episode miner, and an output analyser.
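As a rough, simplified sketch of the WINEPI windowing idea (restricted to parallel episodes over integer timestamps; the event names are invented and this is not the tool described in the thesis):

```python
def winepi_frequency(events, episode, win):
    """WINEPI-style frequency of a parallel episode: the fraction of sliding
    time windows of width `win` that contain every event type of the episode.

    events  : list of (timestamp, event_type), sorted by timestamp
    episode : set of event types
    """
    if not events:
        return 0.0
    t_min, t_max = events[0][0], events[-1][0]
    starts = range(t_min - win + 1, t_max + 1)   # windows [s, s + win)
    hits = 0
    for s in starts:
        window_types = {e for t, e in events if s <= t < s + win}
        if episode <= window_types:
            hits += 1
    return hits / len(starts)

trace = [(1, "cache_miss"), (2, "stall"), (4, "cache_miss"),
         (5, "branch_mispredict"), (6, "stall")]
print(winepi_frequency(trace, {"cache_miss", "stall"}, win=3))   # 0.5
```

The other two techniques named above count occurrences rather than windows: MINEPI works with minimal occurrences of an episode, and the non-overlapping-occurrence-based approach counts disjoint occurrences.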
26

Dažnų sekų analizė sprendimų priėmimui labai didelėse duomenų bazėse / Frequent pattern analysis for decision making in big data

Pragarauskaitė, Julija 01 July 2013 (has links)
Huge amounts of digital information are stored in the world today, and the amount is increasing by quintillions of bytes every day. Approximate data mining algorithms are very important for dealing efficiently with such amounts of data, because of the computation speed required by various real-world applications, whereas exact data mining methods tend to be slow and are best employed where precise results are of the highest importance. This thesis focuses on several data mining tasks related to the analysis of big data: frequent pattern mining and visual representation. For mining frequent patterns in big data, three novel approximate methods are proposed and evaluated on real and artificial databases:
• Random Sampling Method (RSM) creates a random sample of the original database and decides which sequences are frequent and which are rare based on the analysis of that sample. A significant benefit is a theoretical estimate of the classification errors made by this method, obtained using standard statistical methods.
• Multiple Re-sampling Method (MRM) is an improved version of RSM with a re-sampling strategy that decreases the probability of incorrectly classifying sequences as frequent or rare.
• Markov Property Based Method (MPBM) relies upon the Markov property. MPBM reads the original database several times (as many times as the order of the Markov process) and then calculates the empirical frequencies using the Markov property.
For the visual representation of large amounts of data, online shoppers' behaviour data was used and analysed using... [to full text]
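A minimal sketch of the Random Sampling Method idea described above, written for itemsets rather than sequences for brevity; the candidate list, thresholds and function name are assumptions, not the dissertation's implementation.

```python
import random

def rsm_frequent(database, candidates, min_support, sample_frac=0.1, seed=0):
    """Estimate which candidate patterns are frequent by measuring their
    support on a random sample of the database instead of the full data.

    database    : list of transactions (each a set of items)
    candidates  : list of candidate patterns (sets of items)
    min_support : relative support threshold, e.g. 0.4
    """
    random.seed(seed)
    k = max(1, int(sample_frac * len(database)))
    sample = random.sample(database, k)
    frequent = []
    for pattern in candidates:
        support = sum(1 for t in sample if pattern <= t) / len(sample)
        if support >= min_support:
            frequent.append((pattern, support))   # estimated, not exact, support
    return frequent

db = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}] * 50   # 200 transactions
print(rsm_frequent(db, [{"a"}, {"a", "b"}, {"c", "d"}], min_support=0.4))
# likely reports {'a'} and {'a', 'b'} as frequent, based only on the 10% sample
```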
27

Heavyweight Pattern Mining in Attributed Flow Graphs

Simoes Gomes, Carolina Unknown Date
No description available.
28

Mining mobile object trajectories: frameworks and algorithms

Han, Binh Thi 12 January 2015 (has links)
The proliferation of mobile devices and advances in geo-positioning technologies have fueled the growth of location-based applications, systems and services. Many location-based applications have now gained high popularity and permeated the daily activities of mobile users. This has led to a huge amount of geo-location data generated on a daily basis, which has drawn significant interest in analyzing and mining ubiquitous location data, especially trajectories of mobile objects moving in road networks (MO trajectories). Mobile trajectories are complex spatio-temporal sequences of location points with varying sample sizes and varying lengths. Mining interesting patterns from large collections of complex MO trajectories presents interesting challenges and opportunities and can reveal valuable insights into human mobility from many perspectives. This dissertation research contributes original ideas and innovative techniques for mining complex trajectories from whole trajectories, from subtrajectories of significant characteristics, and from semantic location sequences within large-scale datasets of MO trajectories. Concretely, the first unique contribution of this dissertation is the development of NEAT, a three-phase road-network-aware trajectory clustering framework that organizes MO subtrajectories into spatial clusters representing highly dense and highly continuous traffic flows in a road network. Compared with existing trajectory clustering approaches, NEAT yields highly accurate clustering results and runs orders of magnitude faster by smartly exploiting traffic locality with respect to the physical constraints of the road network, traffic flows among consecutive road segments, flow-based density of mobile traffic, and road-network-based distances. The second original contribution of this dissertation is the design and development of TraceMob, a methodical and high-performance framework for clustering whole trajectories of mobile objects. To the best of our knowledge, this is the first whole-trajectory clustering system for MO trajectories in road networks. The core idea of TraceMob is a road-network-aware transformation algorithm that maps complex trajectories of varying lengths from a road network space into a multidimensional data space while preserving the relative distances between complex trajectories in the transformed metric space. The third novel contribution is the design and implementation of a fast and effective trajectory pattern mining algorithm, TrajPod. TrajPod can extract the complete set of frequent trajectory patterns from large-scale trajectory datasets by utilizing space-efficient data structures and locality-aware spatial and temporal correlations for computational efficiency. A comprehensive performance study shows that TrajPod outperforms existing sequential pattern mining algorithms by an order of magnitude.
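TraceMob's transformation is road-network aware and is not reproduced here; as a generic, hedged stand-in for the idea of embedding objects into a multidimensional space while approximately preserving their pairwise distances, the sketch below applies classical multidimensional scaling to a toy distance matrix between four trajectories.

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Embed objects into `dim`-dimensional space from a pairwise distance
    matrix, approximately preserving relative distances (classical MDS).
    A generic stand-in, not the dissertation's transformation algorithm."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J               # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]         # keep the top eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# toy pairwise distances between four trajectories (symmetric, zero diagonal)
D = np.array([[0.0, 1.0, 4.0, 4.2],
              [1.0, 0.0, 4.1, 4.0],
              [4.0, 4.1, 0.0, 0.8],
              [4.2, 4.0, 0.8, 0.0]])
points = classical_mds(D)
# trajectories 0 and 1 land close together in the embedding, far from 2 and 3,
# so any standard clustering algorithm can now operate on the points directly
```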
29

Using Differential Sequence Mining to Associate Patterns of Interactions in Concept Mapping Activity with Dimensions of Collaborative Process

January 2015 (has links)
Computer supported collaborative learning (CSCL) has made great inroads into classroom teaching, marked by the use of tools and technologies to support and enhance collaborative learning. Computer mediated learning environments produce large amounts of data capturing student interactions, which can be used to analyze students' learning behaviors (Martinez-Maldonado et al., 2013a). The analysis of the process of collaboration is an active area of research in CSCL. Contributing to this area, Meier et al. (2007) defined nine dimensions and gave a rating scheme to assess the quality of collaboration. This thesis aims to extract and examine frequent patterns of students' interactions that characterize strong and weak groups across these dimensions. To achieve this, an exploratory data mining technique, differential sequence mining, was employed on data from a collaborative concept mapping activity in which collaboration amongst students was facilitated by an interactive tabletop. The results associate frequent patterns of the collaborative concept mapping process with some of the dimensions assessing the quality of collaboration. The analysis associating these patterns with the dimensions of collaboration is theoretically grounded, considering aspects of collaborative learning, concept mapping, communication, group cognition and information processing. The results are preliminary but still demonstrate the potential of associating frequent interaction patterns with strong and weak groups across specific dimensions of collaboration, which is relevant for students, teachers, and researchers monitoring the process of collaborative learning. The frequent patterns for strong groups reflected conformance to the process of conversation for dimensions related to the "communication" aspect of collaboration. In terms of the concept mapping sub-processes, the frequent patterns for strong groups reflect the presentation phase of conversation, with processes like talking and sharing individual maps while constructing the group's concept map, followed by short utterances representing the acceptance phase. For the "joint information processing" aspect of collaboration, the frequent patterns for strong groups were marked by learners building more upon each other's work. In terms of the concept mapping sub-processes, the frequent patterns were marked by learners adding links to each other's concepts or working with each other's concepts while revising the group concept map. / Dissertation/Thesis / Masters Thesis Computer Science 2015
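The sketch below is not the study's differential sequence mining procedure, only a toy illustration of the underlying contrast: count short action n-grams in the logs of strong and weak groups and keep those whose relative frequencies differ most. The action names and the gap threshold are invented.

```python
from collections import Counter

def differential_patterns(strong_logs, weak_logs, length=2, min_gap=0.2):
    """Toy differential sequence mining: compare the relative frequencies of
    contiguous action n-grams between strong-group and weak-group logs."""
    def ngram_freq(logs):
        counts, total = Counter(), 0
        for log in logs:
            for i in range(len(log) - length + 1):
                counts[tuple(log[i:i + length])] += 1
                total += 1
        return {p: c / total for p, c in counts.items()} if total else {}

    f_strong, f_weak = ngram_freq(strong_logs), ngram_freq(weak_logs)
    diffs = {p: f_strong.get(p, 0.0) - f_weak.get(p, 0.0)
             for p in set(f_strong) | set(f_weak)}
    # positive differences characterize strong groups, negative ones weak groups
    return sorted((p, d) for p, d in diffs.items() if abs(d) >= min_gap)

strong = [["talk", "add_concept", "add_link"], ["talk", "add_link", "revise"]]
weak   = [["add_concept", "add_concept", "talk"], ["add_concept", "revise"]]
print(differential_patterns(strong, weak))
```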
30

Graph-based Multi-view Clustering for Continuous Pattern Mining

Åleskog, Christoffer January 2021 (has links)
Background. In many smart monitoring applications, such as smart healthcare, smart buildings, autonomous cars, etc., data are collected from multiple sources and contain information about different perspectives/views of the monitored phenomenon, physical object, or system. In addition, in many of those applications the availability of relevant labelled data is often low or even non-existent. Inspired by this, in this thesis we propose a novel algorithm for multi-view stream clustering. The algorithm can be applied for continuous pattern mining and labeling of streaming data. Objectives. The main objective of this thesis is to develop and implement a novel multi-view stream clustering algorithm. In addition, the potential of the proposed algorithm is studied and evaluated on two datasets: synthetic and real-world. The conducted experiments compare the new algorithm's performance with a single-view clustering algorithm and with a variant that does not transfer knowledge between chunks. Finally, the obtained results are analyzed, discussed and interpreted. Methods. Initially, we study the state-of-the-art multi-view (stream) clustering algorithms. Then we develop our multi-view clustering algorithm for streaming data by implementing a knowledge-transfer feature. We present and explain the developed algorithm in detail, motivating each choice made during the algorithm design phase. Finally, the algorithm configuration, the experimental setup and the datasets chosen for the experiments are presented and motivated. Results. Different configurations of the proposed algorithm have been studied and evaluated under different experimental scenarios on two datasets: synthetic and real-world. The proposed multi-view clustering algorithm demonstrated higher performance on the synthetic data than on the real-world dataset, mainly due to the lower quality of the real-world data used. Conclusions. The proposed algorithm demonstrated higher performance on the synthetic dataset than on the real-world dataset. It can generate high-quality clustering solutions with respect to the used evaluation metrics. In addition, the knowledge-transfer feature has been shown to have a positive effect on the algorithm's performance. A further study of the proposed algorithm on richer and more suitable datasets, e.g., data collected from numerous sensors monitoring some phenomenon, is planned as future work.
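As a hedged sketch of chunk-wise, per-view clustering with knowledge transfer as described above (the mini k-means, the warm-start scheme and all names are assumptions, not the thesis's algorithm):

```python
import numpy as np

def kmeans(X, k, init=None, iters=20, seed=0):
    """Tiny k-means; `init` lets the centroids of the previous chunk seed the
    next one, which is the knowledge-transfer idea in its simplest form."""
    rng = np.random.default_rng(seed)
    C = (X[rng.choice(len(X), k, replace=False)] if init is None else init).astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

def cluster_stream(chunks_per_view, k):
    """chunks_per_view[v][c] is the data matrix of view v in chunk c.
    Each view is clustered chunk by chunk, warm-started from the previous
    chunk; the per-view labels of a chunk could then be fused into one
    clustering (the graph-based fusion itself is not reproduced here)."""
    n_views, n_chunks = len(chunks_per_view), len(chunks_per_view[0])
    prev = [None] * n_views
    all_labels = []
    for c in range(n_chunks):
        chunk_labels = []
        for v in range(n_views):
            prev[v], labels = kmeans(chunks_per_view[v][c], k, init=prev[v])
            chunk_labels.append(labels)
        all_labels.append(chunk_labels)
    return all_labels
```

Seeding each chunk's clustering with the previous chunk's centroids is one simple way to carry knowledge across chunks of the stream; the multi-view fusion step used in the thesis would replace the plain per-view labels returned here.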
