Global ETD Search

11	Mining Time-Changing Data Streams Tao, Yingying January 2011 (has links) Streaming data have gained considerable attention in database and data mining communities because of the emergence of a class of applications, such as financial marketing, sensor networks, internet IP monitoring, and telecommunications that produce these data. Data streams have some unique characteristics that are not exhibited by traditional data: unbounded, fast-arriving, and time-changing. Traditional data mining techniques that make multiple passes over data or that ignore distribution changes are not applicable to dynamic data streams. Mining data streams has been an active research area to address requirements of the streaming applications. This thesis focuses on developing techniques for distribution change detection and mining time-changing data streams. Two techniques are proposed that can detect distribution changes in generic data streams. One approach for tackling one of the most popular stream mining tasks, frequent itemsets mining, is also presented in this thesis. All the proposed techniques are implemented and empirically studied. Experimental results show that the proposed techniques can achieve promising performance for detecting changes and mining dynamic data streams. data stream distribution change Computer Science
12	Novel applications of Association Rule Mining- Data Stream Mining Vithal Kadam, Omkar January 2009 (has links) From the advent of association rule mining, it has become one of the most researched areas of data exploration schemes. In recent years, implementing association rule mining methods in extracting rules from a continuous flow of voluminous data, known as Data Stream has generated immense interest due to its emerging applications such as network-traffic analysis, sensor-network data analysis. For such typical kinds of application domains, the facility to process such enormous amount of stream data in a single pass is critical. Data stream mining Association rule mining
13	Novel applications of Association Rule Mining- Data Stream Mining Vithal Kadam, Omkar January 2009 (has links) From the advent of association rule mining, it has become one of the most researched areas of data exploration schemes. In recent years, implementing association rule mining methods in extracting rules from a continuous flow of voluminous data, known as Data Stream has generated immense interest due to its emerging applications such as network-traffic analysis, sensor-network data analysis. For such typical kinds of application domains, the facility to process such enormous amount of stream data in a single pass is critical. Data stream mining Association rule mining
14	Toward a Data-Type-Based Real Time Geospatial Data Stream Management System Zhang, Chengyang 05 1900 (has links) The advent of sensory and communication technologies enables the generation and consumption of large volumes of streaming data. Many of these data streams are geo-referenced. Existing spatio-temporal databases and data stream management systems are not capable of handling real time queries on spatial extents. In this thesis, we investigated several fundamental research issues toward building a data-type-based real time geospatial data stream management system. The thesis makes contributions in the following areas: geo-stream data models, aggregation, window-based nearest neighbor operators, and query optimization strategies. The proposed geo-stream data model is based on second-order logic and multi-typed algebra. Both abstract and discrete data models are proposed and exemplified. I further propose two useful geo-stream operators, namely Region By and WNN, which abstract common aggregation and nearest neighbor queries as generalized data model constructs. Finally, I propose three query optimization algorithms based on spatial, temporal, and spatio-temporal constraints of geo-streams. I show the effectiveness of the data model through many query examples. The effectiveness and the efficiency of the algorithms are validated through extensive experiments on both synthetic and real data sets. This work established the fundamental building blocks toward a full-fledged geo-stream database management system and has potential impact in many applications such as hazard weather alerting and monitoring, traffic analysis, and environmental modeling. Data stream sensor networks spatial database
15	Integrating Visual Data Flow Programming with Data Stream Management Melander, Lars January 2016 (has links) Data stream management and data flow programming have many things in common. In both cases one wants to transfer possibly infinite sequences of data items from one place to another, while performing transformations to the data. This Thesis focuses on the integration of a visual programming language with a data stream management system (DSMS) to support the construction, configuration, and visualization of data stream applications. In the approach, analyses of data streams are expressed as continuous queries (CQs) that emit data in real-time. The LabVIEW visual programming platform has been adapted to support easy specification of continuous visualization of CQ results. LabVIEW has been integrated with the DSMS SVALI through a stream-oriented client-server API. Query programming is declarative, and it is desirable to make the stream visualization declarative as well, in order to raise the abstraction level and make programming more intuitive. This has been achieved by adding a set of visual data flow components (VDFCs) to LabVIEW, based on the LabVIEW actor framework. With actor-based data flows, visualization of data stream output becomes more manageable, avoiding the procedural control structures used in conventional LabVIEW programming while still utilizing the comprehensive, built-in LabVIEW visualization tools. The VDFCs are part of the Visual Data stream Monitor (VisDM), which is a client-server based platform for handling real-time data stream applications and visualizing stream output. VDFCs are based on a data flow framework that is constructed from the actor framework, and are divided into producers, operators, consumers, and controls. They allow a user to set up the interface environment, customize the visualization, and convert the streaming data to a format suitable for visualization. Furthermore, it is shown how LabVIEW can be used to graphically define interfaces to data streams and dynamically load them in SVALI through a general wrapper handler. As an illustration, an interface has been defined in LabVIEW for accessing data streams from a digital 3D antenna. VisDM has successfully been tested in two real-world applications, one at Sandvik Coromant and one at the Ångström Laboratory, Uppsala University. For the first case, VisDM was deployed as a portable system to provide direct visualization of machining data streams. The data streams can differ in many ways as do the various visualization tasks. For the second case, data streams are homogenous, high-rate, and query operations are much more computation-demanding. For both applications, data is visualized in real-time, and VisDM is capable of sufficiently high update frequencies for processing and visualizing the streaming data without obstructions. The uniqueness of VisDM is the combination of a powerful and versatile DSMS with visually programmed and completely customizable visualization, while maintaining the complete extensibility of both.
16	Um estudo investigativo de algoritmos de regressão para data streams Nunes, André Luís 28 March 2017 (has links) Submitted by JOSIANE SANTOS DE OLIVEIRA (josianeso) on 2017-06-13T14:22:04Z No. of bitstreams: 1 André Luís Nunes_.pdf: 2523682 bytes, checksum: 5e3899cfac6d76db6b2c6ac16b7f5325 (MD5) / Made available in DSpace on 2017-06-13T14:22:04Z (GMT). No. of bitstreams: 1 André Luís Nunes_.pdf: 2523682 bytes, checksum: 5e3899cfac6d76db6b2c6ac16b7f5325 (MD5) Previous issue date: 2017-03-28 / Nenhuma / A explosão no volume de dados e a sua velocidade de expansão tornam as tarefas de descoberta do conhecimento e a análise de dados desafiantes, ainda mais quando consideradas bases não-estacionárias. Embora a predição de valores futuros exerça papel fundamental em áreas como: o clima, problemas de roteamentos e economia, entre outros, a classificação ainda parece ser a tarefa mais explorada. Recentemente, alguns algoritmos voltados à regressão de valores foram lançados, como por exemplo: FIMT-DD, AMRules, IBLStreams e SFNRegressor, entretanto seus estudos investigativos exploraram mais aspectos de inovação e análise do erro de predição, do que explorar suas capacidades mediante critérios apontados como fundamentais para data stream, como tempo de execução e memória. Dessa forma, o objetivo deste trabalho é apresentar um estudo investigativo sobre estes algoritmos que tratam regressão, considerando ambientes dinâmicos, utilizando bases de dados massivas, além de explorar a capacidade de adaptação dos algoritmos com a presença de concept drift. Para isto três bases de dados foram analisadas e estendidas para explorar os principais critérios de avaliação adotados, sendo realizada uma ampla experimentação que produziu uma comparação dos resultados obtidos frente aos algoritmos escolhidos, possibilitando gerar indicativos do comportamento de cada um mediante os diferentes cenários a que foram expostos. Assim, como principais contribuições deste trabalho são destacadas: a avaliação de critérios fundamentais: memória, tempo de execução e poder de generalização, relacionados a regressão para data stream; produção de uma análise crítica dos algoritmos investigados; e a possibilidade de reprodução e extensão dos estudos realizados pela disponibilização das parametrizações empregadas / The explosion of data volume and its expansion speed make tasks of finding knowledge and analyzing data challenging, even more so when non-stationary bases are considered. Although the future values prediction plays a fundamental role in areas such as climate, routing problems and economics, among others, classification seems to be still the most exploited task. Recently, some value-regression algorithms have been launched, for example: FIMT-DD, AMRules, IBLStreams and SFNRegressor; however, their investigative studies have explored more aspects of innovation and analysis of error prediction than exploring their capabilities through criteria that are considered fundamental to data stream, such as elapsed time and memory. In this way, the objective of this work is to present an investigative study about these algorithms that treat regression considering dynamic environments, using massive databases, and also explore the algorithm's adaptability capacity with the presence of concept drift. In order to do this, three databases were analyzed and extended to explore the main evaluation criteria adopted. A wide experiment was carried out, which produced a comparison of the results obtained with the chosen algorithms, allowing to generate behavior indication of each one through the different scenarios to which were exposed. Thus, the main contributions of this work are: evaluation of fundamental criteria: memory, execution time and power of generalization, related to regression to data stream; production of a critical analysis of the algorithms investigated; and the possibility of reproducing and extending the studies carried out by making available the parametrizations applyed. Mineração de data stream Regressão Concept drift Data stream mining Regression
17	Leistungsoptimierung der persistenten Datenverwaltung in DSP-Architekturen zur Live-Analyse von Sensordaten Weißbach, Manuel 28 October 2021 (has links) Aufgrund der in vielen Bereichen stets wachsenden Menge an zu verarbeitenden Daten haben sich Big-Data-Anwendungen in den letzten Jahren zunehmend verbreitet. Twitter gab bereits im Jahr 2011 an, täglich 15 Millionen URLs in Echtzeit zu untersuchen, um die Verbreitung von Spamlinks zu unterbinden [1]. Facebook verarbeitet pro Minute über vier Millionen „Gefällt mir“-Klicks und verwaltet über 300 Petabyte Daten [2]. Über das Businessportal LinkedIn wurden 2011 rund eine Milliarde Nachrichten pro Tag zugestellt, 2015 waren es laut Angaben des Unternehmens bereits 1,1 Billionen täglich versendete Nachrichten [3]. Diesem starken Anstieg liegt ein exponentielles Wachstum zugrunde, das für Big Data typisch ist. Gartner definiert den Begriff „Big Data“ auf Basis seiner spezifischen Eigenschaften, die in englischer Sprache auch als die „drei V´s“ bezeichnet werden: „Volume“, „Variety“ und „Velocity“ [4]. Neben der enormen Menge an zu verarbeitenden Daten („Volume“) und ihrer Vielfalt und Unstrukturiertheit („Variety“), ist demnach auch die Geschwindigkeit („Velocity“), in der die Daten generiert werden, ein wesentliches Merkmal von Big Data [5, 6]. Soll trotz der ständigen und immer schneller werdenden Generierung neuer Daten ein Verarbeitungsrückstau vermieden werden, so folgt daraus auch die Notwendigkeit, die kontinuierlich wachsenden Datenmengen immer schneller zu verarbeiten. info:eu-repo/classification/ddc/004 ddc:004
18	Outlier Detection In Big Data Cao, Lei 29 March 2016 (has links) The dissertation focuses on scaling outlier detection to work both on huge static as well as on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention, network intrusion detection to stock investment tactical planning. For such mission critical applications, a timely response often is of paramount importance. Yet the processing of outlier detection requests is of high algorithmic complexity and resource consuming. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular caused by the high velocity of streaming data, the big volume of static data and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. In this dissertation we first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high volume continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of the outliers and the temporal priority relationships among stream data points. Leveraging these two principles LEAP not only is able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of the data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. In this research, our approach optimizes key factors determining the efficiency of distributed data analytics, namely, communication costs and load balancing. In particular we prove the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one detection algorithm for all compute nodes approach and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request instantiated with a particular parameter setting one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to hone in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts to quickly extract, interpret and understand the outliers of interest. Our experimental studies including performance evaluation and user studies conducted on real world datasets including stock, sensor, moving object, and Geolocation datasets confirm both the effectiveness and efficiency of the proposed approaches. big data outlier detection data stream distributed algorithm data analytics
19	An Obstruction-Check Approach to Mining Closed Sequential Patterns in Data Streams Chin, Tsz-lin 21 June 2010 (has links) Online mining sequential patterns over data streams is an important problem in data mining. There are many applications of using sequential patterns in data streams, such as market analysis, network security, sensor networks and web track- ing. Previous studies have shown mining closed patterns provides more beneﬁts than mining the complete set of frequent patterns, since closed pattern mining leads to compact results. A sequential pattern is closed if the sequential pattern does not have any supersequence which has the same support. Chang et al. proposed a time- based sliding window model. The time-based sliding window has two features, the new item is inserted in front of a sequence, and the obsolete item is removed from of tail of a sequence. For solving the problem of data mining in the time-based sliding window, Chang et al. proposed an algorithm called SeqStream. It uses a data struc- ture IST (Inverse Closed Sequence Tree) to keep the result. IST can incrementally be updated by the SeqStream algorithm. Although the SeqStream algorithm has used the technique of dividing the time-based sliding window to speed up the updating of IST, the SeqStream algorithm still scans the sliding window many times when IST needs to be updated. In this thesis, we propose an obstruction-check approach to maintain the result of closed sequential patterns. Our approach is designed based on the lattice structure. The feature of the lattice structure is that the parent is a supersequence of its children. By utilizing this feature, we decide the obstruction link between the parent and child if their support is the same. If a node does not have any obstruction link parent, the node is a closed sequential pattern. Then we can utilize this feature to locally travel the lattice structure. Moreover, we can fully utilize the features of the time-based sliding window model to locally travel the lat- tice structure. Based on the lattice structure, we propose the EULB (Exact Update based on Lattice structure with Bit stream)-Lattice algorithm. The EULB-Lattice algorithm is an exact method for mining data streams. We record additional informa- tion, instead of scanning the entire sliding window. We conduct several experiments using diﬀerent synthetic data sets. The simulation results show that the proposed algorithm outperforms the SeqStream algorithm. Closed Sequential Pattern Lattice Sliding Window Sequential Pattern Data Stream
20	A Set-Checking Algorithm for Mining Maximal Frequent Itemsets from Data Streams Lin, Pei-Ying 15 July 2011 (has links) Online mining the maximal frequent itemsets over data streams is an important problem in data mining. The maximal frequent itemset is the itemset which the support is large or equal to the minimal support and the itemset is not the subset or superse of each itemset. Previous algorithms to mine the maximal frequent itemsets in the traditional database are not suitable for data streams. Because data streams have some characteristics: (1) continuous (2) fast (3) no data limit (4) real time (5) searching once, mining data streams have many new challenges. First, they are unrealistic to keep the entire stream in the main memory or even in a secondary storage area, since a data stream comes continuously and the amount of data is unbounded. Second, traditional methods of mining on stored datasets by multiple scans are infeasible, since the streaming data is passed only once. Third, mining streams requires fast, real-time processing in order to keep up with the high data arrival rate and mining results are expected to be available within short response time. In order to solve mining maximal frequent itemsets from data streams using the landmark window model, Mao et. al. propose the INSTANT algorithm. In the landmark window model, knowledge discovery is performed based on the values between the beginning time and the present. The advantage of using the landmark window model is that the results are correct as compared to the other models. The structure of the INSTANT algorithm is simple and it can save many memory space. But it takes long time in mining the maximal frequent itemsets. When the new transactions comes, the number of comparisons between the old transactions of INSATNT algorithm is too much. In this thesis, we propose the Set-Checking algorithm to mine frequent itemsets from data streams using the landmark window model. We use the structure of lattice to store our information. The structure of lattice records the subset relationship between the child node and the father node. For every node, we can record the itemset and the support. When the new transaction comes, we consider five relations: (1) equivalent (2) superset (3) subset (4) intersection (5) empty relations. According to the lattice structure of the five sets , we can add the transaction and the renew support efficiently. From our simulation result, we find that the process time of our Set-Checking algorithm is faster than that of the INSTANT algorithm. Itemset Landmark Window Model Lattice Maximal Frequent Itemset Data Stream

Search results