Global ETD Search

1	Mining and Managing Neighbor-Based Patterns in Data Streams Yang, Di 09 January 2012 (has links) The current data-intensive world is continuously producing huge volumes of live streaming data through various kinds of electronic devices, such as sensor networks, smart phones, GPS and RFID systems. To understand these data sources and thus better leverage them to serve human society, the demands for mining complex patterns from these high speed data streams have significantly increased in a broad range of application domains, such as financial analysis, social network analysis, credit fraud detection, and moving object monitoring. In this dissertation, we present a framework to tackle the mining and management problem for the family of neighbor-based patterns in data streams, which covers a broad range of popular pattern types, including clusters, outliers, k-nearest neighbors and others. First, we study the problem of efficiently executing single neighbor-based pattern mining queries. We propose a general optimization principle for incremental pattern maintenance in data streams, called "Predicted Views". This general optimization principle exploits the "predictability" of sliding window semantics to eliminate both the computational and storage effort needed for handling the expiration of stream objects, which usually constitutes the most expensive operations for incremental pattern maintenance. Second, the problem of multiple query optimization for neighbor-based pattern mining queries is analyzed, which aims to efficiently execute a heavy workload of neighbor-based pattern mining queries using shared execution strategies. We present an integrated pattern maintenance strategy to represent and incrementally maintain the patterns identified by queries with different query parameters within a single compact structure. Our solution realizes fully shared execution of multiple queries with arbitrary parameter settings. 
Third, the problem of summarization and matching for neighbor-based patterns is examined. To solve this problem, we first propose a summarization format for each pattern type. Then, we present computation strategies that efficiently summarize the neighbor-based patterns either during or after the online pattern extraction process. Lastly, to compare patterns extracted on different time horizons of the stream, we design an efficient matching mechanism to identify similar patterns in the stream history for any given pattern of interest to an analyst. Our comprehensive experimental studies, using both synthetic and real data from the domains of stock trades and moving object monitoring, demonstrate the superiority of our proposed strategies over alternative methods in both effectiveness and efficiency. Algorithm Streaming Data Query Processing Data Mining
2	Label Free Change Detection on Streaming Data with Cooperative Multi-objective Genetic Programming Rahimi, Sara 09 August 2013 (has links) Classification under streaming data conditions requires that the machine learning approach operate interactively with the stream content. Thus, given some initial machine learning classification capability, it is not possible to assume that the process 'generating' stream content will be stationary. It is therefore necessary to first detect when the stream content changes. Only after detecting a change can classifier retraining be triggered. Current methods for change detection tend to assume an entropy filter approach, where class labels are necessary. In practice, labeling the stream would be extremely expensive. This work proposes an approach in which the behavior of genetic programming (GP) individuals is used to detect change without the use of labels. Only after detecting a change is label information requested. Benchmarking under three computer network traffic analysis scenarios demonstrates that the proposed approach performs at least as well as the filter method, while retaining the advantage of requiring no labels. Change Detection Streaming Data Genetic Programming
3	Implementace uživatelsky orientované vizualizační platformy pro proudová data / Implementation of a user-centered visualization platform for stream data Balliu, Ilda January 2020 (has links) As enterprise solutions grow more complex, so does the need to monitor and maintain them. SAP Concur offers various services and applications across different environments and data centers. For all these applications and their underlying services, different Application Performance Management (APM) tools are in place for monitoring. However, from an incident management point of view, when a problem occurs it is time consuming and inefficient to go through different tools in order to identify the issue. This thesis proposes a custom, centralized APM solution that gathers metrics and raw data from multiple sources and visualizes them in real time in a unified health dashboard called Pulse. To fit this solution to the needs of service managers and product owners, Pulse will go through several phases of usability testing; after each phase, new requirements will be implemented and tested again until a final design fits the needs of the target users.
4	[pt] MODELOS ESTATÍSTICOS COM PARÂMETROS VARIANDO SEGUNDO UM MECANISMO ADAPTATIVO / [en] STATISTICAL MODELS WITH PARAMETERS CHANGING THROUGH AN ADAPTIVE MECHANISM HENRIQUE HELFER HOELTGEBAUM 23 October 2019 (has links) [en] This thesis is composed of three papers whose common ground is statistical models with time-varying parameters. All of them adopt a framework that uses a data-driven mechanism to update the model parameters. The first paper explores the application of a new class of non-Gaussian time series models named Generalized Autoregressive Score (GAS) models. In this class of models, the parameters are updated using the score of the predictive density. We motivate the use of GAS models by simulating joint scenarios of wind power generation. 
In the last two papers, Stochastic Gradient Descent (SGD) is adopted to update time-varying parameters. This methodology uses the derivative of a user-specified cost function to drive the optimization. The developed framework is designed to be applied in a streaming data context; therefore, adaptive filtering techniques are explored to account for concept drift. We explore this framework on cyber-security and instrumented infrastructure applications. [pt] APRENDIZADO DE MAQUINA [pt] STREAMING DATA [pt] COPULA DINAMICA [pt] FILTRAGEM ADAPTATIVA [en] MACHINE LEARNING [en] STREAMING DATA [en] DYNAMIC COPULATION [en] ADAPTIVE FILTERING
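As a rough illustration of the SGD-driven parameter updates summarized in entry 4, the sketch below tracks a single time-varying parameter over a stream. It is not taken from the thesis: the function name `sgd_track`, the squared-error cost, and the constant learning rate are assumptions chosen only to make the idea concrete.

```python
def sgd_track(stream, learning_rate=0.1, theta0=0.0):
    """Track a time-varying level with one SGD step per observation.

    Assumed cost (hypothetical, for illustration only):
        cost(theta; x) = 0.5 * (x - theta) ** 2
    whose derivative with respect to theta is -(x - theta).
    """
    theta = theta0
    history = []
    for x in stream:
        gradient = -(x - theta)            # derivative of the assumed cost
        theta -= learning_rate * gradient  # single gradient step per data point
        history.append(theta)
    return history


if __name__ == "__main__":
    # A stream whose level shifts abruptly, mimicking concept drift.
    drifting_stream = [0.0] * 50 + [10.0] * 50
    estimates = sgd_track(drifting_stream, learning_rate=0.2)
    print(estimates[49], estimates[-1])
```

With a constant learning rate, older observations are discounted geometrically, so the estimate can follow the stream after a change; a larger rate adapts faster but reacts more strongly to noise.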
5	A Spreadsheet Model for Using Web Services and Creating Data-Driven Applications Chang, Kerry Shih-Ping 01 April 2016 (has links) Web services have made many kinds of data and computing services available. However, using web services often requires significant programming effort, which limits the people who can take advantage of them to a small group of skilled programmers. In this dissertation, I present a tool called Gneiss that extends the spreadsheet model to support four challenging aspects of using web services: programming two-way data communications with web services, creating interactive GUI applications that use web data sources, using hierarchical data, and using live streaming data. Gneiss contributes innovations in spreadsheet languages, spreadsheet user interfaces and interaction techniques that allow programming tasks that currently require writing complex, lengthy code to instead be done using familiar spreadsheet mechanisms. Spreadsheets are arguably the most successful and popular data tools among people of all programming levels. This work advances the use of spreadsheets to new domains and could benefit a wide range of users, from professional programmers to end-user programmers. spreadsheets web services data-driven applications web applications streaming data hierarchical data
6	Exploratory Visualization of Data Pattern Changes in Multivariate Data Streams Xie, Zaixian 21 October 2011 (has links) " More and more researchers are focusing on the management, querying and pattern mining of streaming data. The visualization of streaming data, however, is still a very new topic. Streaming data is very similar to time-series data since each datapoint has a time dimension. Although the latter has been well studied in the area of information visualization, a key characteristic of streaming data, unbounded and large-scale input, is rarely investigated. Moreover, most techniques for visualizing time-series data focus on univariate data and seldom convey multidimensional relationships, which is an important requirement in many application areas. Therefore, it is necessary to develop appropriate techniques for streaming data instead of directly applying time-series visualization techniques to it. As one of the main contributions of this dissertation, I introduce a user-driven approach for the visual analytics of multivariate data streams based on effective visualizations via a combination of windowing and sampling strategies. To help users identify and track how data patterns change over time, not only the current sliding window content but also abstractions of past data in which users are interested are displayed. Sampling is applied within each single time window to help reduce visual clutter as well as preserve data patterns. Sampling ratios scheduled for different windows reflect the degree of user interest in the content. A degree of interest (DOI) function is used to represent a user's interest in different windows of the data. Users can apply two types of pre-defined DOI functions, namely RC (recent change) and PP (periodic phenomena) functions. The developed tool also allows users to interactively adjust DOI functions, in a manner similar to transfer functions in volume visualization, to enable a trial-and-error exploration process. 
In order to visually convey the change of multidimensional correlations, four layout strategies were designed. User studies showed that three of these are effective techniques for conveying data pattern changes compared to traditional time-series data visualization techniques. Based on this evaluation, a guide for the selection of appropriate layout strategies was derived, considering the characteristics of the targeted datasets and data analysis tasks. Case studies were used to show the effectiveness of DOI functions and the various visualization techniques. A second contribution of this dissertation is a data-driven framework to merge and thus condense time windows having small or no changes and distort the time axis. Only significant changes are shown to users. Pattern vectors are introduced as a compact format for representing the discovered data model. Three views, juxtaposed views, pattern vector views, and pattern change views, were developed for conveying data pattern changes. The first shows more details of the data but needs more canvas space; the last two need much less canvas space because they convey only the pattern parameters, but lose many data details. The experiments showed that the proposed merge algorithms preserve more change information than an intuitive pattern-blind averaging. A user study was also conducted to confirm that the proposed techniques can help users find pattern changes more quickly than via a non-distorted time axis. A third contribution of this dissertation is the history views, which were developed with related interaction techniques to work under two modes: non-merge and merge. In the former mode, the framework can use natural hierarchical time units or ones defined by domain experts to represent timelines. This can help users navigate across long time periods. Grid or virtual calendar views were designed to provide a compact overview of the history data. 
In addition, MDS pattern starfields, distance maps, and pattern brushes were developed to enable users to quickly investigate the degree of pattern similarity among different time periods. For the merge mode, merge algorithms were applied to selected time windows to generate a merge-based hierarchy. The contiguous time windows having similar patterns are merged first. Users can choose different levels of merging with the tradeoff between more details in the data and less visual clutter in the visualizations. The usability evaluation demonstrated that most participants could understand the concepts of the history views correctly and finished assigned tasks with a high accuracy and relatively fast response time. " Multivariate Data Visualization Streaming Data Visualization Time-series Data Visualization Data streams
7	Machine Learning Methods for Network Intrusion Detection and Intrusion Prevention Systems Stefanova, Zheni Svetoslavova 03 July 2018 (has links) Given the continuing advancement of networking applications and our increased dependence upon software-based systems, there is a pressing need to develop improved security techniques for defending modern information technology (IT) systems from malicious cyber-attacks. Indeed, anyone can be impacted by such activities, including individuals, corporations, and governments. Furthermore, the sustained expansion of the network user base and its associated set of applications is also introducing additional vulnerabilities which can lead to criminal breaches and loss of critical data. As a result, the broader cybersecurity problem area has emerged as a significant concern, with many solution strategies being proposed for both intrusion detection and prevention. Now in general, the cybersecurity dilemma can be treated as a conflict-resolution setup entailing a security system and minimum of two decision agents with competing goals (e.g., the attacker and the defender). Namely, on the one hand, the defender is focused on guaranteeing that the system operates at or above an adequate (specified) level. Conversely, the attacker is focused on trying to interrupt or corrupt the system’s operation. In light of the above, this dissertation introduces novel methodologies to build appropriate strategies for system administrators (defenders). In particular, detailed mathematical models of security systems are developed to analyze overall performance and predict the likely behavior of the key decision makers influencing the protection structure. The initial objective here is to create a reliable intrusion detection mechanism to help identify malicious attacks at a very early stage, i.e., in order to minimize potentially critical consequences and damage to system privacy and stability. 
Furthermore, another key objective is also to develop effective intrusion prevention (response) mechanisms. Along these lines, a machine learning based solution framework is developed consisting of two modules. Specifically, the first module prepares the system for analysis and detects whether or not there is a cyber-attack. Meanwhile, the second module analyzes the type of the breach and formulates an adequate response. Namely, a decision agent is used in the latter module to investigate the environment and make appropriate decisions in the case of uncertainty. This agent starts by conducting its analysis in a completely unknown milieu but continually learns to adjust its decision making based upon the provided feedback. The overall system is designed to operate in an automated manner without any intervention from administrators or other cybersecurity personnel. Human input is essentially only required to modify some key model (system) parameters and settings. Overall, the framework developed in this dissertation provides a solid foundation from which to develop improved threat detection and protection mechanisms for static setups, with further extensibility for handling streaming data. Intrusion Detection Machine Learning Network Security Q-learning Streaming data Computer Sciences Statistics and Probability
8	Efficient Handling of Narrow Width and Streaming Data in Embedded Applications Li, Bengu January 2006 (has links) The embedded environment imposes severe constraints on the system resources available to embedded applications. Performance, memory footprint, and power consumption are critical factors for embedded applications. Meanwhile, the data in embedded applications demonstrate unique properties. More specifically, narrow width data are data representable in considerably fewer bits than one word, which nevertheless occupy an entire register or memory word, and streaming data are input data processed by an application sequentially, which stay in the system for a short duration and thus exhibit little data locality. Narrow width and streaming data affect the efficiency of registers, caches, and memory and must be taken into account when optimizing for performance, memory footprint, and power consumption. This dissertation proposes methods to efficiently handle narrow width and streaming data in embedded applications. Quantitative measurements of narrow width and streaming data are performed to provide guidance for optimizations. Novel architectural features and associated compiler algorithms are developed. To efficiently handle narrow width data in registers, two register allocation schemes are proposed for the ARM processor to allocate two narrow width variables to one register. A static scheme exploits maximum bitwidth. A speculative scheme further exploits dynamic bitwidth. Both result in reduced spill cost and performance improvement. To efficiently handle narrow width data in memory, a memory layout method is proposed to coalesce multiple narrow width data in one memory location in a DSP processor, leading to fewer explicit address calculations. This method improves performance and shrinks memory footprint. To efficiently handle streaming data in network processors, two cache mechanisms are proposed to enable the reuse of data and computation. 
The slack created is further transformed into a reduction in energy consumption through a fetch gating mechanism. compiler computer architecture narrow width data streaming data embedded system embedded application
9	Erbium : Reconciling languages, runtimes, compilation and optimizations for streaming applications Miranda, Cupertino 11 February 2013 (has links) (PDF) As transistor size and power limitations struck the computer industry, hardware parallelism arose as the solution, bringing old, forgotten problems back into the equation in solving the limitations of current parallel technologies. Compilers regain focus as the most relevant piece of the puzzle in the quest for the performance improvements predicted by Moore's law, which are no longer possible without parallelism. Research on parallelism focuses mainly on either language or architectural aspects and does not give compiler problems the attention they need; this is the reason for the weak compiler support in many parallel languages and architectures, which prevents performance from being exploited fully. This thesis addresses these problems by presenting: Erbium, a low-level streaming data-flow language supporting multiple producer and consumer task communication; a very efficient runtime implementation for x86 architectures that also addresses other types of architectures; a compiler integration of the language as an intermediate representation in GCC; and a study of the dependencies of the language primitives, allowing compilers to further optimise Erbium code not only through specific parallel optimisations but also through traditional compiler optimisations, such as partial redundancy elimination and dead code elimination. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Streaming data-flow Intermediate representation Compilation Optimisations Runtime
10	TupleSearch : A scalable framework based on sketches to process and store streaming temporal data for real time analytics Karlsson, Henrik January 2017 (has links) In many fields, there is a need for quick analysis of data. As the number of devices connected to the Internet grows, so does the amount of data generated. The traditional way of analyzing large amounts of data has been batch processing, where already collected data is processed. This process is time consuming, which has led to another emerging trend: stream processing. Stream processing means processing and storing data as it arrives and is best carried out in main memory; the velocity, volume and variation of the data make this a big challenge. This thesis focuses on developing a framework for processing and storing streaming temporal data so that the data can be analyzed in real time. For this purpose, a server application was created consisting of approximate in-memory data synopses, called sketches, to process and store the input data. Furthermore, a client web application was created to query and analyze the data. The results show that the framework can support simple aggregate queries with constant query time regardless of the volume of data. Also, it can process data 6.8 times faster than a traditional database system. All this implies that the system is scalable, while at the same time it comes with a query error vs. memory trade-off. For a distribution of ~3,000,000 unique items, it was concluded that the framework can provide very accurate answers, with an error rate of less than 1.1%, for the trendiest data using about 100 times less space than the actual size of the data set. Streaming Data Stream Processing Count-Min Sketch Time Adaptive Sketches Computer Engineering Datorteknik
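The Count-Min Sketch named in the keywords of entry 10 can be illustrated with a minimal, generic version. This is a sketch under stated assumptions, not the TupleSearch implementation: the class name, the default `width` and `depth`, and the use of salted BLAKE2b hashes are all choices made here for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed depth x width counter table."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting the digest with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"), salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Update cost depends only on depth, not on how much data has been seen.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["a", "b", "a", "a", "b", "a", "a"]:
        sketch.add(word)
    print(sketch.estimate("a"), sketch.estimate("b"))
```

Memory use is fixed by `width` and `depth`, and each query touches only `depth` counters, which is consistent with the constant query time and the error vs. memory trade-off described in the abstract: a wider table lowers the expected overestimate at the cost of more space.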

Search results