1 |
Mining and Managing Neighbor-Based Patterns in Data Streams
Yang, Di, 09 January 2012
The current data-intensive world is continuously producing huge volumes of live streaming data through various kinds of electronic devices, such as sensor networks, smartphones, GPS and RFID systems. To understand these data sources and thus better leverage them to serve human society, the demand for mining complex patterns from these high-speed data streams has significantly increased in a broad range of application domains, such as financial analysis, social network analysis, credit fraud detection, and moving object monitoring. In this dissertation, we present a framework to tackle the mining and management problem for the family of neighbor-based patterns in data streams, which covers a broad range of popular pattern types, including clusters, outliers, k-nearest neighbors and others. First, we study the problem of efficiently executing single neighbor-based pattern mining queries. We propose a general optimization principle for incremental pattern maintenance in data streams, called "Predicted Views". This general optimization principle exploits the "predictability" of sliding window semantics to eliminate both the computational and storage effort needed for handling the expiration of stream objects, which is usually the most expensive operation in incremental pattern maintenance. Second, we analyze the problem of multiple query optimization for neighbor-based pattern mining queries, which aims to efficiently execute a heavy workload of neighbor-based pattern mining queries using shared execution strategies. We present an integrated pattern maintenance strategy to represent and incrementally maintain the patterns identified by queries with different query parameters within a single compact structure. Our solution realizes fully shared execution of multiple queries with arbitrary parameter settings. Third, we examine the problem of summarization and matching for neighbor-based patterns.
To solve this problem, we first propose a summarization format for each pattern type. Then, we present computation strategies that efficiently summarize the neighbor-based patterns either during or after the online pattern extraction process. Lastly, to compare patterns extracted over different time horizons of the stream, we design an efficient matching mechanism to identify similar patterns in the stream history for any given pattern of interest to an analyst. Our comprehensive experimental studies, using both synthetic and real data from the domains of stock trades and moving object monitoring, demonstrate the superiority of our proposed strategies over alternative methods in both effectiveness and efficiency.
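As an illustrative aside, the "predictability" of sliding-window semantics can be sketched in a few lines. The sketch below is an assumed simplification, not the dissertation's actual algorithm: a count-based window of `w` slides over 2-D points, where every point's expiration slide is known on arrival, so each point carries one predicted neighbor count per future view and expiring a slide is a plain pop with no neighbor recomputation.

```python
import math
from collections import deque

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

class PredictedViews:
    """Illustrative sketch of view prediction under sliding windows
    (assumed design, not the dissertation's exact data structure)."""

    def __init__(self, w, radius):
        self.w, self.radius = w, radius
        self.t = 0                 # absolute slide number
        self.slides = deque()      # (expiry_slide, [[point, counts_by_slide], ...])

    def new_slide(self, points):
        self.t += 1
        if len(self.slides) == self.w:
            self.slides.popleft()              # expiration: no recomputation
        expiry = self.t + self.w - 1           # last slide this batch is alive
        new = [[p, {}] for p in points]
        # Predict neighbor counts for every future view both endpoints share.
        for old_expiry, old in self.slides:
            for op, oc in old:
                for np_, nc in new:
                    if dist(op, np_) <= self.radius:
                        for v in range(self.t, old_expiry + 1):
                            oc[v] = oc.get(v, 0) + 1
                            nc[v] = nc.get(v, 0) + 1
        # Same-slide neighbors expire together, so they share all views.
        for a in range(len(new)):
            for b in range(a + 1, len(new)):
                if dist(new[a][0], new[b][0]) <= self.radius:
                    for v in range(self.t, expiry + 1):
                        new[a][1][v] = new[a][1].get(v, 0) + 1
                        new[b][1][v] = new[b][1].get(v, 0) + 1
        self.slides.append((expiry, new))

    def neighbor_counts(self):
        """Current neighbor count of every live point, oldest slide first."""
        return [c.get(self.t, 0) for _, s in self.slides for _, c in s]
```

Because each count is pre-computed for the view in which it will be read, window expiration never triggers a neighbor search, which is the point of exploiting sliding-window predictability.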
|
2 |
Label Free Change Detection on Streaming Data with Cooperative Multi-objective Genetic Programming
Rahimi, Sara, 09 August 2013
Classification under streaming data conditions requires that the machine learning approach operate interactively with the stream content. Thus, given some initial machine learning classification capability, it is not possible to assume that the process 'generating' stream content will be stationary. It is therefore necessary to first detect when the stream content changes. Only after detecting a change can classifier retraining be triggered. Current methods for change detection tend to assume an entropy filter approach, where class labels are necessary. In practice, labeling the stream would be extremely expensive. This work proposes an approach in which the behavior of GP individuals is used to detect change without the use of labels. Only after detecting a change is label information requested. Benchmarking under three computer network traffic analysis scenarios demonstrates that the proposed approach performs at least as well as the filter method, while retaining the advantage of requiring no labels.
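The general idea of label-free change detection can be sketched as follows. This is not the paper's GP-based method; it is an assumed, simplified stand-in: instead of entropy over true labels, we track the distribution of the model's own outputs on the stream and flag a change when that distribution drifts away from a reference window. Labels would be requested only after a flag.

```python
from collections import deque

class LabelFreeChangeDetector:
    """Sketch of label-free drift detection (assumed design): compare a
    reference window of model outputs with a recent window using total
    variation distance, with no access to ground-truth labels."""

    def __init__(self, n_classes, ref_size=100, threshold=0.3):
        self.n = n_classes
        self.ref = deque(maxlen=ref_size)    # reference predictions
        self.cur = deque(maxlen=ref_size)    # most recent predictions
        self.threshold = threshold

    def _hist(self, preds):
        h = [0.0] * self.n
        for p in preds:
            h[p] += 1
        total = sum(h) or 1.0
        return [c / total for c in h]

    def observe(self, prediction):
        """Feed one (unlabeled) model output; return True if drift is flagged."""
        if len(self.ref) < self.ref.maxlen:
            self.ref.append(prediction)      # still building the reference
            return False
        self.cur.append(prediction)
        if len(self.cur) < self.cur.maxlen:
            return False
        # Total variation distance between reference and recent histograms.
        tv = 0.5 * sum(abs(a - b)
                       for a, b in zip(self._hist(self.ref), self._hist(self.cur)))
        return tv > self.threshold
```

In a full system, a True return would trigger the (expensive) label request and retraining step, mirroring the workflow described in the abstract.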
|
3 |
Implementace uživatelsky orientované vizualizační platformy pro proudová data / Implementation of a user-centered visualization platform for stream data
Balliu, Ilda, January 2020
With the increase in complexity of enterprise solutions, the need to monitor and maintain them increases as well. SAP Concur offers various services and applications across different environments and data centers. For all these applications and the services underneath them, different Application Performance Management (APM) tools are in place for monitoring. However, from an incident management point of view, in case of a problem it is time-consuming and inefficient to go through different tools in order to identify the issue. This thesis proposes a solution: a custom, centralized APM which gathers metrics and raw data from multiple sources and visualizes them in real time in a unified health dashboard called Pulse. In order to fit this solution to the needs of service managers and product owners, Pulse will go through different phases of usability tests; after each phase, new requirements will be implemented and tested again until there is a final design that fits the needs of the target users.
|
4 |
[pt] MODELOS ESTATÍSTICOS COM PARÂMETROS VARIANDO SEGUNDO UM MECANISMO ADAPTATIVO / [en] STATISTICAL MODELS WITH PARAMETERS CHANGING THROUGH AN ADAPTIVE MECHANISM
HENRIQUE HELFER HOELTGEBAUM, 23 October 2019
[pt] Esta tese é composta de três artigos cujo elo comum são modelos estatísticos com parâmetros variantes no tempo. Todos os artigos adotam um arcabouço que utiliza um mecanismo guiado pelos dados para a atualização dos parâmetros dos modelos. O primeiro explora a aplicação de uma nova classe de modelos de séries temporais não Gaussianas denominada modelos Generalized Autoregressive Score (GAS). Nessa classe de modelos, os parâmetros são atualizados utilizando o score da densidade preditiva. Motivamos o uso de modelos GAS simulando cenários conjuntos de fator de capacidade eólica. Nos últimos dois artigos, o gradiente descendente estocástico (SGD) é adotado para atualizar os parâmetros que variam no tempo. Tal metodologia utiliza a derivada de uma função custo especificada pelo usuário para guiar a otimização. A estrutura desenvolvida foi projetada para ser aplicada em um contexto de fluxo contínuo de dados; portanto, técnicas de filtragem adaptativa são exploradas para levar em consideração o concept-drift. Exploramos esse arcabouço com aplicações em segurança cibernética e infraestrutura instrumentada. / [en] This thesis is composed of three papers whose common ground is statistical models with time-varying parameters. All of them adopt a framework that uses a data-driven mechanism to update the model coefficients. The first paper explores the application of a new class of non-Gaussian time series models named Generalized Autoregressive Score (GAS) models. In this class of models, the parameters are updated using the score of the predictive density. We motivate the use of GAS models by simulating joint scenarios of wind power generation. In the last two papers, Stochastic Gradient Descent (SGD) is adopted to update the time-varying parameters. This methodology uses the derivative of a user-specified cost function to drive the optimization. The developed framework is designed to be applied in a streaming-data context; therefore, adaptive filtering techniques are explored to account for concept drift. We explore this framework in cyber-security and instrumented-infrastructure applications.
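To make the score-driven update concrete, here is a minimal sketch of a GAS-style filter for a Gaussian location parameter. The coefficients and starting value are assumptions for illustration, not the thesis's estimates: with inverse-information scaling, the scaled score of N(f, sigma^2) with respect to f is simply the prediction error s_t = y_t - f_t, and the parameter evolves as f_{t+1} = omega + A*s_t + B*f_t.

```python
def gas_location_filter(ys, omega=0.0, A=0.3, B=0.95, f0=0.0):
    """Sketch of a score-driven (GAS) update for a time-varying Gaussian
    mean; omega, A, B and f0 are illustrative, not fitted values."""
    f = f0
    path = []
    for y in ys:
        path.append(f)           # one-step-ahead filtered level
        s = y - f                # scaled score of the predictive density
        f = omega + A * s + B * f
    return path
```

The SGD-based papers follow the same template with the score replaced by the gradient of a user-specified cost function, optionally with an adaptive forgetting factor to handle concept drift.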
|
5 |
A Spreadsheet Model for Using Web Services and Creating Data-Driven Applications
Chang, Kerry Shih-Ping, 01 April 2016
Web services have made many kinds of data and computing services available. However, to use web services often requires significant programming efforts and thus limits the people who can take advantage of them to only a small group of skilled programmers. In this dissertation, I will present a tool called Gneiss that extends the spreadsheet model to support four challenging aspects of using web services: programming two-way data communications with web services, creating interactive GUI applications that use web data sources, using hierarchical data, and using live streaming data. Gneiss contributes innovations in spreadsheet languages, spreadsheet user interfaces and interaction techniques to allow programming tasks that currently require writing complex, lengthy code to instead be done using familiar spreadsheet mechanisms. Spreadsheets are arguably the most successful and popular data tools among people of all programming levels. This work advances the use of spreadsheets to new domains and could benefit a wide range of users from professional programmers to end-user programmers.
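The core spreadsheet mechanism at work here can be sketched in miniature. The toy below is not Gneiss itself; it is an assumed illustration of the underlying idea: a formula cell re-evaluates automatically whenever a cell it depends on receives a new streamed value, which is what lets live streaming data drive a spreadsheet without explicit event-handling code.

```python
class MiniSheet:
    """Toy dependency-tracking spreadsheet (illustrative, not Gneiss):
    streamed values propagate to formula cells automatically."""

    def __init__(self):
        self.values = {}
        self.formulas = {}      # cell -> (function, list of input cells)
        self.dependents = {}    # cell -> formula cells to recompute on change

    def set_formula(self, cell, fn, inputs):
        self.formulas[cell] = (fn, inputs)
        for i in inputs:
            self.dependents.setdefault(i, []).append(cell)
        self._recompute(cell)

    def stream_value(self, cell, value):
        """Deliver one streamed datum into `cell`, propagating updates."""
        self.values[cell] = value
        for d in self.dependents.get(cell, []):
            self._recompute(d)

    def _recompute(self, cell):
        fn, inputs = self.formulas[cell]
        if all(i in self.values for i in inputs):   # wait until inputs exist
            self.values[cell] = fn(*(self.values[i] for i in inputs))
            for d in self.dependents.get(cell, []):
                self._recompute(d)                  # cascade downstream
```

A two-way web-service binding would sit on top of `stream_value`, pushing fetched responses into source cells and posting changed cell values back out.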
|
6 |
Exploratory Visualization of Data Pattern Changes in Multivariate Data Streams
Xie, Zaixian, 21 October 2011
More and more researchers are focusing on the management, querying and pattern mining of streaming data. The visualization of streaming data, however, is still a very new topic. Streaming data is very similar to time-series data, since each data point has a time dimension. Although the latter has been well studied in the area of information visualization, a key characteristic of streaming data, its unbounded and large-scale input, is rarely investigated. Moreover, most techniques for visualizing time-series data focus on univariate data and seldom convey multidimensional relationships, which is an important requirement in many application areas. Therefore, it is necessary to develop appropriate techniques for streaming data instead of directly applying time-series visualization techniques to it.
As one of the main contributions of this dissertation, I introduce a user-driven approach for the visual analytics of multivariate data streams based on effective visualizations via a combination of windowing and sampling strategies. To help users identify and track how data patterns change over time, the display includes not only the current sliding window content but also abstractions of past data in which users are interested. Sampling is applied within each single time window to help reduce visual clutter as well as preserve data patterns. Sampling ratios scheduled for different windows reflect the degree of user interest in the content. A degree of interest (DOI) function is used to represent a user's interest in different windows of the data. Users can apply two types of pre-defined DOI functions, namely RC (recent change) and PP (periodic phenomena) functions. The developed tool also allows users to interactively adjust DOI functions, in a manner similar to transfer functions in volume visualization, to enable a trial-and-error exploration process. In order to visually convey the change of multidimensional correlations, four layout strategies were designed. User studies showed that three of these are effective techniques for conveying data pattern changes compared to traditional time-series data visualization techniques. Based on this evaluation, a guide for the selection of appropriate layout strategies was derived, considering the characteristics of the targeted datasets and data analysis tasks. Case studies were used to show the effectiveness of DOI functions and the various visualization techniques.
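The DOI-scheduled sampling can be sketched as follows. The falloff shape, ratios, and names below are assumptions for illustration (the dissertation's RC and PP functions are more elaborate): an RC-style DOI decays with window age, and each window is sampled at a ratio proportional to its DOI, so recent windows keep more detail than old ones.

```python
import random

def rc_doi(window_age, decay=0.5):
    """'Recent change'-style degree of interest: an assumed exponential
    falloff with window age (age 0 = current window)."""
    return decay ** window_age

def sample_windows(windows, min_ratio=0.05, seed=0):
    """Sample each window at a DOI-driven ratio; windows[0] is current."""
    rng = random.Random(seed)
    out = []
    for age, window in enumerate(windows):
        ratio = max(min_ratio, rc_doi(age))
        k = max(1, round(ratio * len(window))) if window else 0
        out.append(rng.sample(window, min(k, len(window))))
    return out
```

A PP-style function would instead assign high DOI to windows at a fixed period (e.g. the same hour every day), and interactive adjustment amounts to letting the user reshape the `rc_doi` curve.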
A second contribution of this dissertation is a data-driven framework that merges, and thus condenses, time windows having small or no changes, distorting the time axis so that only significant changes are shown to users. Pattern vectors are introduced as a compact format for representing the discovered data model. Three views (juxtaposed views, pattern vector views, and pattern change views) were developed for conveying data pattern changes. The first shows more details of the data but needs more canvas space; the last two need much less canvas space by conveying only the pattern parameters, but lose many data details. The experiments showed that the proposed merge algorithms preserve more change information than intuitive pattern-blind averaging. A user study also confirmed that the proposed techniques can help users find pattern changes more quickly than a non-distorted time axis.
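The merge idea can be sketched with a deliberately simple pattern vector. The choice of (mean, standard deviation) and the threshold below are assumptions, not the dissertation's actual model: adjacent windows whose pattern vectors differ by less than a threshold are condensed into one, so only significant changes survive on the (distorted) time axis.

```python
import math

def pattern_vector(window):
    """Assumed compact pattern: (mean, standard deviation) of a window."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    return (mean, math.sqrt(var))

def merge_windows(windows, threshold=0.5):
    """Condense adjacent windows with near-identical pattern vectors."""
    merged = [list(windows[0])]
    for w in windows[1:]:
        if math.dist(pattern_vector(merged[-1]), pattern_vector(w)) < threshold:
            merged[-1].extend(w)          # no significant change: condense
        else:
            merged.append(list(w))        # significant change: keep boundary
    return merged
```

Pattern-blind averaging, by contrast, would merge on a fixed schedule regardless of content, which is why it loses change information.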
A third contribution of this dissertation is a set of history views, with related interaction techniques, developed to work under two modes: non-merge and merge. In the former mode, the framework can use natural hierarchical time units, or units defined by domain experts, to represent timelines. This can help users navigate across long time periods. Grid and virtual calendar views were designed to provide a compact overview of the history data. In addition, MDS pattern starfields, distance maps, and pattern brushes were developed to enable users to quickly investigate the degree of pattern similarity among different time periods. In the merge mode, merge algorithms are applied to selected time windows to generate a merge-based hierarchy: contiguous time windows having similar patterns are merged first. Users can choose different levels of merging, trading off more detail in the data against less visual clutter in the visualizations. The usability evaluation demonstrated that most participants could understand the concepts of the history views correctly and finished the assigned tasks with high accuracy and relatively fast response times.
|
7 |
Machine Learning Methods for Network Intrusion Detection and Intrusion Prevention Systems
Stefanova, Zheni Svetoslavova, 03 July 2018
Given the continuing advancement of networking applications and our increased dependence upon software-based systems, there is a pressing need to develop improved security techniques for defending modern information technology (IT) systems from malicious cyber-attacks. Indeed, anyone can be impacted by such activities, including individuals, corporations, and governments. Furthermore, the sustained expansion of the network user base and its associated set of applications is also introducing additional vulnerabilities which can lead to criminal breaches and loss of critical data. As a result, the broader cybersecurity problem area has emerged as a significant concern, with many solution strategies being proposed for both intrusion detection and prevention. In general, the cybersecurity dilemma can be treated as a conflict-resolution setup entailing a security system and a minimum of two decision agents with competing goals (e.g., the attacker and the defender). Namely, on the one hand, the defender is focused on guaranteeing that the system operates at or above an adequate (specified) level. Conversely, the attacker is focused on trying to interrupt or corrupt the system's operation.
In light of the above, this dissertation introduces novel methodologies to build appropriate strategies for system administrators (defenders). In particular, detailed mathematical models of security systems are developed to analyze overall performance and predict the likely behavior of the key decision makers influencing the protection structure. The initial objective here is to create a reliable intrusion detection mechanism to help identify malicious attacks at a very early stage, i.e., in order to minimize potentially critical consequences and damage to system privacy and stability. Furthermore, another key objective is also to develop effective intrusion prevention (response) mechanisms. Along these lines, a machine learning based solution framework is developed consisting of two modules. Specifically, the first module prepares the system for analysis and detects whether or not there is a cyber-attack. Meanwhile, the second module analyzes the type of the breach and formulates an adequate response. Namely, a decision agent is used in the latter module to investigate the environment and make appropriate decisions in the case of uncertainty. This agent starts by conducting its analysis in a completely unknown milieu but continually learns to adjust its decision making based upon the provided feedback. The overall system is designed to operate in an automated manner without any intervention from administrators or other cybersecurity personnel. Human input is essentially only required to modify some key model (system) parameters and settings. Overall, the framework developed in this dissertation provides a solid foundation from which to develop improved threat detection and protection mechanisms for static setups, with further extensibility for handling streaming data.
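The two-module design can be sketched as follows. This is an assumed, heavily simplified stand-in for the dissertation's framework: module 1 flags a possible attack from an anomaly score, and module 2 is an epsilon-greedy decision agent that learns, from feedback, which response works best per attack type; it starts with no knowledge and adjusts as rewards arrive.

```python
import random

class ResponseAgent:
    """Illustrative detect-then-respond pipeline (assumed design):
    threshold-based detection plus an epsilon-greedy response learner."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = {}    # (attack_type, action) -> running mean reward
        self.count = {}

    def detect(self, anomaly_score, threshold=0.8):
        """Module 1: decide whether traffic looks like an attack."""
        return anomaly_score >= threshold

    def respond(self, attack_type):
        """Module 2: pick a response, exploring with probability epsilon."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.value.get((attack_type, a), 0.0))

    def feedback(self, attack_type, action, reward):
        """Incorporate environment feedback as an incremental mean."""
        key = (attack_type, action)
        self.count[key] = self.count.get(key, 0) + 1
        v = self.value.get(key, 0.0)
        self.value[key] = v + (reward - v) / self.count[key]
```

The epsilon parameter captures the "conducting its analysis in a completely unknown milieu" aspect: early on the agent explores, and as feedback accumulates it exploits the learned response values.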
|
8 |
Efficient Handling of Narrow Width and Streaming Data in Embedded Applications
Li, Bengu, January 2006
The embedded environment imposes severe constraints on system resources for embedded applications. Performance, memory footprint, and power consumption are critical factors for embedded applications. Meanwhile, the data in embedded applications demonstrate unique properties. More specifically, narrow width data are data representable in considerably fewer bits than one word, which nevertheless occupy an entire register or memory word, while streaming data are input data processed by an application sequentially, which stay in the system for a short duration and thus exhibit little data locality. Narrow width and streaming data affect the efficiency of registers, caches, and memory, and must be taken into account when optimizing for performance, memory footprint, and power consumption. This dissertation proposes methods to efficiently handle narrow width and streaming data in embedded applications. Quantitative measurements of narrow width and streaming data are performed to provide guidance for optimizations. Novel architectural features and associated compiler algorithms are developed. To efficiently handle narrow width data in registers, two register allocation schemes are proposed for the ARM processor to allocate two narrow width variables to one register. A static scheme exploits maximum bitwidth. A speculative scheme further exploits dynamic bitwidth. Both result in reduced spill cost and performance improvement. To efficiently handle narrow width data in memory, a memory layout method is proposed to coalesce multiple narrow width data in one memory location in a DSP processor, leading to fewer explicit address calculations. This method improves performance and shrinks the memory footprint. To efficiently handle streaming data in a network processor, two cache mechanisms are proposed to enable the reuse of data and computation. The slack created is further transformed into a reduction in energy consumption through a fetch gating mechanism.
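The coalescing idea itself is simple bit arithmetic. The toy below illustrates it in Python integers rather than in hardware registers or DSP memory, which is where the dissertation actually applies it: two 16-bit values share one 32-bit word, halving the number of words a narrow-width array occupies.

```python
def pack16(a, b):
    """Coalesce two narrow (16-bit) values into one 32-bit word."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    return (a << 16) | b

def unpack16(word):
    """Recover the two 16-bit halves of a packed word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def coalesce(narrow_values):
    """Lay out 16-bit values two per word, halving the word count."""
    if len(narrow_values) % 2:
        narrow_values = narrow_values + [0]     # pad to a full word
    return [pack16(narrow_values[i], narrow_values[i + 1])
            for i in range(0, len(narrow_values), 2)]
```

The register-allocation schemes play the same trick on live variables instead of array elements, which is why knowing each variable's maximum (or likely dynamic) bitwidth matters: packing is only safe when both halves actually fit.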
|
9 |
Erbium: Reconciling languages, runtimes, compilation and optimizations for streaming applications
Miranda, Cupertino, 11 February 2013
As transistor scaling and power limitations struck the computer industry, hardware parallelism arose as the solution, bringing old, forgotten problems back into the equation in order to overcome the limitations of current parallel technologies. Compilers regain focus as the most relevant piece of the puzzle in the quest for the computer performance improvements predicted by Moore's law, which are no longer attainable without parallelism. Parallelism research has mainly focused on either the language or the architectural aspects, without giving the needed attention to compiler problems; this is the reason for the weak compiler support offered by many parallel languages and architectures, which prevents performance from being exploited to the best. This thesis addresses these problems by presenting: Erbium, a low-level streaming data-flow language supporting multiple-producer and multiple-consumer task communication; a very efficient runtime implementation for x86 architectures, also addressing other types of architectures; a compiler integration of the language as an intermediate representation in GCC; and a study of the dependencies of the language primitives, allowing compilers to further optimise Erbium code not only through specific parallel optimisations but also through traditional compiler optimisations, such as partial redundancy elimination and dead code elimination.
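The multiple-producer, multiple-consumer communication pattern that Erbium exposes at the language level can be emulated in a few lines with ordinary threads and a shared queue. This sketch is purely illustrative (Erbium is a compiled data-flow language, not a Python library); names and structure are assumptions.

```python
import queue
import threading

def run_stream(producers, consumers, items_per_producer=100):
    """Toy multi-producer / multi-consumer stream: producers push tagged
    items into a shared queue, consumers drain it until a sentinel."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def produce(pid):
        for i in range(items_per_producer):
            q.put((pid, i))

    def consume():
        while True:
            item = q.get()
            if item is None:             # sentinel: stream finished
                break
            with lock:
                results.append(item)

    pthreads = [threading.Thread(target=produce, args=(p,))
                for p in range(producers)]
    cthreads = [threading.Thread(target=consume) for _ in range(consumers)]
    for t in pthreads + cthreads:
        t.start()
    for t in pthreads:
        t.join()
    for _ in cthreads:                   # one sentinel per consumer
        q.put(None)
    for t in cthreads:
        t.join()
    return results
```

In Erbium the compiler sees these producer/consumer dependencies explicitly in the intermediate representation, which is what allows both parallel-specific and classical optimisations to be applied to the communication code.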
|
10 |
Clustering Techniques for Mining and Analysis of Evolving Data
Devagiri, Vishnu Manasa, January 2021
The amount of data generated is on the rise due to increased demand in fields like IoT, smart monitoring applications, etc. Data generated through such systems have many distinct characteristics, including continuous generation, an evolutionary and multi-source nature, and heterogeneity. In addition, the real-world data generated in these fields are largely unlabelled. Clustering is an unsupervised learning technique used to group, analyze and interpret unlabelled data. Conventional clustering algorithms are not suitable for dealing with data having the previously mentioned characteristics due to memory and computational constraints, their inability to handle concept drift, and the distributed location of data. Therefore, novel clustering approaches capable of analyzing and interpreting evolving and/or multi-source streaming data are needed. The thesis is focused on building evolutionary clustering algorithms for data that evolve over time. We have initially proposed an evolutionary clustering approach, entitled Split-Merge Clustering (Paper I), capable of continuously updating the generated clustering solution in the presence of new data. Through the progression of the work, new challenges have been studied and addressed. Namely, the Split-Merge Clustering algorithm has been enhanced in Paper II with new capabilities to deal with the challenges of multi-view data applications. Multi-view or multi-source data present the studied phenomenon/system from different perspectives (views), and can reveal interesting knowledge that is not visible when only one view is considered and analyzed. This has motivated us to continue in this direction by designing two other novel multi-view data stream clustering algorithms. The algorithm proposed in Paper III improves the performance and interpretability of the algorithm proposed in Paper II.
Paper IV introduces a minimum spanning tree based multi-view clustering algorithm capable of transferring knowledge between consecutive data chunks, and it is also enriched with a post-clustering pattern-labeling procedure. The proposed and studied evolutionary clustering algorithms are evaluated on various data sets. The obtained results have demonstrated the robustness of the algorithms for modeling, analyzing, and mining evolving data streams. They are able to adequately adapt single and multi-view clustering models by continuously integrating newly arriving data.
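The flavor of an evolutionary split-merge update can be conveyed with a 1-D sketch. This is an assumed simplification, not Paper I's actual algorithm: a new data chunk updates existing centroids incrementally, points far from every centroid split off a new cluster, and centroids that drift close together are merged.

```python
def split_merge_step(centroids, chunk, assign_radius=2.0, merge_eps=1.0):
    """Illustrative 1-D split-merge update (assumed design): fold one
    chunk into the current centroids, splitting and merging as needed."""
    cents = list(centroids)
    counts = [1] * len(cents)
    for x in chunk:
        if cents:
            j = min(range(len(cents)), key=lambda i: abs(cents[i] - x))
        if not cents or abs(cents[j] - x) > assign_radius:
            cents.append(float(x))
            counts.append(1)                        # split: start a new cluster
        else:
            counts[j] += 1
            cents[j] += (x - cents[j]) / counts[j]  # running-mean update
    # Merge pass: combine centroids that ended up within merge_eps.
    merged, mcounts = [], []
    for c, n in sorted(zip(cents, counts)):
        if merged and c - merged[-1] < merge_eps:
            total = mcounts[-1] + n
            merged[-1] = (merged[-1] * mcounts[-1] + c * n) / total
            mcounts[-1] = total
        else:
            merged.append(c)
            mcounts.append(n)
    return merged
```

Repeating this step chunk by chunk is what lets the clustering solution evolve with the stream instead of being recomputed from scratch; the multi-view extensions would run one such model per view and then align the per-view solutions.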
|