Global ETD Search

31	Elastic Data Stream Processing Heinze, Thomas 27 October 2021 (has links) Data stream processing systems are used to process data from high velocity data sources like financial, sensor, or logistics data. Many use cases force these systems to use a distributed setup to be able to fulfill the strict requirements regarding expected system throughput and end-to-end latency. The major challenge for a distributed data stream processing system is unpredictable load peaks. Most systems use overprovisioning to solve this problem, which leads to a low system utilization and high monetary cost for the user. This doctoral thesis studies a potential solution to this problem by automatic scaling in or out based on the changing workload. This approach is called elastic scaling and allows a cost-efficient execution of the system with a high quality of service. In this thesis, we present our elastic scaling data stream processing system FUGU and address three major challenges of such systems: 1) consideration of user-defined end-to-end latency constraints during the elastic scaling, 2) study of different auto-scaling techniques, and 3) combination of elastic scaling with different fault tolerance techniques. First, we demonstrate how our system considers user-defined end-to-end latency constraints during the scaling decisions. Each scaling decision causes short latency peaks, because the processing needs to be paused while operators are moved. FUGU estimates the latency peaks for different scaling decisions, tries to minimize the created latency peak and at the same time to achieve similar monetary costs like alternative approaches. Second, we study different auto-scaling techniques for elastic-scaling data stream processing systems. Auto-scaling techniques are a very important part of such systems as they derive the scaling decisions. In this thesis, we study three auto-scaling techniques: Threshold-based Scaling, Reinforcement Learning and the novel Online Parameter Optimization. The Online Parameter Optimization overcomes the shortcomings of the two other approaches by avoiding manual tuning and being robust towards different workload patterns. Finally, we present an integration of an elastic scaling with different replication techniques for high availability to allow to minimize the spent monetary cost and to ensure at the same time a maximal recovery time. We leverage two replication approaches in FUGU and evaluate a trade-off between recovery time and overhead. FUGU estimates the recovery time and adaptively optimizes the used replication technique for each operator. All these contributions are carefully evaluated in three real-world scenarios and we discuss the relationship of our contributions towards related work. Read more Datenströme, Elastische Skalierung info:eu-repo/classification/ddc/004 ddc:004
32	Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing Ji, Yuanzhen 28 November 2017 (has links) Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including finance markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency have promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of the data imperfection within the streams, or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated—even erased—by enhancing the DSPS itself. This dissertation seeks to advance the state of the art on handling the performance versus result-quality tradeoffs in data stream processing caused by the above two aspects of reasons. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder. Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account. Read more info:eu-repo/classification/ddc/004 ddc:004 Datenstromverarbeitung Data Stream Processing
33	Fault Tolerant Distributed Complex Event Processing on Stream Computing Platforms Carbone, Paris January 2013 (has links) Recent advances in reliable distributed computing have made it possible to provide high availability and scalability to traditional systems and thus serve them as reliable services. For some systems, their parallel nature in addition to weak consistency requirements allowed a more trivial transision such as distributed storage, online data analysis, batch processing and distributed stream processing. On the other hand, systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, being able to offer high expressiveness at the expense of low distribution. In this work, we address the main challenges of providing a highly-available Distributed CEP service with a focus on reliability, since it is the most crucial and untouched aspect of that transition. The experimental solution presented targets low average detection latency and leverages event delegation mechanisms present on existing stream execution platforms and in-memory logging to provide availability of any complex event processing abstraction on top via redundancy and partial recovery. distributed systems computer systems stream processing complex event processing Computer Systems Datorsystem
34	Spatio-Temporal Stream Reasoning with Adaptive State Stream Generation de Leng, Daniel January 2017 (has links) A lot of today's data is generated incrementally over time by a large variety of producers. This data ranges from quantitative sensor observations produced by robot systems to complex unstructured human-generated texts on social media. With data being so abundant, making sense of these streams of data through reasoning is challenging. Reasoning over streams is particularly relevant for autonomous robotic systems that operate in a physical environment. They commonly observe this environment through incremental observations, gradually refining information about their surroundings. This makes robust management of streaming data and its refinement an important problem. Many contemporary approaches to stream reasoning focus on the issue of querying data streams in order to generate higher-level information by relying on well-known database approaches. Other approaches apply logic-based reasoning techniques, which rarely consider the provenance of their symbolic interpretations. In this thesis, we integrate techniques for logic-based spatio-temporal stream reasoning with the adaptive generation of the state streams needed to do the reasoning over. This combination deals with both the challenge of reasoning over streaming data and the problem of robustly managing streaming data and its refinement. The main contributions of this thesis are (1) a logic-based spatio-temporal reasoning technique that combines temporal reasoning with qualitative spatial reasoning; (2) an adaptive reconfiguration procedure for generating and maintaining a data stream required to perform spatio-temporal stream reasoning over; and (3) integration of these two techniques into a stream reasoning framework. The proposed spatio-temporal stream reasoning technique is able to reason with intertemporal spatial relations by leveraging landmarks. Adaptive state stream generation allows the framework to adapt in situations in which the set of available streaming resources changes. Management of streaming resources is formalised in the DyKnow model, which introduces a configuration life-cycle to adaptively generate state streams. The DyKnow-ROS stream reasoning framework is a concrete realisation of this model that extends the Robot Operating System (ROS). DyKnow-ROS has been deployed on the SoftBank Robotics NAO platform to demonstrate the system's capabilities in the context of a case study on run-time adaptive reconfiguration. The results show that the proposed system – by combining reasoning over and reasoning about streams – can robustly perform spatio-temporal stream reasoning, even when the availability of streaming resources changes. / <p>The series name <em>Linköping Studies in Science and Technology Licentiate Thesis</em> is inocorrect. The correct series name is <em>Linköping Studies in Science and Technology Thesis</em>.</p> / NFFP6 / CENIIT Read more stream reasoning stream processing temporal reasoning spatial reasoning configuration planning intelligent robotics Computer Sciences Datavetenskap (datalogi)
35	Characterizing and Accelerating Deep Learning and Stream Processing Workloads using Roofline Trajectories Javed, Muhammad Haseeb January 2019 (has links) No description available. Computer Science Roofline model Deep Learning Kafka Stream Processing Message Broker RDMA TensorFlow
36	Hardware Utilisation Techniques for Data Stream Processing Meldrum, Max January 2019 (has links) Recent years have seen an increase in use of the stream processing architecture to compose continuous analytics applications. This thesis presents the design of a Rust-based stream processor that adopts two separate techniques to tackle existing weaknesses in modern production-grade stream processors. The first technique employs a data analytics language on top of the streaming runtime, in order to provide both dataflow and low-level compiler optimisations. This technique is motivated by an analysis of the impact that the lack of compiler integration may have on the end-to-end performance of streaming pipelines in Apache Flink. In the second technique streaming operators are scheduled using a task-parallel approach to boost performance for skewed data distributions. The experimental results for data-parallel streaming pipelines in this thesis demonstrate, that the scheduling model of the prototype achieves performance improvements in skewed scenarios without exhibiting any significant losses in performance during uniform distributions. / Under senare år har användningen av strömbearbetningsarkitekturen ökat för att komponera kontinuerliga analysapplikationer. Denna avhandling presenterar designen av en Rust-baserad strömprocessor som använder två separata tekniker för att hantera befintliga svagheter i moderna strömprocessorer. Den första tekniken använder ett dataanalysspråk ovanpå strömprocessorn, för att ge både dataflöde och kompilatoroptimeringar på låg nivå. Denna teknik är motiverad av en analys av påverkan som bristen på kompilatorintegration kan ha på den slutliga prestandan för analysapplikationer i Apache Flink. I den andra tekniken schemaläggs strömningsoperatörer med hjälp av en uppgiftsparallell metod för att öka prestanda för skev datadistribution. De experimentella resultaten för data-parallella analysapplikationer i denna avhandling visar att schemaläggningsmodellen för prototypen uppnår prestandaförbättringar i ojämna distributioner utan att uppvisa några betydande förluster i prestanda under enhetliga fördelningar. Read more Data Analytics Data Stream Processing Distributed Systems Computer and Information Sciences Data- och informationsvetenskap
37	Improving the performance of stream processing pipeline for vehicle data Gu, Wenyu January 2020 (has links) The growing amount of position-dependent data (containing both geo position data (i.e. latitude, longitude) and also vehicle/driver-related information) collected from sensors on vehicles poses a challenge to computer programs to process the aggregate amount of data from many vehicles. While handling this growing amount of data, the computer programs that process this data need to exhibit low latency and high throughput – as otherwise the value of the results of this processing will be reduced. As a solution, big data and cloud computing technologies have been widely adopted by industry. This thesis examines a cloud-based processing pipeline that processes vehicle location data. The system receives real-time vehicle data and processes the data in a streaming fashion. The goal is to improve the performance of this streaming pipeline, mainly with respect to latency and cost. The work began by looking at the current solution using AWS Kinesis and AWS Lambda. A benchmarking environment was created and used to measure the current system’s performance. Additionally, a literature study was conducted to find a processing framework that best meets both industrial and academic requirements. After a comparison, Flink was chosen as the new framework. A new solution was designed to use Fink. Next the performance of the current solution and the new Flink solution were compared using the same benchmarking environment and. The conclusion is that the new Flink solution has 86.2% lower latency while supporting triple the throughput of the current system at almost same cost. / Den växande mängden positionsberoende data (som innehåller både geo-positionsdata (dvs. latitud, longitud) och även fordons- / förarelaterad information) som samlats in från sensorer på fordon utgör en utmaning för datorprogram att bearbeta den totala mängden data från många fordon. Medan den här växande mängden data hanteras måste datorprogrammen som behandlar dessa datauppvisa låg latens och hög genomströmning - annars minskar värdet på resultaten av denna bearbetning. Som en lösning har big data och cloud computing-tekniker använts i stor utsträckning av industrin. Denna avhandling undersöker en molnbaserad bearbetningspipeline som bearbetar fordonsplatsdata. Systemet tar emot fordonsdata i realtid och behandlar data på ett strömmande sätt. Målet är att förbättra prestanda för denna strömmande pipeline, främst med avseende på latens och kostnad. Arbetet började med att titta på den nuvarande lösningen med AWS Kinesis och AWS Lambda. En benchmarking-miljö skapades och användes för att mäta det aktuella systemets prestanda. Dessutom genomfördes en litteraturstudie för att hitta en bearbetningsram som bäst uppfyller både industriella och akademiska krav. Efter en jämförelse valdes Flink som det nya ramverket. En nylösning designades för att använda Fink. Därefter jämfördes prestandan för den nuvarande lösningen och den nya Flink-lösningen med samma benchmarking-miljö och. Slutsatsen är att den nya Flink-lösningen har 86,2% lägre latens samtidigt som den stöder tredubbla kapaciteten för det nuvarande systemet till nästan samma kostnad. Read more Cloud computing Stream processing Flink AWS Molntjänster Strömbearbetning Flink AWS Computer and Information Sciences Data- och informationsvetenskap
38	Minimizing Overhead for Fault Tolerance in Event Stream Processing Systems Martin, André 20 September 2016 (has links) (PDF) Event Stream Processing (ESP) is a well-established approach for low-latency data processing enabling users to quickly react to relevant situations in soft real-time. In order to cope with the sheer amount of data being generated each day and to cope with fluctuating workloads originating from data sources such as Twitter and Facebook, such systems must be highly scalable and elastic. Hence, ESP systems are typically long running applications deployed on several hundreds of nodes in either dedicated data-centers or cloud environments such as Amazon EC2. In such environments, nodes are likely to fail due to software aging, process or hardware errors whereas the unbounded stream of data asks for continuous processing. In order to cope with node failures, several fault tolerance approaches have been proposed in literature. Active replication and rollback recovery-based on checkpointing and in-memory logging (upstream backup) are two commonly used approaches in order to cope with such failures in the context of ESP systems. However, these approaches suffer either from a high resource footprint, low throughput or unresponsiveness due to long recovery times. Moreover, in order to recover applications in a precise manner using exactly once semantics, the use of deterministic execution is required which adds another layer of complexity and overhead. The goal of this thesis is to lower the overhead for fault tolerance in ESP systems. We first present StreamMine3G, our ESP system we built entirely from scratch in order to study and evaluate novel approaches for fault tolerance and elasticity. We then present an approach to reduce the overhead of deterministic execution by using a weak, epoch-based rather than strict ordering scheme for commutative and tumbling windowed operators that allows applications to recover precisely using active or passive replication. Since most applications are running in cloud environments nowadays, we furthermore propose an approach to increase the system availability by efficiently utilizing spare but paid resources for fault tolerance. Finally, in order to free users from the burden of choosing the correct fault tolerance scheme for their applications that guarantees the desired recovery time while still saving resources, we present a controller-based approach that adapts fault tolerance at runtime. We furthermore showcase the applicability of our StreamMine3G approach using real world applications and examples. Read more Complex Event Processing CEP Event Stream Processing ESP Skalierung Migration Zustandsmanagement Fehlertoleranz Complex Event Processing CEP Event Stream Processing ESP Scalability Migration State Management Fault Tolerance ddc:004 rvk:ST 257 rvk:ST 200 rvk:ST 234 rvk:ST 233
39	Minimizing Overhead for Fault Tolerance in Event Stream Processing Systems Martin, André 17 December 2015 (has links) Event Stream Processing (ESP) is a well-established approach for low-latency data processing enabling users to quickly react to relevant situations in soft real-time. In order to cope with the sheer amount of data being generated each day and to cope with fluctuating workloads originating from data sources such as Twitter and Facebook, such systems must be highly scalable and elastic. Hence, ESP systems are typically long running applications deployed on several hundreds of nodes in either dedicated data-centers or cloud environments such as Amazon EC2. In such environments, nodes are likely to fail due to software aging, process or hardware errors whereas the unbounded stream of data asks for continuous processing. In order to cope with node failures, several fault tolerance approaches have been proposed in literature. Active replication and rollback recovery-based on checkpointing and in-memory logging (upstream backup) are two commonly used approaches in order to cope with such failures in the context of ESP systems. However, these approaches suffer either from a high resource footprint, low throughput or unresponsiveness due to long recovery times. Moreover, in order to recover applications in a precise manner using exactly once semantics, the use of deterministic execution is required which adds another layer of complexity and overhead. The goal of this thesis is to lower the overhead for fault tolerance in ESP systems. We first present StreamMine3G, our ESP system we built entirely from scratch in order to study and evaluate novel approaches for fault tolerance and elasticity. We then present an approach to reduce the overhead of deterministic execution by using a weak, epoch-based rather than strict ordering scheme for commutative and tumbling windowed operators that allows applications to recover precisely using active or passive replication. Since most applications are running in cloud environments nowadays, we furthermore propose an approach to increase the system availability by efficiently utilizing spare but paid resources for fault tolerance. Finally, in order to free users from the burden of choosing the correct fault tolerance scheme for their applications that guarantees the desired recovery time while still saving resources, we present a controller-based approach that adapts fault tolerance at runtime. We furthermore showcase the applicability of our StreamMine3G approach using real world applications and examples. Read more info:eu-repo/classification/ddc/004 ddc:004
40	External Streaming State Abstractions and Benchmarking / Extern strömmande statliga abstraktioner och benchmarking Sree Kumar, Sruthi January 2021 (has links) Distributed data stream processing is a popular research area and is one of the promising paradigms for faster and efficient data management. Application state is a first-class citizen in nearly every stream processing system. Nowadays, stream processing is, by definition, stateful. For a stream processing application, the state is backing operations such as aggregations, joins, and windows. Apache Flink is one of the most accepted and widely used stream processing systems in the industry. One of the main reasons engineers choose Apache Flink to write and deploy continuous applications is its unique combination of flexibility and scalability for stateful programmability, and the firm guarantee that the system ensures. Apache Flink’s guarantees always make its states correct and consistent even when nodes fail or when the number of tasks changes. Flink state can scale up to its compute node’s hard disk boundaries using embedded databases to store and retrieve data. Nevertheless, in all existing state backends officially supported by Flink, the state is always available locally to compute tasks. Even though this makes deployment more convenient, it creates other challenges such as non-trivial state reconfiguration and failure recovery. At the same time, compute, and state are bound to be tightly coupled. This strategy also leads to over-provisioning and is counterintuitive on state intensive only workloads or compute-intensive only workloads. This thesis investigates an alternative state backend architecture, FlinkNDB, which can tackle these challenges. FlinkNDB decouples state and computes by using a distributed database to store the state. The thesis covers the challenges of existing state backends and design choices and the new state backend implementation. We have evaluated the implementation of FlinkNDB against existing state backends offered by Apache Flink. / Distribuerad dataströmsbehandling är ett populärt forskningsområde och är ett av de lovande paradigmen för snabbare och effektivare datahantering. Applicationstate är en förstklassig medborgare i nästan alla strömbehandlingssystem. Numera är strömbearbetning per definition statlig. För en strömbehandlingsapplikation backar staten operationer som aggregeringar, sammanfogningar och windows. Apache Flink är ett av de mest accepterade och mest använda strömbehandlingssystemen i branschen. En av de främsta anledningarna till att ingenjörer väljer ApacheFlink för att skriva och distribuera kontinuerliga applikationer är dess unika kombination av flexibilitet och skalbarhet för statlig programmerbarhet, och företaget garanterar att systemet säkerställer. Apache Flinks garantier gör alltid dess tillstånd korrekt och konsekvent även när noder misslyckas eller när antalet uppgifter ändras. Flink-tillstånd kan skala upp till dess beräkningsnods hårddiskgränser genom att använda inbäddade databaser för att lagra och hämta data. I allmänna tillståndsstöd som officiellt stöds av Flink är staten dock alltid tillgänglig lokalt för att beräkna uppgifter. Även om detta gör installationen bekvämare, skapar det andra utmaningar som icke-trivial tillståndskonfiguration och felåterställning. Samtidigt måste beräkning och tillstånd vara tätt kopplade. Den här strategin leder också till överanvändning och är kontraintuitiv för statligt intensiva endast arbetsbelastningar eller beräkningsintensiva endast arbetsbelastningar. Denna avhandling undersöker en alternativ statsbackendarkitektur, FlinkNDB, som kan hantera dessa utmaningar. FlinkNDB frikopplar tillstånd och beräknar med hjälp av en distribuerad databas för att lagra tillståndet. Avhandlingen täcker utmaningarna med befintliga statliga backends och designval och den nya implementeringen av statebackend. Vi har utvärderat genomförandet av FlinkNDBagainst befintliga statliga backends som erbjuds av Apache Flink. Read more Apache Flink Distributed Systems NDB FlinkNDB State State Backends External State Stream Processing Systems Benchmarking Caching Apache Flink Distributed Systems NDB FlinkNDB State State Backends External State Stream Processing Systems Benchmarking Caching Computer and Information Sciences Data- och informationsvetenskap

Search results