Global ETD Search

Return to search

Efficient Distributed Processing Over Micro-batched Data Streams

<div><div><div><p>Advances in real-world applications require high-throughput processing over large data streams. Micro-batching is a promising computational model to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application’s service-level agreement. Compared to native tuple-at-a-time data stream processing, micro- batching can sustain higher data rates. However, existing micro-batch stream processing systems lack Load-awareness optimizations that are necessary to maintain performance and enhance resource utilization. In this thesis, we investigate the micro-batching paradigm and pinpoint some of its design principles that can benefit from further optimization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. Prompt enables a balanced input to the batching and processing cycles of the micro-batching model. Prompt achieves higher throughput process- ing with an increase in resource utilization. Moreover, Prompt+ is proposed to enforce la- tency by elastically adapting resource consumption according to workload changes. More specifically, Prompt+ employs a scheduling strategy that supports elasticity in response to workload changes while avoiding rescheduling bottlenecks. Moreover, we envision the use of deep reinforcement learning to efficiently partition data in distributed streaming systems. PartLy demonstrates the use of artificial neural networks to facilitate the learning of efficient partitioning policies that match the dynamic nature of streaming workloads. Finally, all the proposed techniques are abstracted and generalized over three widely used stream process- ing engines. Experimental results using real and synthetic data sets demonstrate that the proposed techniques are robust against fluctuations in data distribution and arrival rates. Furthermore, it achieves up to 5x improvement in system throughput over state-of-the-art techniques without degradation in latency.</p></div></div></div>

10.25394/pgs.14386517.v1

Database Management

Distributed Stream Processing

Identifer	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/14386517
Date	07 May 2021
Creators	Ahmed Abdelhamid (10539053)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/thesis/Efficient_Distributed_Processing_Over_Micro-batched_Data_Streams/14386517

Page generated in 0.0029 seconds

Efficient Distributed Processing Over Micro-batched Data Streams

Description

Links & Downloads

Tags

Additional Fields