1.
Efficient Distributed Processing Over Micro-batched Data Streams. Ahmed Abdelhamid (10539053), 07 May 2021
Advances in real-world applications require high-throughput processing over large data streams. Micro-batching is a promising computational model to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved: the incoming data tuples are first buffered as data blocks and then processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that conforms to the application's service-level agreement. Compared to native tuple-at-a-time data stream processing, micro-batching can sustain higher data rates. However, existing micro-batch stream processing systems lack load-awareness optimizations that are necessary to maintain performance and enhance resource utilization. In this thesis, we investigate the micro-batching paradigm and pinpoint some of its design principles that can benefit from further optimization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. Prompt enables a balanced input to the batching and processing cycles of the micro-batching model, achieving higher-throughput processing with an increase in resource utilization. Moreover, Prompt+ is proposed to enforce latency by elastically adapting resource consumption according to workload changes. More specifically, Prompt+ employs a scheduling strategy that supports elasticity in response to workload changes while avoiding rescheduling bottlenecks. Moreover, we envision the use of deep reinforcement learning to efficiently partition data in distributed streaming systems. PartLy demonstrates the use of artificial neural networks to facilitate the learning of efficient partitioning policies that match the dynamic nature of streaming workloads. Finally, all the proposed techniques are abstracted and generalized over three widely used stream processing engines. Experimental results using real and synthetic data sets demonstrate that the proposed techniques are robust against fluctuations in data distribution and arrival rates. Furthermore, they achieve up to 5x improvement in system throughput over state-of-the-art techniques without degradation in latency.
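To illustrate the kind of imbalance a load-aware partitioner targets, the following Python sketch contrasts plain key hashing with a least-loaded assignment applied while tuples are being buffered into micro-batches. The helper names and the four-worker setup are illustrative assumptions, not Prompt's actual algorithm.

```python
from collections import defaultdict

def hash_partition(tuples, num_workers):
    """Baseline: route each tuple by key hash, ignoring current worker load."""
    batches = defaultdict(list)
    for key, value in tuples:
        batches[hash(key) % num_workers].append((key, value))
    return batches

def load_aware_partition(tuples, num_workers):
    """Illustrative load-aware variant: send each tuple to the worker whose
    pending micro-batch is currently smallest, so all workers finish the
    processing cycle at roughly the same time."""
    batches = defaultdict(list)
    for key, value in tuples:
        target = min(range(num_workers), key=lambda w: len(batches[w]))
        batches[target].append((key, value))
    return batches

if __name__ == "__main__":
    # A skewed stream: most tuples share one hot key.
    stream = [("hot", i) for i in range(90)] + [("cold", i) for i in range(10)]
    for name, fn in [("hash", hash_partition), ("load-aware", load_aware_partition)]:
        sizes = {w: len(b) for w, b in fn(stream, 4).items()}
        print(name, sizes)
```

The trade-off the sketch makes visible is that purely load-based routing gives up key locality, which is why practical partitioners balance the two concerns rather than replacing one with the other.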
2.
Benchmarking and Scheduling Strategies for Distributed Stream Processing. Shukla, Anshu, January 2017
The velocity dimension of Big Data refers to the need to rapidly process data that arrives continuously as streams of messages or events. Distributed Stream Processing Systems (DSPS) refer to distributed programming and runtime platforms that allow users to define a composition of dataflow logic that is executed on distributed resources over streams of incoming messages.
A DSPS uses commodity clusters and Cloud Virtual Machines (VMs) for its execution. In order to meet the required performance for these applications, the DSPS needs to schedule these dataflows efficiently over the resources. Despite their growing use, resource scheduling for DSPSs tends to be done in an ad hoc manner, favoring empirical and reactive approaches, rather than a model-driven and analytical approach. Such empirical strategies may arrive at an approximate schedule for the dataflow that needs further tuning to meet the quality of service.
We propose a model-based scheduling approach that makes use of performance profiles and benchmarks developed for tasks in the dataflow to plan both the resource allocation and the resource mapping that together form the schedule planning process. We propose the Model Based Allocation (MBA) and the Slot Aware Mapping (SAM) approaches that effectively utilize knowledge of the performance model of logic tasks to provide efficient and predictable scheduling behavior. We implement and validate these algorithms using the popular open-source Apache Storm DSPS for several micro and application dataflows. The results show that our model-driven approach is able to reduce the amount of required resources (VMs) by 30%-50% relative to existing techniques. Also, we see that our strategies offer a predictable behavior that ensures that the expected and actual rates supported and resources used match closely. This can enable deterministic schedule planning even under dynamic conditions.
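The contrast with empirical tuning can be made concrete with a small sketch of model-based allocation: given a benchmarked peak rate and selectivity for each task, derive the parallelism of every task and the VM count from the slot capacity. The dataflow, rates, and helper names below are assumptions for illustration, not the MBA or SAM implementations.

```python
import math

def allocate(dataflow, target_rate, slots_per_vm):
    """Derive task parallelism and VM count from per-instance benchmark rates.

    dataflow: ordered dict mapping task name -> (peak tuples/sec per instance,
              selectivity = output tuples per input tuple); assumed to be a
              linear pipeline so each task feeds the next one.
    """
    allocation = {}
    rate = target_rate
    for task, (peak_rate, selectivity) in dataflow.items():
        allocation[task] = math.ceil(rate / peak_rate)  # instances needed for this rate
        rate *= selectivity                              # rate seen by the next task
    total_slots = sum(allocation.values())
    vms = math.ceil(total_slots / slots_per_vm)
    return allocation, vms

if __name__ == "__main__":
    # Illustrative linear dataflow: parse -> filter -> aggregate
    dataflow = {
        "parse":     (5000, 1.0),   # 5k tuples/sec per instance, passes everything on
        "filter":    (8000, 0.4),   # drops 60% of tuples
        "aggregate": (3000, 0.1),
    }
    print(allocate(dataflow, target_rate=20000, slots_per_vm=4))
```

Because the plan is computed from measured profiles rather than trial runs, the expected rate per instance and the resulting resource count can be checked against what the deployment actually sustains.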
Besides this static scheduling, we also examine the ability to dynamically consolidate tasks onto fewer VMs when the load on the dataflow decreases or the VMs get fragmented. We propose reliable task migration models for Apache Storm dataflows that are able to rapidly move the task assignment in the cluster and resume the dataflow execution without any message loss.
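A minimal sketch of the pause-drain-restore-resume idea behind such loss-free migration, assuming in-flight tuples can be buffered upstream while the task's state moves; the Worker class and routing table are illustrative stand-ins, not Storm's actual mechanism.

```python
class Worker:
    """Minimal stand-in for a worker process hosting task instances and their state."""
    def __init__(self, name):
        self.name = name
        self.state = {}  # task -> accumulated state (here: a running tuple count)

    def process(self, task, tuples):
        self.state[task] = self.state.get(task, 0) + len(tuples)

    def checkpoint(self, task):
        return self.state.pop(task)  # capture and remove the task's state

    def restore(self, task, state):
        self.state[task] = state


def migrate(task, source, target, routing, buffer):
    """Pause-drain-move-resume migration sketch: in-flight tuples are held in an
    upstream buffer while the task's state moves, so none are lost."""
    # 1. Pause: upstream stops emitting to the task; new tuples accumulate in `buffer`.
    # 2. Drain/checkpoint: capture the task's state from the source worker.
    state = source.checkpoint(task)
    # 3. Restore the state on the target worker and flip the routing entry.
    target.restore(task, state)
    routing[task] = target
    # 4. Resume: replay everything buffered during the pause on the new worker.
    routing[task].process(task, buffer)
    buffer.clear()


if __name__ == "__main__":
    a, b = Worker("vm-a"), Worker("vm-b")
    routing = {"counter": a}
    routing["counter"].process("counter", ["t1", "t2", "t3"])
    in_flight = ["t4", "t5"]              # tuples that arrived mid-migration
    migrate("counter", a, b, routing, in_flight)
    print(b.state)                        # {'counter': 5} -- no tuples lost
```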
3.
Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers. Peiro Sajjad, Hooman, January 2016
In this thesis, our goal is to enable and achieve effective and efficient real-time stream processing in a geo-distributed infrastructure by combining the power of central data centers and micro data centers. Our research focus is to address the challenges of distributing stream processing applications and placing them closer to data sources and sinks. We enable applications to run in a geo-distributed setting and provide solutions for the network-aware placement of distributed stream processing applications across geo-distributed infrastructures. First, we evaluate Apache Storm, a widely used open-source distributed stream processing system, in the community network Cloud, as an example of a geo-distributed infrastructure. Our evaluation exposes new requirements for stream processing systems to function in a geo-distributed infrastructure. Second, we propose a solution to facilitate the optimal placement of the stream processing components on geo-distributed infrastructures. We present a novel method for partitioning a geo-distributed infrastructure into a set of computing clusters, each called a micro data center. According to our results, we can increase the minimum available bandwidth in the network and, likewise, reduce the average latency to less than 50%. Next, we propose a parallel and distributed graph partitioner, called HoVerCut, for fast partitioning of streaming graphs. Since a lot of data can be represented in the form of a graph, graph partitioning can be used to assign the graph elements to different data centers to provide data locality for efficient processing. Last, we provide an approach, called SpanEdge, that enables stream processing systems to work on a geo-distributed infrastructure. SpanEdge unifies stream processing over the central and near-the-edge data centers (micro data centers). As a proof of concept, we implement SpanEdge by extending Apache Storm, enabling it to run across multiple data centers.
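To make the streaming-graph-partitioning step concrete, here is a greedy vertex-cut sketch that assigns each arriving edge to the partition already holding its endpoints, with a balance penalty for loaded partitions. This is an illustrative heuristic in the same family as streaming partitioners, not HoVerCut's actual algorithm.

```python
from collections import defaultdict

def stream_partition_edges(edges, num_partitions, balance_weight=1.0):
    """Greedy streaming edge partitioner sketch (vertex-cut style): prefer
    partitions that already hold an endpoint, penalize the most loaded ones."""
    loads = [0] * num_partitions
    replicas = defaultdict(set)   # vertex -> partitions holding a copy of it
    assignment = {}
    for u, v in edges:
        def score(p):
            locality = (p in replicas[u]) + (p in replicas[v])     # reuse existing replicas
            balance = balance_weight * loads[p] / (max(loads) + 1)  # penalize loaded partitions
            return locality - balance
        best = max(range(num_partitions), key=score)
        assignment[(u, v)] = best
        replicas[u].add(best)
        replicas[v].add(best)
        loads[best] += 1
    return assignment, replicas

if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("a", "e")]
    assignment, replicas = stream_partition_edges(edges, num_partitions=2)
    print(assignment)
    print("replicated vertices:", [v for v, ps in replicas.items() if len(ps) > 1])
```

In a geo-distributed deployment, the partitions would correspond to micro data centers, and the number of replicated vertices is a proxy for the cross-site traffic the placement incurs.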
4.
[pt] CEP DISTRIBUÍDO PARA AQUISIÇÃO E PROCESSAMENTO DE INFORMAÇÃO ADAPTATIVOS CIENTES DE CONTEXTO / [en] DISTRIBUTED CEP FOR CONTEXT-AWARE ADAPTIVE ACQUIREMENT AND PROCESSING OF INFORMATION. Fernando Benedito Veras Magalhaes, 07 June 2021
[pt] A disseminação atual da IoT aumenta a implantação de soluções de processamento de fluxo de dados para monitorar e controlar elementos do mundo real. Uma dessas soluções é o Processamento de Eventos Complexos (CEP). Inicialmente, um único computador ou cluster concentraria toda a execução do CEP. No entanto, a execução centralizada do CEP não é ideal para lidar com o alto volume, velocidade e volatilidade dos fluxos de dados dos sensores IoT. Em vez disso, as aplicações CEP devem criar e descentralizar o processamento de eventos CEP, de preferência tendo agentes CEP na nuvem e em dispositivos na borda. Além disso, tão importante quanto a descentralização, é decidir como o processamento será dividido entre esses dispositivos. Dito isso, estar ciente do contexto atual de cada dispositivo, por exemplo, sua localização e sensores disponíveis, pode ajudar a coletar e (parcialmente) processar os dados em dispositivos próximos ao local onde os dados foram produzidos. Este trabalho apresenta uma plataforma de CEP distribuído com ciência de contexto chamada Global CEP Manager (GCM). GCM é um serviço do middleware ContextNet que oferece suporte à implantação e ao rearranjo dinâmico de consultas CEP baseados em contexto para motores CEP em execução na nuvem, em dispositivos na borda estacionários e M-Hubs, que são dispositivos na borda móveis do ContextNet. O GCM usa o ContextMatcher, que também faz parte deste trabalho. ContextMatcher é um módulo para aplicações ContextNet que permite a entrega de mensagens para nós cujo contexto esteja compatível com um determinado conjunto de características contextuais.

[en] The current dissemination of IoT increases the deployment of stream processing solutions for monitoring and controlling elements of the real world. One of those solutions is Complex Event Processing (CEP). Initially, a single computer or cluster would concentrate all the CEP execution. However, a centralized execution of CEP is not suitable for coping with the high volume, velocity, and volatility of IoT sensors' data streams. Instead, applications using CEP should deploy a distributed CEP Event Processing Network, preferably having CEP agents both in the cloud and at edge devices. Also, deciding the arrangement used to split the processing among these tiers and their devices can be just as important. That said, being aware of each device's current context, for instance, its location and available sensors, can help to collect and (partially) process the data on devices close to the data's production site. This work presents a context-aware distributed CEP platform called Global CEP Manager (GCM). GCM is a service of the ContextNet middleware that supports the context-based deployment and dynamic rearrangement of CEP queries to CEP engines executing in the cloud, on stationary edge devices, and on M-Hubs, which are ContextNet's mobile edge devices. GCM uses the ContextMatcher, which is also part of this work. ContextMatcher is a module for ContextNet applications that enables the delivery of messages to nodes whose context matches a specified set of contextual requirements.
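A minimal sketch of context-based matching in the spirit of ContextMatcher: a node's context is checked against a set of contextual requirements, and only matching engines are considered for query deployment. The requirement semantics, node attributes, and function names are illustrative assumptions, not the ContextNet API.

```python
def matches(context, requirements):
    """Return True if a node's context satisfies every contextual requirement.
    A requirement maps an attribute to an exact value, a set of accepted
    values, or a predicate over the node's value."""
    for attr, required in requirements.items():
        value = context.get(attr)
        if value is None:
            return False
        if callable(required):
            if not required(value):
                return False
        elif isinstance(required, (set, frozenset)):
            if value not in required:
                return False
        elif value != required:
            return False
    return True

def eligible_nodes(nodes, requirements):
    """Select the engines (cloud, stationary edge, or mobile) whose current
    context matches, i.e., the candidates for deploying a CEP query."""
    return [name for name, ctx in nodes.items() if matches(ctx, requirements)]

if __name__ == "__main__":
    nodes = {
        "cloud-1": {"tier": "cloud",  "region": "east", "sensors": set()},
        "edge-7":  {"tier": "edge",   "region": "east", "sensors": {"temperature"}},
        "mhub-3":  {"tier": "mobile", "region": "east", "sensors": {"temperature", "gps"}},
    }
    requirements = {"region": "east", "sensors": lambda s: "temperature" in s}
    print(eligible_nodes(nodes, requirements))   # ['edge-7', 'mhub-3']
```

Re-evaluating the match as node contexts change is what would drive the dynamic rearrangement of queries described above.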