1 |
Efficient Distributed Processing Over Micro-batched Data Streams. Abdelhamid, Ahmed (10539053), 07 May 2021
Advances in real-world applications require high-throughput processing over large data streams. Micro-batching is a promising computational model to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved: the incoming data tuples are first buffered as data blocks and are then processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that conforms to the application's service-level agreement. Compared to native tuple-at-a-time data stream processing, micro-batching can sustain higher data rates. However, existing micro-batch stream processing systems lack the load-awareness optimizations necessary to maintain performance and enhance resource utilization. In this thesis, we investigate the micro-batching paradigm and pinpoint some of its design principles that can benefit from further optimization. A new data partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. Prompt enables a balanced input to the batching and processing cycles of the micro-batching model, achieving higher processing throughput with an increase in resource utilization. Moreover, Prompt+ is proposed to enforce latency by elastically adapting resource consumption to workload changes. More specifically, Prompt+ employs a scheduling strategy that supports elasticity in response to workload changes while avoiding rescheduling bottlenecks. Furthermore, we envision the use of deep reinforcement learning to efficiently partition data in distributed streaming systems. PartLy demonstrates the use of artificial neural networks to facilitate the learning of efficient partitioning policies that match the dynamic nature of streaming workloads. Finally, all the proposed techniques are abstracted and generalized over three widely used stream processing engines. Experimental results using real and synthetic data sets demonstrate that the proposed techniques are robust against fluctuations in data distribution and arrival rates, and achieve up to a 5x improvement in system throughput over state-of-the-art techniques without degradation in latency.
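To make the batching/processing interleave concrete, here is a minimal C++ sketch of a micro-batch buffer that releases a block either when it is full or when a deadline derived from the latency SLA expires. All names and sizes are illustrative assumptions, not the thesis' Prompt implementation.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

struct Tuple { int key; double value; };

class MicroBatcher {
public:
    MicroBatcher(std::size_t max_batch, std::chrono::milliseconds deadline)
        : max_batch_(max_batch), deadline_(deadline) {}

    void push(Tuple t) {
        std::lock_guard<std::mutex> lk(m_);
        buffer_.push_back(t);
        if (buffer_.size() >= max_batch_) cv_.notify_one();
    }

    // Blocks until the batch is full or the deadline passes, then swaps the
    // buffered block out so batching and processing can overlap.
    std::vector<Tuple> next_batch() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait_for(lk, deadline_, [&] { return buffer_.size() >= max_batch_; });
        std::vector<Tuple> batch;
        batch.swap(buffer_);
        return batch;  // processed collectively, e.g. by a Map-Reduce stage
    }

private:
    std::size_t max_batch_;
    std::chrono::milliseconds deadline_;
    std::vector<Tuple> buffer_;
    std::mutex m_;
    std::condition_variable cv_;
};
```

The swap keeps the producer side non-blocking while a batch is being processed, which is the interleaving the abstract describes.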
|
2 |
Streaming Ray Tracer on GPU. Dvořák, Jakub, January 2008
Current consumer GPUs can be used as high-performance stream processors and are a tempting platform on which to implement raytracing. In this paper I briefly present raytracing principles and the methods used to accelerate it, the modern GPU's programmable pipeline, and examples of its use. I describe stream processing in general and the available interfaces that enable the use of the GPU as a stream processor. I then present my GPU raytracer implementation, the algorithms used, and the experiments I have made.
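The stream-processing view of raytracing mentioned above amounts to mapping a pure, side-effect-free kernel over a stream of rays. The following CPU-side C++ sketch illustrates that structure with a ray-sphere intersection kernel; it is an illustration of the model, not this thesis' GPU code.

```cpp
#include <cmath>
#include <optional>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };        // dir assumed normalized
struct Sphere { Vec3 center; float radius; };

static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }

// The "kernel": one ray in, nearest hit distance out, no side effects.
std::optional<float> intersect(const Ray& r, const Sphere& s) {
    Vec3 oc = sub(r.origin, s.center);
    float b = dot(oc, r.dir);
    float c = dot(oc, oc) - s.radius * s.radius;
    float disc = b * b - c;
    if (disc < 0.0f) return std::nullopt;
    float t = -b - std::sqrt(disc);       // nearer root of t^2 + 2bt + c = 0
    if (t > 0.0f) return t;
    return std::nullopt;
}

// Stream processing: apply the kernel independently to every ray.
std::vector<std::optional<float>> trace(const std::vector<Ray>& rays,
                                        const Sphere& s) {
    std::vector<std::optional<float>> hits;
    hits.reserve(rays.size());
    for (const Ray& r : rays) hits.push_back(intersect(r, s));
    return hits;
}
```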
|
3 |
Compiling Data Dependent Control Flow on SIMD GPUs. Popa, Tiberiu, January 2004
Current Graphics Processing Units (GPUs) (circa 2003/2004) have programmable vertex and fragment units. These units are often implemented as SIMD processors employing parallel pipelines. Data-dependent conditional execution on SIMD architectures implemented using processor idling is inefficient. I propose a multi-pass approach based on conditional streams which allows dynamic load balancing of the fragment units of the GPU and better theoretical performance on programs using data-dependent conditionals and loops. The proposed system can be used to turn the fragment unit of a SIMD GPU into a stream processor with data-dependent control flow.
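A hedged sketch of the conditional-stream idea: instead of letting SIMD lanes idle on a divergent branch, one pass splits the element stream by the branch predicate into two dense streams, and each stream is then processed uniformly in its own pass. The names below are illustrative only, not the thesis' compiler output.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Split a stream by a predicate into two dense (compacted) streams.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>>
split_stream(const std::vector<T>& in, Pred taken) {
    std::vector<T> then_stream, else_stream;
    for (const T& e : in)
        (taken(e) ? then_stream : else_stream).push_back(e);
    return {then_stream, else_stream};
}

// Multi-pass execution: every element in a pass follows the same control
// path, so a SIMD unit runs each pass at full occupancy.
template <typename T>
void run_conditional(const std::vector<T>& in,
                     std::function<bool(const T&)> cond,
                     std::function<void(const T&)> then_pass,
                     std::function<void(const T&)> else_pass) {
    auto [t, e] = split_stream(in, cond);
    for (const T& x : t) then_pass(x);  // pass 1: "then" branch only
    for (const T& x : e) else_pass(x);  // pass 2: "else" branch only
}
```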
|
4 |
An Advanced Volume Raycasting Technique Using GPU Stream Processing. Mensmann, Jörg; Ropinski, Timo; Hinrichs, Klaus, January 2010
GPU-based raycasting is the state-of-the-art rendering technique for interactive volume visualization. The ray traversal is usually implemented in a fragment shader, utilizing the hardware in a way that was not originally intended. New programming interfaces for stream processing, such as CUDA, support a more general programming model and the use of additional device features, which are not accessible through traditional shader programming. In this paper we propose a slab-based raycasting technique that is modeled specifically to use these features to accelerate volume rendering. This technique is based on experience gained from comparing fragment shader implementations of basic raycasting to implementations directly translated to CUDA kernels. The comparison covers direct volume rendering with a variety of optional features, e.g., gradient and lighting calculations. Our findings are supported by benchmarks of typical volume visualization scenarios. We conclude that new stream processing models can only gain a small performance advantage when directly porting the basic raycasting algorithm. However, they can be advantageous through novel acceleration methods which use the hardware features not available to shader implementations.
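The inner loop that both the fragment-shader and CUDA variants share is front-to-back compositing with early ray termination; the slab-based technique advances a block of rays through one slab of the volume at a time so intermediate results can live in fast on-chip memory. A minimal sketch of that compositing loop, with the volume lookup and transfer function left abstract (all names are our own assumptions):

```cpp
struct RGBA { float r, g, b, a; };

// sample(t) is assumed to return the classified color/opacity at distance t
// along the ray (volume lookup + transfer function application).
template <typename SampleFn>
RGBA composite_ray(SampleFn sample, float t_start, float t_end, float dt) {
    RGBA acc{0.f, 0.f, 0.f, 0.f};
    for (float t = t_start; t < t_end; t += dt) {
        RGBA s = sample(t);
        float w = (1.f - acc.a) * s.a;  // weight by remaining transparency
        acc.r += w * s.r;
        acc.g += w * s.g;
        acc.b += w * s.b;
        acc.a += w;
        if (acc.a > 0.99f) break;       // early ray termination
    }
    return acc;
}
```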
|
5 |
BallWorld: A Framework for Learning Statistical Inference and Stream Processing. Ravali, Yeluri, January 2017
No description available.
|
6 |
Performance Optimization of Persistent Data Management in DSP Architectures for Live Analysis of Sensor Data (Leistungsoptimierung der persistenten Datenverwaltung in DSP-Architekturen zur Live-Analyse von Sensordaten). Weißbach, Manuel, 28 October 2021
Due to the ever-growing amount of data to be processed in many domains, big-data applications have become increasingly widespread in recent years. As early as 2011, Twitter reported examining 15 million URLs per day in real time to prevent the spread of spam links [1]. Facebook processes more than four million "Like" clicks per minute and manages more than 300 petabytes of data [2]. In 2011, the business portal LinkedIn delivered around one billion messages per day; by 2015, according to the company, this had already grown to 1.1 trillion messages sent daily [3]. This sharp rise reflects the exponential growth that is typical of big data.
Gartner defines the term "big data" by its specific properties, commonly referred to as the "three V's": volume, variety, and velocity [4]. Besides the enormous volume of data to be processed ("volume") and its diversity and lack of structure ("variety"), the speed at which the data is generated ("velocity") is thus an essential characteristic of big data [5, 6]. If a processing backlog is to be avoided despite the constant and ever-accelerating generation of new data, it follows that the continuously growing data volumes must also be processed ever faster.
|
7 |
SmartCell: An Energy Efficient Reconfigurable Architecture for Stream Processing. Liang, Cao, 04 May 2009
Data streaming applications, such as signal processing and multimedia, often require high computing capacity yet also have stringent power constraints, especially in portable devices. General-purpose processors can no longer meet these requirements due to their sequential software execution. Although fixed-logic ASICs are usually able to achieve the best performance and energy efficiency, ASIC solutions are expensive to design and their lack of flexibility makes them unable to accommodate functional changes or new system requirements. Reconfigurable systems have long been proposed to bridge the gap between the flexibility of software processors and the performance of hardware circuits. Unfortunately, mainstream reconfigurable FPGA designs suffer from high costs in area, power consumption, and speed due to the routing area overhead and timing penalty of their bit-level fine granularity.

In this dissertation, we present the architecture design, application mapping, and performance evaluation of a novel coarse-grained reconfigurable architecture, named SmartCell, for data streaming applications. The system tiles a large number of computing cell units in a 2D mesh structure, with four coarse-grained processing elements developed inside each cell to form a quad structure. Based on this structure, a hierarchical reconfigurable network is developed to provide flexible on-chip communication among computing resources, including a fully connected crossbar, nearest-neighbor connections, and a clustered mesh network. SmartCell can be configured to operate in various computing modes, including SIMD, MIMD, and systolic array styles, to fit different application requirements. The coarse-grained SmartCell has the potential to improve power and energy efficiency compared with fine-grained FPGAs. It is also able to provide high performance comparable to fixed-function ASICs through deep pipelining and a large amount of computing parallelism. Dynamic reconfiguration is also addressed in this dissertation.

To evaluate its performance, a set of benchmark applications has been successfully mapped onto the SmartCell system, ranging from signal processing and multimedia applications to scientific computing and data encryption. A 4 by 4 SmartCell prototype system was initially designed in CMOS standard cell ASIC with a 130 nm process. The chip occupies 8.2 mm² and dissipates 1.6 mW/MHz under full operation. The results show that SmartCell can bridge the performance and flexibility gap between logic-specific ASICs and reconfigurable FPGAs. SmartCell is also about 8% and 69% more energy efficient, and achieves 4x and 2x throughput gains, compared with the Montium and RaPiD CGRAs.

Based on our first SmartCell prototype experiences, an improved SmartCell-II architecture was developed, which includes distributed data memory, a segmented instruction format, and improved dynamic configuration schemes. A novel parallel FFT algorithm with balanced workloads and optimized data flow was also proposed and successfully mapped onto SmartCell-II for performance evaluation. A 4 by 4 SmartCell-II prototype was then synthesized into standard cell ASICs with a 90 nm process. The results show that SmartCell-II consists of 2.0 million gates and is fully functional at up to 295 MHz with 3.1 mW/MHz power consumption. SmartCell-II is about 3.6 and 28.9 times more energy efficient than a Xilinx FPGA and TI's high-performance DSPs, respectively.
It is concluded that the SmartCell is able to provide a promising solution to achieve high performance and energy efficiency for future data streaming applications.
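As a rough illustration of the tiled 2D mesh with nearest-neighbor communication described above, the following toy C++ model steps a grid of coarse-grained processing elements, each applying its configured operation and forwarding its result one hop per cycle in systolic fashion. It is an assumption-laden sketch of the computing pattern only, not the SmartCell hardware or its network.

```cpp
#include <functional>
#include <vector>

struct PE {
    std::function<int(int)> op;  // the "configured" coarse-grained operation
    int reg = 0;                 // local register holding streamed data
};

class Mesh {
public:
    Mesh(int rows, int cols)
        : rows_(rows), cols_(cols),
          pes_(rows * cols, PE{[](int x) { return x; }, 0}) {}

    PE& at(int r, int c) { return pes_[r * cols_ + c]; }

    // One cycle: compute everywhere, then shift results one hop to the
    // right (one nearest-neighbor pattern out of the hierarchical network).
    void step(int input_to_row0) {
        std::vector<int> out(pes_.size());
        for (int r = 0; r < rows_; ++r)
            for (int c = 0; c < cols_; ++c)
                out[r * cols_ + c] = at(r, c).op(at(r, c).reg);
        for (int r = 0; r < rows_; ++r)
            for (int c = cols_ - 1; c > 0; --c)
                at(r, c).reg = out[r * cols_ + (c - 1)];
        at(0, 0).reg = input_to_row0;  // stream new data into the mesh
    }

private:
    int rows_, cols_;
    std::vector<PE> pes_;
};
```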
|
8 |
A Distributed Framework for Situation Awareness on Camera Networks. Hong, Kirak, 27 August 2014
With the proliferation of cameras and advanced video analytics, situation awareness applications that automatically generate actionable knowledge from live camera streams have become an important class of applications in various domains, including surveillance, marketing, sports, health care, and traffic monitoring. However, despite the wide range of use cases, developing these applications on large-scale camera networks is extremely challenging because it involves both compute- and data-intensive workloads, has latency-sensitive quality-of-service requirements, and deals with inherent dynamism (e.g., the number of faces detected in a certain area) from the real world. To support the development of large-scale situation awareness applications, this dissertation presents a distributed framework that makes two key contributions: 1) it provides a programming model that ensures the scalability of applications, and 2) it supports low-latency computation and dynamic workload handling through opportunistic event processing and workload distribution over different locations and levels of the network hierarchy.
To provide a scalable programming model, two programming abstractions for different levels of application logic are proposed: the first at the level of real-time target detection and tracking, and the second for answering spatio-temporal queries at a higher level. The first abstraction, Target Container (TC), elevates the target to a first-class citizen, allowing domain experts to simply provide handlers for the detection, tracking, and comparison of targets. With those handlers, the TC runtime system performs priority-aware scheduling to ensure real-time tracking of important targets when resources are insufficient to track all targets. The second abstraction, Spatio-temporal Analysis (STA), lets applications answer queries related to space, time, and occupants using a global state transition table and probabilistic events. To ensure scalability, STA bounds the communication overhead of state updates by providing tuning parameters for the information propagation among distributed workers.
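A hedged sketch of what a Target Container-style programming model might look like: the domain expert supplies detection/tracking/comparison handlers, and the runtime tracks targets by priority when the per-frame budget is tight. All names here are illustrative assumptions, not the framework's real API.

```cpp
#include <queue>
#include <vector>

struct Frame {};
struct Target { int id; double priority; /* appearance model, bbox, ... */ };

struct TCHandlers {
    virtual std::vector<Target> detect(const Frame& f) = 0;
    virtual void track(Target& t, const Frame& f) = 0;
    // True if a and b are the same real-world target (for merging).
    virtual bool same(const Target& a, const Target& b) = 0;
    virtual ~TCHandlers() = default;
};

// Priority-aware scheduling: when the per-frame budget cannot cover all
// targets, only the most important ones are tracked this frame.
void schedule_frame(TCHandlers& h, const Frame& f,
                    std::vector<Target>& targets, std::size_t budget) {
    auto cmp = [](const Target* a, const Target* b) {
        return a->priority < b->priority;  // max-heap on priority
    };
    std::priority_queue<Target*, std::vector<Target*>, decltype(cmp)> q(cmp);
    for (auto& t : targets) q.push(&t);
    for (std::size_t i = 0; i < budget && !q.empty(); ++i) {
        h.track(*q.top(), f);
        q.pop();
    }
}
```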
The second part of this work explores two optimization strategies that reduce latency for stream processing and handle dynamic workloads. The first strategy, an opportunistic event processing mechanism, performs event processing on predicted locations to provide just-in-time situational information to mobile users. Since location prediction algorithms are inherently inaccurate, the system selects multiple regions using a greedy algorithm to provide highly meaningful information with a given amount of computing resources. The second strategy is to distribute the application workload over computing resources placed at different locations and at various levels of the network hierarchy. To support this strategy, the framework provides hierarchical communication primitives and a decentralized resource discovery protocol that allow scalable and highly adaptive load balancing over space and time.
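The greedy region selection mentioned above can be sketched as a budgeted pick of candidate regions by expected utility per unit cost. The utility and cost fields below are assumptions for illustration, not quantities defined in the dissertation.

```cpp
#include <algorithm>
#include <vector>

struct Region { int id; double utility; double cost; };

std::vector<Region> select_regions(std::vector<Region> candidates,
                                   double budget) {
    // Greedy by utility density, a standard heuristic for budgeted coverage.
    std::sort(candidates.begin(), candidates.end(),
              [](const Region& a, const Region& b) {
                  return a.utility / a.cost > b.utility / b.cost;
              });
    std::vector<Region> chosen;
    double spent = 0.0;
    for (const Region& r : candidates) {
        if (spent + r.cost <= budget) {
            chosen.push_back(r);  // process events for this predicted region
            spent += r.cost;
        }
    }
    return chosen;
}
```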
|
9 |
Efficient Multi-Core Implementation of the IPsec Encapsulating Security Payload Protocol for a Single Security Association. Hellsing, Mattias; Odervall, Albin, January 2018
As mobile Internet traffic increases, the workload of the base stations processing this traffic increases with it. To cope with this, the telecommunication providers responsible for the systems deployed in these base stations have looked to parallelism. This, together with the fact that these providers have a vested interest in protecting their users' data from potential attackers, means that there is a need for efficient parallel packet processing software that handles encryption as well as authentication. A well-known protocol for the encryption and authentication of IP packets is the Encapsulating Security Payload (ESP) protocol of the IPsec protocol suite. IPsec establishes simplex connections, called Security Associations (SAs), between entities that wish to communicate. This thesis investigates a special case of this problem in which the work of encrypting and authenticating the packets within a single SA is parallelized. The problem was investigated by developing and comparing two multi-threaded implementations, based respectively on the Eventdev event-driven programming library and the ring buffer libraries of the Data Plane Development Kit (DPDK). An additional Eventdev-based implementation was investigated which schedules linked lists of packets, instead of single packets, in an attempt to reduce the overhead of scheduling packets to the worker cores. These implementations were then evaluated in terms of throughput, latency, speedup, and last-level cache miss rates. The results showed that the ring-buffer-based implementation performed best on all metrics, while the single-packet-scheduling Eventdev-based implementation was outperformed by the one using linked lists of packets. It was shown that packet generation, which was done by the receiving core, was the main limiting factor for all implementations. In addition, memory resources such as the memory bus, the memory controller, and the prefetching hardware were shown to be a likely area of contention and a possible bottleneck as the packet generation rate increases. The conclusion drawn from this was that a parallelized packet retrieval solution such as Receive Side Scaling (RSS), together with minimized memory resource contention, is necessary to further improve performance.
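The ring-buffer design that performed best can be sketched in plain C++ as follows: one RX core assigns the single SA's ESP sequence numbers in order, then spreads packets round-robin onto per-worker rings, and each worker dequeues, encrypts, and authenticates its share independently. This is a simplified stand-in under our own assumptions, not the actual DPDK rte_ring API or the thesis code.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>
#include <vector>

struct Packet { uint32_t esp_seq; std::vector<uint8_t> payload; };

// Bounded single-producer/single-consumer ring, one per worker core.
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buf_(capacity) {}

    bool enqueue(Packet p) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        std::size_t t = tail_.load(std::memory_order_acquire);
        if (h - t == buf_.size()) return false;          // ring full
        buf_[h % buf_.size()] = std::move(p);
        head_.store(h + 1, std::memory_order_release);
        return true;
    }

    std::optional<Packet> dequeue() {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return std::nullopt;
        Packet p = std::move(buf_[t % buf_.size()]);
        tail_.store(t + 1, std::memory_order_release);
        return p;  // worker then encrypts + authenticates this packet
    }

private:
    std::vector<Packet> buf_;
    std::atomic<std::size_t> head_{0}, tail_{0};
};

// RX core: sequence numbers for the single SA are assigned here, in order,
// before the packets are distributed round-robin across the workers.
void rx_distribute(Packet p, std::vector<SpscRing>& rings,
                   uint32_t& next_seq, std::size_t& rr) {
    p.esp_seq = next_seq++;
    while (!rings[rr % rings.size()].enqueue(p)) { /* back-pressure: retry */ }
    ++rr;
}
```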
|