Global ETD Search

1	Leistungsoptimierung der persistenten Datenverwaltung in DSP-Architekturen zur Live-Analyse von Sensordaten Weißbach, Manuel 28 October 2021 Because the volume of data to be processed keeps growing in many fields, big data applications have become increasingly widespread in recent years. As early as 2011, Twitter stated that it examined 15 million URLs per day in real time to curb the spread of spam links [1]. Facebook processes more than four million "Like" clicks per minute and manages more than 300 petabytes of data [2]. The business portal LinkedIn delivered around one billion messages per day in 2011; by 2015, according to the company, the figure had reached 1.1 trillion messages sent daily [3]. This steep rise reflects the exponential growth that is typical of big data. Gartner defines the term "big data" by its specific properties, also known as the "three Vs": "Volume", "Variety", and "Velocity" [4]. Besides the enormous amount of data to be processed ("Volume") and its diversity and lack of structure ("Variety"), the speed at which the data is generated ("Velocity") is therefore also an essential characteristic of big data [5, 6]. If a processing backlog is to be avoided despite the constant and ever faster generation of new data, the continuously growing data volumes must also be processed ever faster. info:eu-repo/classification/ddc/004 ddc:004
2	Efficient Multi-Core Implementation of the IPsec Encapsulating Security Payload Protocol for a Single Security Association / Effektiv, flerkärnig implementation av IPsec Encapsulating Security Payload protokollet för en Security Association Hellsing, Mattias, Albin, Odervall January 2018 As mobile Internet traffic increases, so does the workload of the base stations that process it. To cope with this, the telecommunication providers responsible for the systems deployed in these base stations have turned to parallelism. Because these providers also have a vested interest in protecting their users' data from potential attackers, there is a need for efficient parallel packet processing software that handles encryption as well as authentication. A well-known protocol for encryption and authentication of IP packets is the Encapsulating Security Payload (ESP) protocol of the IPsec protocol suite. IPsec establishes simplex connections, called Security Associations (SAs), between entities that wish to communicate. This thesis investigates a special case of this problem in which the work of encrypting and authenticating the packets within a single SA is parallelized. The problem was investigated by developing and comparing two multi-threaded implementations, one based on Eventdev, an event-driven programming library, and one based on the ring buffer library, both from the Data Plane Development Kit (DPDK). An additional Eventdev-based implementation was also investigated, which schedules linked lists of packets instead of single packets in an attempt to reduce the overhead of scheduling packets to the worker cores. These implementations were then evaluated in terms of throughput, latency, speedup, and last-level cache miss rates.
The results showed that the ring buffer-based implementation performed the best in all metrics while the single packet-scheduling Eventdev-based implementation was outperformed by the one using linked lists of packets. It was shown that the packet generation, which was done by the receiving core, was the main limiting factor for all implementations. In addition, the memory resources such as the memory bus, memory controller and prefetching hardware were shown to likely be an area of contention and a possible bottleneck as the packet generation rate increases. The conclusion drawn from this was that a parallelized packet retrieval solution such as Receive Side Scaling (RSS) together with minimizing memory resource contention is necessary to further improve performance. Computer Sciences Datavetenskap (datalogi)
3	Implementierung und Evaluierung einer Verarbeitung von Datenströmen im Big Data Umfeld am Beispiel von Apache Flink Oelschlegel, Jan 17 May 2021 Data stream processing is increasingly moving into focus in the construction of modern big data infrastructures. The industry partner of this master's thesis, integrationfactory GmbH & Co. KG, wants to expand its big data business so that, as a consultancy, it can support its customers in this area as well. From the outset, the focus was on Apache Flink, an emerging stream processing framework. The goal of this thesis is to implement several typical use cases of the company with Flink and then evaluate them. To this end, the central problem is first stated and the objectives are derived from it. Important basic terms and concepts are then introduced to aid understanding. A separate chapter is devoted to the framework to give the reader a comprehensive yet compact insight into Flink. Various sources were consulted for this, and direct contact was also established with active developers of the framework. This made it possible to clarify matters that the primary sources had left unclear for lack of information, and to incorporate them into the chapter. The main part of the thesis implements the defined use cases. The DataStream API and FlinkSQL are used, and their selection is justified. The programmed jobs are executed in the company's own big data lab, a virtualized environment for testing technologies. As the central problem of this master's thesis, both interfaces are evaluated for their suitability for the use cases.
Based on the knowledge from the fundamentals chapters and the experience gained while developing the jobs, evaluation criteria are established using the Analytic Hierarchy Process. This is followed by an assessment and a classification of the result. Contents: 1. Introduction 1.1. Motivation 1.2. Problem Statement 1.3. Objectives 2. Fundamentals 2.1. Definitions of Terms 2.1.1. Big Data 2.1.2. Bounded vs. Unbounded Streams 2.1.3. Stream vs. Table 2.2. Stateful Stream Processing 2.2.1. History 2.2.2. Requirements 2.2.3. Pattern Types 2.2.4. How Stateful Data Stream Processing Works 3. Apache Flink 3.1. History 3.2. Architecture 3.3. Time-Dependent Processing 3.4. Data Types and Serialization 3.5. State Management 3.6. Checkpoints and Recovery 3.7. Programming Interfaces 3.7.1. DataStream API 3.7.2. FlinkSQL & Table API 3.7.3. Integration with Hive 3.8. Deployment and Operation 4. Implementation 4.1. Development Environment 4.2. Server Environment 4.3. Configuration of Flink 4.4. Source Data 4.5. Use Cases 4.6. Implementation as Flink Jobs 4.6.1. DataStream API 4.6.2. FlinkSQL 4.7. Review of the Results 5. Evaluation 5.1. Analytic Hierarchy Process 5.1.1. Procedure and Methodology 5.1.2. Phase 1: Problem Statement 5.1.3. Phase 2: Structure of the Criteria 5.1.4. Phase 3: Construction of the Comparison Matrices 5.1.5. Phase 4: Evaluation of the Alternatives 5.2. Assessment of the AHP 6. Conclusion and Outlook 6.1. Conclusion 6.2. Outlook
4	Scalable Validation of Data Streams Xu, Cheng January 2016 In manufacturing industries, sensors are often installed on industrial equipment, generating high volumes of data in real time. To shorten machine downtime and reduce maintenance costs, it is critical to analyze this kind of stream efficiently in order to detect abnormal behavior of equipment. For validating data streams to detect anomalies, a data stream management system called SVALI is developed. Based on requirements of the application domain, different stream window semantics are explored and an extensible set of window forming functions is implemented, where dynamic registration of window aggregations allows incremental evaluation of aggregate functions over windows. To facilitate stream validation on a high level, the system provides two second-order system validation functions, model-and-validate and learn-and-validate. Model-and-validate allows the user to define mathematical models based on physical properties of the monitored equipment, while learn-and-validate builds statistical models by sampling the stream in real time as it flows. To validate geographically distributed equipment with short response time, SVALI is a distributed system where many SVALI instances can be started and run in parallel on board the equipment. Central analyses are made at a monitoring center where streams of detected anomalies are combined and analyzed on a cluster computer. SVALI is an extensible system where functions can be implemented using external libraries written in C, Java, and Python without any modifications of the original code. The system and the developed functionality have been applied in several applications, both industrial and for sports analytics. Data Stream Management Distributed Data Stream Processing Data Stream Validation Anomaly Detection
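The incremental evaluation of window aggregates mentioned in this abstract can be illustrated with a minimal Python sketch. It is not SVALI code: the names `SlidingMean` and `validate`, the count-based window, and the fixed `tolerance` are assumptions made purely for illustration. The running sum is updated as elements enter and leave the window, so the mean never requires a re-scan, and readings that stray too far from the mean of the preceding window are reported.

```python
from collections import deque


class SlidingMean:
    """Count-based sliding window whose mean is maintained incrementally."""

    def __init__(self, size):
        self.size = size
        self.items = deque()
        self.total = 0.0

    def add(self, value):
        # Add the new element and evict the oldest one once the window is
        # full, updating the running sum instead of re-scanning the window.
        self.items.append(value)
        self.total += value
        if len(self.items) > self.size:
            self.total -= self.items.popleft()
        return self.total / len(self.items)


def validate(stream, size, tolerance):
    """Yield (index, value) for readings that deviate from the mean of the
    preceding window by more than `tolerance` (hypothetical rule)."""
    window = SlidingMean(size)
    mean = None
    for i, value in enumerate(stream):
        if mean is not None and abs(value - mean) > tolerance:
            yield i, value  # anomalous readings are reported, not learned from
            continue
        mean = window.add(value)


if __name__ == "__main__":
    readings = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0]
    print(list(validate(readings, size=3, tolerance=2.0)))
```

Keeping anomalous readings out of the window is one possible design choice; it prevents a single outlier from shifting the learned mean.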
5	Quality-of-Service-Aware Data Stream Processing Schmidt, Sven 21 March 2007 Data stream processing has gained more and more importance in recent years, in industry as well as in academia. Consider the monitoring of industrial processes as an example. There, sensors are mounted to gather large amounts of data within a short time. Storing and post-processing these data may be useless or even impossible. On the one hand, only a small part of the monitored data is relevant. To use the storage capacity efficiently, only a preselection of the data should be considered. On the other hand, the volume of incoming data may be too high to be stored in time or, in other words, the technical effort for storing the data in time would be out of proportion. Processing data streams in the context of this thesis means applying database operations to the stream on the fly (without explicitly storing the data). The challenge of this task lies in the limited amount of resources, while data streams are potentially infinite. Furthermore, data stream processing must be fast and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide a so-called Quality-of-Service (QoS) for the data stream processing task. To this end, adequate QoS metrics such as maximum output delay or minimum result data rate are defined. Thereafter, a cost model for deriving the required processing resources from the specified QoS is presented. On that basis, the stream processing operations are scheduled. Depending on the required QoS and on the available resources, the weight can be shifted among the individual resources and QoS metrics, respectively. Calculating and scheduling resources requires a lot of expert knowledge regarding the characteristics of the stream operations and regarding the incoming data streams.
Often, this knowledge is based on experience and thus, a revision of the resource calculation and reservation becomes necessary from time to time. This leads to occasional interruptions of the continuous data stream processing, of the delivery of the result, and thus, of the negotiated Quality-of-Service. The proposed robustness concept supports the user and facilitates a decrease in the number of interruptions by providing more resources. data stream processing quality-of-service robustness Datenstromverarbeitung Qualität Robustheit ddc:004 rvk:ST 274 Datenstrom Datenverarbeitung Dienstgüte
6	Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing Ji, Yuanzhen 27 March 2018 Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including financial markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency has promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of data imperfection within the streams or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated, or even eliminated, by enhancing the DSPS itself. This dissertation seeks to advance the state of the art in handling the performance versus result-quality tradeoffs in data stream processing that arise from these two causes. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder.
Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account. Datenstromverarbeitung Data Stream Processing ddc:004 rvk:ST 234 rvk:ST 277 rvk:ST 265
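The latency versus result-quality tradeoff of buffer-based disorder handling can be made concrete with a small Python sketch. It does not reproduce QDDH: it assumes a fixed slack chosen by the caller, whereas the abstract describes a quality-driven approach. The function name `reorder` and the `(timestamp, payload)` tuple layout are hypothetical. A larger `slack` holds tuples back longer (higher latency) but drops fewer late arrivals (higher quality).

```python
import heapq


def reorder(stream, slack):
    """Emit (timestamp, payload) tuples in timestamp order using a fixed slack.

    A tuple is released once the highest timestamp seen so far exceeds its own
    timestamp by at least `slack`. Tuples that arrive later than that are
    dropped here, which is where result quality is lost.
    """
    heap = []
    max_ts = float("-inf")
    released_ts = float("-inf")
    for ts, payload in stream:
        if ts < released_ts:
            continue  # too late: a tuple with a larger timestamp was already emitted
        max_ts = max(max_ts, ts)
        heapq.heappush(heap, (ts, payload))
        while heap and heap[0][0] <= max_ts - slack:
            released_ts, out = heapq.heappop(heap)
            yield released_ts, out
    while heap:  # end of a bounded input: flush what is left
        yield heapq.heappop(heap)


if __name__ == "__main__":
    arrivals = [(1, "a"), (3, "b"), (2, "c"), (6, "d"), (4, "e"), (5, "f")]
    print(list(reorder(arrivals, slack=2)))  # all six tuples, in timestamp order
    print(list(reorder(arrivals, slack=0)))  # no buffering: late tuples are dropped
```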
7	Elastic Data Stream Processing Heinze, Thomas 27 October 2021 Data stream processing systems are used to process data from high-velocity data sources such as financial, sensor, or logistics data. Many use cases force these systems to use a distributed setup to fulfill the strict requirements regarding expected system throughput and end-to-end latency. The major challenge for a distributed data stream processing system is unpredictable load peaks. Most systems use overprovisioning to solve this problem, which leads to low system utilization and high monetary cost for the user. This doctoral thesis studies a potential solution to this problem: automatically scaling in or out based on the changing workload. This approach is called elastic scaling and allows a cost-efficient execution of the system with a high quality of service. In this thesis, we present our elastic scaling data stream processing system FUGU and address three major challenges of such systems: 1) consideration of user-defined end-to-end latency constraints during the elastic scaling, 2) study of different auto-scaling techniques, and 3) combination of elastic scaling with different fault tolerance techniques. First, we demonstrate how our system considers user-defined end-to-end latency constraints during the scaling decisions. Each scaling decision causes short latency peaks, because the processing needs to be paused while operators are moved. FUGU estimates the latency peaks for different scaling decisions and tries to minimize the resulting latency peak while achieving monetary costs similar to those of alternative approaches. Second, we study different auto-scaling techniques for elastic-scaling data stream processing systems. Auto-scaling techniques are a very important part of such systems as they derive the scaling decisions.
In this thesis, we study three auto-scaling techniques: Threshold-based Scaling, Reinforcement Learning, and the novel Online Parameter Optimization. The Online Parameter Optimization overcomes the shortcomings of the two other approaches by avoiding manual tuning and being robust towards different workload patterns. Finally, we present an integration of elastic scaling with different replication techniques for high availability, which minimizes the monetary cost while guaranteeing a maximum recovery time. We leverage two replication approaches in FUGU and evaluate the trade-off between recovery time and overhead. FUGU estimates the recovery time and adaptively optimizes the replication technique used for each operator. All these contributions are carefully evaluated in three real-world scenarios, and we discuss the relationship of our contributions to related work. Datenströme, Elastische Skalierung info:eu-repo/classification/ddc/004 ddc:004
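Of the three auto-scaling techniques named above, threshold-based scaling is the simplest to sketch. The Python function below is an illustration only and is not taken from FUGU: the function name, the utilization measure, and the default thresholds `low` and `high` are assumptions. It also shows why the abstract mentions manual tuning as a drawback, since the thresholds have to be chosen by hand for each workload.

```python
def scaling_decision(utilization, hosts, low=0.3, high=0.8, min_hosts=1):
    """Return the new host count for one threshold-based scaling step.

    `utilization` is the average load of the current hosts in [0, 1];
    `low` and `high` are hand-tuned thresholds (hypothetical defaults).
    """
    if utilization > high:
        return hosts + 1  # scale out: the system is overloaded
    if utilization < low and hosts > min_hosts:
        return hosts - 1  # scale in: resources are being wasted
    return hosts  # within the target band: keep the current setup


if __name__ == "__main__":
    hosts = 2
    for load in [0.9, 0.85, 0.5, 0.2, 0.1]:
        hosts = scaling_decision(load, hosts)
        print(load, hosts)
```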
8	Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing Ji, Yuanzhen 28 November 2017 Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including financial markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency has promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of data imperfection within the streams or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated, or even eliminated, by enhancing the DSPS itself. This dissertation seeks to advance the state of the art in handling the performance versus result-quality tradeoffs in data stream processing that arise from these two causes. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder.
Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account. info:eu-repo/classification/ddc/004 ddc:004 Datenstromverarbeitung Data Stream Processing
9	Hardware Utilisation Techniques for Data Stream Processing Meldrum, Max January 2019 Recent years have seen an increase in the use of the stream processing architecture to compose continuous analytics applications. This thesis presents the design of a Rust-based stream processor that adopts two separate techniques to tackle existing weaknesses in modern production-grade stream processors. The first technique employs a data analytics language on top of the streaming runtime in order to provide both dataflow and low-level compiler optimisations. This technique is motivated by an analysis of the impact that the lack of compiler integration may have on the end-to-end performance of streaming pipelines in Apache Flink. In the second technique, streaming operators are scheduled using a task-parallel approach to boost performance for skewed data distributions. The experimental results for data-parallel streaming pipelines in this thesis demonstrate that the scheduling model of the prototype achieves performance improvements in skewed scenarios without exhibiting any significant losses in performance under uniform distributions.
Data Analytics Data Stream Processing Distributed Systems Computer and Information Sciences Data- och informationsvetenskap
10	Quality-of-Service-Aware Data Stream Processing Schmidt, Sven 13 March 2007 Data stream processing has gained more and more importance in recent years, in industry as well as in academia. Consider the monitoring of industrial processes as an example. There, sensors are mounted to gather large amounts of data within a short time. Storing and post-processing these data may be useless or even impossible. On the one hand, only a small part of the monitored data is relevant. To use the storage capacity efficiently, only a preselection of the data should be considered. On the other hand, the volume of incoming data may be too high to be stored in time or, in other words, the technical effort for storing the data in time would be out of proportion. Processing data streams in the context of this thesis means applying database operations to the stream on the fly (without explicitly storing the data). The challenge of this task lies in the limited amount of resources, while data streams are potentially infinite. Furthermore, data stream processing must be fast and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide a so-called Quality-of-Service (QoS) for the data stream processing task. To this end, adequate QoS metrics such as maximum output delay or minimum result data rate are defined. Thereafter, a cost model for deriving the required processing resources from the specified QoS is presented. On that basis, the stream processing operations are scheduled. Depending on the required QoS and on the available resources, the weight can be shifted among the individual resources and QoS metrics, respectively. Calculating and scheduling resources requires a lot of expert knowledge regarding the characteristics of the stream operations and regarding the incoming data streams.
Often, this knowledge is based on experience and thus, a revision of the resource calculation and reservation becomes necessary from time to time. This leads to occasional interruptions of the continuous data stream processing, of the delivery of the result, and thus, of the negotiated Quality-of-Service. The proposed robustness concept supports the user and facilitates a decrease in the number of interruptions by providing more resources. info:eu-repo/classification/ddc/004 ddc:004
