1

Quality-of-Service-Aware Data Stream Processing

Schmidt, Sven 21 March 2007 (has links) (PDF)
Data stream processing has gained steadily increasing importance in both industry and academia in recent years. Consider the monitoring of industrial processes as an example: sensors gather large volumes of data within short time spans, and storing and post-processing these data may be useless or even impossible. On the one hand, only a small part of the monitored data is relevant, so to use storage capacity efficiently, only a preselection of the data should be kept. On the other hand, the volume of incoming data may simply be too high to be stored in time; in other words, the technical effort required to store the data in time would be out of scale. Processing data streams in the context of this thesis means applying database operations to the stream on the fly, without explicitly storing the data. The challenge lies in the limited amount of available resources, while data streams are potentially infinite. Furthermore, data stream processing must be fast, and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide Quality of Service (QoS) for the data stream processing task. To this end, adequate QoS metrics such as maximum output delay and minimum result data rate are defined. A cost model is then presented for deriving the required processing resources from the specified QoS, and on that basis the stream processing operations are scheduled. Depending on the required QoS and the available resources, weight can be shifted among the individual resources and QoS metrics. Calculating and scheduling resources requires considerable expert knowledge about the characteristics of the stream operations and of the incoming data streams. This knowledge is often based on experience, so the resource calculation and reservation must be revised from time to time, which leads to occasional interruptions of the continuous data stream processing, of the delivery of results, and thus of the negotiated Quality of Service. The proposed robustness concept supports the user and reduces the number of interruptions by provisioning additional resources.
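The cost-model step can be illustrated with a small, hedged sketch. The Python fragment below is not the cost model from the thesis; it assumes a simple M/M/1 queueing view, and all names (QosSpec, StreamOperator, required_cpu_share) are hypothetical. It only shows how QoS metrics such as maximum output delay and minimum result data rate could be translated into a CPU reservation.

```python
from dataclasses import dataclass

@dataclass
class QosSpec:
    max_output_delay: float   # seconds: upper bound on output delay
    min_result_rate: float    # result tuples per second

@dataclass
class StreamOperator:
    cost_per_tuple: float     # CPU-seconds consumed per input tuple
    selectivity: float        # output tuples produced per input tuple
    input_rate: float         # input tuples per second

def required_cpu_share(op: StreamOperator, qos: QosSpec) -> float:
    """Derive the CPU share (utilization in [0, 1]) needed to meet the QoS."""
    # The operator must keep up with the input and still deliver the
    # minimum result rate, whichever demands more throughput.
    arrival_rate = max(op.input_rate, qos.min_result_rate / op.selectivity)
    # M/M/1 delay model (an assumption here): mean delay W = 1 / (mu - lambda),
    # so bounding W by max_output_delay requires mu >= lambda + 1 / max_output_delay.
    service_rate = arrival_rate + 1.0 / qos.max_output_delay
    # A CPU share s provides a service rate of s / cost_per_tuple.
    return service_rate * op.cost_per_tuple

op = StreamOperator(cost_per_tuple=0.0005, selectivity=0.4, input_rate=800.0)
qos = QosSpec(max_output_delay=0.05, min_result_rate=200.0)
print(f"required CPU share: {required_cpu_share(op, qos):.2%}")
# A result above 100% would mean the QoS is infeasible on one core.
```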
2

Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing

Ji, Yuanzhen 27 March 2018 (has links) (PDF)
Data streams, in the form of potentially unbounded sequences of tuples, arise naturally in a large variety of domains, including financial markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency has promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of the produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of data imperfection within the streams or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated, or even eliminated, by enhancing the DSPS itself. This dissertation seeks to advance the state of the art in handling performance versus result-quality tradeoffs arising from both causes. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between end-to-end latency and query-result quality when dealing with stream disorder. Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce end-to-end latency while providing users with the desired query-result quality. A generic buffer-based QDDH framework and three instantiations of it for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines row-oriented and column-oriented data layout and processing techniques in data stream processing to improve throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes into account feasibility, a property unique to execution plans of continuous queries.
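As an illustration of the buffer-based disorder handling that QDDH builds on, here is a minimal sketch with a single latency/quality knob. It is a simplification under stated assumptions: the slack is fixed, whereas the QDDH framework adapts buffering to a user-specified quality target, and the class name and interface are hypothetical.

```python
import heapq
import itertools

class DisorderBuffer:
    """Buffer-based disorder handling: hold tuples until they are `slack`
    time units older than the newest timestamp seen, then release them in
    timestamp order. A larger slack means fewer late (lost) tuples and
    hence higher result quality, but also higher end-to-end latency."""

    def __init__(self, slack: float):
        self.slack = slack
        self.heap = []                    # min-heap ordered by timestamp
        self.tie = itertools.count()      # tie-breaker for equal timestamps
        self.max_ts = float("-inf")
        self.late_count = 0               # quality loss: tuples arriving too late

    def insert(self, ts: float, value):
        """Add one tuple; return the tuples that can now be emitted in order."""
        if ts <= self.max_ts - self.slack:
            self.late_count += 1          # its emission slot has already passed
            return []
        self.max_ts = max(self.max_ts, ts)
        heapq.heappush(self.heap, (ts, next(self.tie), value))
        emitted = []
        while self.heap and self.heap[0][0] <= self.max_ts - self.slack:
            ts_out, _, v = heapq.heappop(self.heap)
            emitted.append((ts_out, v))
        return emitted

buf = DisorderBuffer(slack=2.0)
for ts in [1.0, 3.0, 2.0, 6.0, 4.0]:      # the tuple at ts=2.0 arrives out of order
    print(buf.insert(ts, f"t{ts}"))       # the tuple at ts=4.0 arrives too late
```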
3

Approximate Data Analytics Systems

Le Quoc, Do 22 March 2018 (has links) (PDF)
Today, most modern online services make use of big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency is often impractical because the data grows exponentially, even faster than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to approximate rather than exact output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a subset instead of the entire input data. Unfortunately, advances in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees for stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed, large-scale stream data to achieve low latency and efficient resource utilization. To achieve these goals, we have designed and built the following approximate data analytics systems:
• StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics transparently and can adapt to rapid fluctuations of input data streams. For this system, we designed an online adaptive stratified reservoir sampling algorithm that produces approximate output with bounded error.
• IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
• PrivApprox: a data stream analytics system for privacy-preserving approximate computing. It provides high-utility, low-latency data analytics while preserving users' privacy, by combining privacy-preserving data analytics with approximate computing.
• ApproxJoin: an approximate distributed join system. It improves the performance of joins, which are critical but expensive operations in big data systems. We employed a sketching technique (Bloom filters) to avoid shuffling non-joinable data items through the network, and proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output.
Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems achieve significant performance speedups compared to state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically trade accuracy for throughput and latency, and require no or only minor modifications to existing applications.
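To illustrate the sampling idea behind StreamApprox, the sketch below implements plain per-stratum reservoir sampling (Algorithm R) with a naive scaled-sum estimator. The system described in the thesis uses an online adaptive stratified reservoir sampling algorithm with bounded error; the class and method names here are hypothetical.

```python
import random
from collections import defaultdict

class StratifiedReservoir:
    """Keep an independent size-k reservoir per stratum (e.g., per
    sub-stream), so that low-volume strata are not drowned out by
    high-volume ones, as they would be under uniform sampling."""

    def __init__(self, k: int, seed: int = 7):
        self.k = k
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)   # stratum -> reservoir of values
        self.seen = defaultdict(int)       # stratum -> tuples observed so far

    def insert(self, stratum, value: float):
        self.seen[stratum] += 1
        reservoir = self.samples[stratum]
        if len(reservoir) < self.k:
            reservoir.append(value)        # reservoir not yet full
        else:
            # Algorithm R: replace a random slot with probability k / n.
            j = self.rng.randrange(self.seen[stratum])
            if j < self.k:
                reservoir[j] = value

    def estimate_sum(self) -> float:
        """Scale each stratum's sample sum up by its sampling fraction."""
        return sum(
            sum(res) * (self.seen[s] / len(res))
            for s, res in self.samples.items() if res
        )

sr = StratifiedReservoir(k=100)
for i in range(10_000):
    sr.insert("sensor_a" if i % 10 else "sensor_b", 1.0)
print(sr.estimate_sum())   # ~10000, although only 200 values were retained
```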
4

Datenqualität in Sensordatenströmen / Data Quality in Sensor Data Streams

Klein, Anja 23 March 2010 (has links) (PDF)
The steady advance of intelligent sensor systems enables the automation and improvement of complex process and business decisions in a wide range of application scenarios. Sensors can be used, for example, to determine optimal maintenance dates or to control production lines. A fundamental problem in this setting is sensor data quality, which is limited by environmental influences and sensor failures. The goal of this thesis is to develop a data quality model that provides applications and data consumers with quality information for a comprehensive assessment of uncertain sensor data. In addition to data structures for efficient data quality management in data streams and databases, a comprehensive data quality algebra for computing the quality of data processing results is presented. Furthermore, methods for data quality improvement are developed that are specifically tailored to the requirements of sensor data processing. The thesis is completed by approaches for user-friendly data quality querying and visualization.
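A tiny sketch of what a data quality algebra can look like. The propagation rules below, magnitude-weighted quality for addition and averaged quality for a mean, are illustrative assumptions and not the algebra defined in the thesis.

```python
from dataclasses import dataclass

@dataclass
class QValue:
    """A sensor reading annotated with a quality indicator in [0, 1]
    (1.0 = trusted measurement, lower = interpolated or repaired)."""
    value: float
    quality: float

def q_add(a: QValue, b: QValue) -> QValue:
    """Addition with result quality weighted by each operand's magnitude."""
    total = abs(a.value) + abs(b.value)
    q = (min(a.quality, b.quality) if total == 0
         else (abs(a.value) * a.quality + abs(b.value) * b.quality) / total)
    return QValue(a.value + b.value, q)

def q_mean(readings: list[QValue]) -> QValue:
    """Mean of readings; result quality is the mean input quality."""
    n = len(readings)
    return QValue(sum(r.value for r in readings) / n,
                  sum(r.quality for r in readings) / n)

# A window with one repaired reading: the consumer sees both the
# aggregate and how trustworthy it is.
window = [QValue(20.1, 1.0), QValue(20.4, 1.0), QValue(19.8, 0.5)]
avg = q_mean(window)
print(f"avg={avg.value:.2f}, quality={avg.quality:.2f}")   # quality ~0.83
```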
5

Laufzeitadaption von zustandsbehafteten Datenstromoperatoren / Runtime Adaptation of Stateful Data Stream Operators

Wolf, Bernhard 04 December 2013 (has links) (PDF)
Changing data stream queries at runtime is made difficult in particular by stateful stream operators. Because operator state is kept in main memory and is lost on restart, migration techniques were developed in the past to preserve internal operator state during a change. These migration techniques are based on two different approaches, state transfer and parallel execution, but their realizations are restricted to centralized execution. With growing demands on data volumes and response times, data stream systems are increasingly executed in a distributed fashion, for example in sensor networks or distributed IT systems. For adapting queries at runtime, existing migration strategies are unsuitable or only partially suitable. This thesis contributes to solving this problem and to optimizing migration in data stream systems. Using preventive maintenance strategies in factory environments as an example, requirements for data stream processing, and for migration in particular, are derived. The general goal is thus to migrate as quickly as possible while continuing to deliver results. A detailed analysis of existing migration strategies discusses their strengths and weaknesses with respect to these requirements. For the adaptation of running data stream queries, a general methodology is presented that serves as the basis for the new strategies. This adaptation methodology supports two procedures for determining migration configurations: a numerical procedure for periodic data streams, and a heuristic procedure that can also be applied to aperiodic data streams. An essential feature for minimizing migration duration is the restriction to necessary state values, since in distributed environments a transmission time must be accounted for in the state transfer; existing approaches consider neither aspect. Newly developed state transfer methods additionally allow the transmission order of individual state values to be controlled. The concepts were implemented in an OSGi-based prototype and also analyzed in simulation. A comprehensive evaluation demonstrates that all components and concepts work as intended. The performance comparison between the existing and the new migration strategies comes out clearly in favor of the new strategies, which moreover are able to meet all requirements.
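The central idea, transferring only the state values the adapted query still needs, and in a controlled order, can be sketched as follows. This is a deliberately simplified, single-process stand-in for the thesis's distributed migration strategies, and all names are hypothetical.

```python
from typing import Any, Callable, Dict

def migrate_state(old_state: Dict[str, Any],
                  needed: Callable[[str], bool],
                  priority: Callable[[str], float]) -> Dict[str, Any]:
    """State-transfer sketch: copy only the entries the adapted query
    still needs, highest-priority keys first. In a distributed setting
    each assignment would be a network transfer, so both the restriction
    to necessary state and the transfer order shorten the migration."""
    keys = [k for k in old_state if needed(k)]
    keys.sort(key=priority, reverse=True)   # most important state first
    new_state: Dict[str, Any] = {}
    for k in keys:
        new_state[k] = old_state[k]         # stands in for sending one value
    return new_state

# Example: a windowed aggregate keyed by machine id; the adapted query
# no longer monitors machine m4, and the hottest machines transfer first.
old = {"m1": 4.2, "m2": 7.7, "m3": 0.9, "m4": 3.1}
hotness = {"m1": 10.0, "m2": 50.0, "m3": 5.0, "m4": 99.0}
new = migrate_state(old, needed=lambda k: k != "m4",
                    priority=lambda k: hotness[k])
print(new)   # {'m2': 7.7, 'm1': 4.2, 'm3': 0.9}, in transfer order
```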
6

Leistungsoptimierung der persistenten Datenverwaltung in DSP-Architekturen zur Live-Analyse von Sensordaten / Performance Optimization of Persistent Data Management in DSP Architectures for Live Analysis of Sensor Data

Weißbach, Manuel 28 October 2021 (has links)
Due to the ever-growing volume of data to be processed in many domains, big data applications have become increasingly widespread in recent years. As early as 2011, Twitter reported examining 15 million URLs per day in real time to suppress the spread of spam links [1]. Facebook processes over four million "Like" clicks per minute and manages over 300 petabytes of data [2]. The business portal LinkedIn delivered around one billion messages per day in 2011; by 2015, according to the company, this had grown to 1.1 trillion messages sent daily [3]. This steep rise reflects the exponential growth that is typical of big data. Gartner defines the term "big data" by its specific properties, known as the "three Vs": volume, variety, and velocity [4]. Besides the enormous amount of data to be processed (volume) and its diversity and lack of structure (variety), the speed at which the data is generated (velocity) is thus an essential characteristic of big data [5, 6]. If a processing backlog is to be avoided despite the constant and ever faster generation of new data, it follows that the continuously growing data volumes must also be processed ever faster.
7

Performance Optimizations and Operator Semantics for Streaming Data Flow Programs

Sax, Matthias J. 01 July 2020 (has links)
Modern companies are able to collect more data and require insights from it faster than ever before. Relational databases do not meet the requirements for processing the often unstructured data sets with reasonable performance. The database research community started to address these trends in the early 2000s, and two new research directions have attracted major interest since: large-scale non-relational data processing and low-latency data stream processing. Large-scale non-relational data processing, commonly known as "Big Data" processing, was quickly adopted in industry. In parallel, low-latency data stream processing was mainly driven by the research community, which developed new systems that embrace a distributed architecture and scalability and exploit data parallelism. While these systems have gained more and more attention in industry, major challenges remain in operating them at large scale. The goal of this dissertation is twofold: first, to investigate the runtime characteristics of large-scale data-parallel distributed streaming systems; and second, to propose the "Dual Streaming Model" to express the semantics of continuous queries over data streams and tables. Our goal is to improve the understanding of system and query runtime behavior with the aim of provisioning queries automatically. We introduce a cost model for streaming data flow programs that takes into account the two techniques of record batching and data parallelization, and we introduce optimization algorithms that leverage our model for cost-based query provisioning. The proposed Dual Streaming Model expresses the result of a streaming operator as a stream of successive updates to a result table, inducing a duality between streams and tables. Our model natively handles the inconsistency between the logical and the physical order of records within a data stream, which allows for deterministic semantics as well as low-latency query execution.
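The stream-table duality at the heart of the Dual Streaming Model can be illustrated in a few lines. The hypothetical sketch below models a grouped count whose output is a changelog stream of updates to a result table; a record that arrives out of physical order simply triggers a further update to the same key instead of invalidating an already-final result.

```python
from collections import defaultdict

class StreamingCount:
    """Grouped count under a dual-streaming view: the operator's output is
    not a final table but a changelog stream of (key, new_value) updates,
    from which the result table can be materialized at any time."""

    def __init__(self):
        self.table = defaultdict(int)   # current materialized result table

    def process(self, key: str) -> tuple:
        self.table[key] += 1
        return (key, self.table[key])   # one update record per input record

agg = StreamingCount()
# Physical arrival order may differ from logical (timestamp) order;
# each record, late or not, just yields another table update.
updates = [agg.process(k) for k in ["a", "b", "a", "a"]]
print(updates)            # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
print(dict(agg.table))    # {'a': 3, 'b': 1} -- the table induced by the stream
```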
