Global ETD Search

71	Dynamic Configuration of a Relocatable Driver and Code Generator for Continuous Deep Analytics / Dynamisk Konfigurering av en Omlokaliseringsbar Driver och Kod Genererare för Continuous Deep Analytics Bjuhr, Oscar January 2018 Modern stream processing engines usually use the Java virtual machine (JVM) as their execution platform. The JVM increases the portability and safety of applications at the cost of not fully utilising the performance of the physical machines. Using hardware accelerators such as GPUs for computationally heavy analysis of data streams is also restricted on the JVM. The Continuous Deep Analytics (CDA) project explores the possibility of a stream processor that executes native code directly on the underlying hardware using Rust. Rust is a young programming language that can statically guarantee the absence of memory errors and data races in programs without incurring runtime performance penalties. Rust is built on top of LLVM, which in theory lets Rust compile to a large set of target platforms. Each target platform does, however, require a specifically configured runtime environment for Rust’s compiler to work properly. The CDA compiler will run in a distributed setting where the compiler has to be able to relocate to different nodes to handle node failures. Setting up a relocatable Rust compiler in such a setting can be error-prone, and Docker is explored as a solution to this problem. A concurrent thread-based system is implemented in Scala for building Docker images and compiling Rust in containers. Docker shows potential for enabling easy relocation of the driver without manual configuration. Docker has no major effect on Rust’s compile time. The large Docker images required to compile Rust are a drawback of the solution: relocating the driver requires substantial network traffic. Reducing the size of the images would therefore make the solution more responsive. 
/ Moderna strömprocessorer använder vanligtvis Javas virtuella maskin (JVM) som plattform för exekvering. Det gör strömprocessorerna portabla och säkra men begränsar hur väl de kan använda kapaciteten i den underliggande fysiska maskinen. Att kunna använda sig av hårdvaruaccelerator som t.ex. grafikkort för tung beräkning och analys av dataströmmar är en anledning till varför projektet Continuous Deep Analytics (CDA) utforskar möjligheten att istället exekvera en strömprocessor direkt i den underliggande maskinen. Rust är ett ungt programmeringsspråk som statiskt kan garantera att program inte innehåller minnesfel eller "race conditions", detta utan att negativt påverka prestanda vid exekvering. Rust är byggt på LLVM vilket ger Rust en teoretisk möjlighet att kompilera till en stor mängd olika maskinarkitekturer. Varje specifik maskinarkitektur kräver dock att kompileringsmiljön är konfigurerad på ett specifikt sätt. CDAs kompilator kommer befinna sig i ett distribuerat system där kompilatorn kan bli flyttad till olika maskiner för att kunna hantera maskinfel. Att dynamiskt konfigurera kompilatorn i en sådan miljö kan leda till problem och därför testas Docker som en lösning på problemet. Ett trådbaserat system för parallell exekvering är implementerat i Scala för att bygga Docker bilder och kompilera Rust i containrar. Docker visar sig att ha en potential för att möjliggöra lätt omallokering av drivern utan manuell konfiguration. Docker har ingen stor påverkan på Rusts kompileringstid. De stora storlekarna på de Docker bilder som krävs för att kompilera Rust är en nackdel med lösningen. De gör att omallokering av drivern kräver mycket nätverkstrafik och kan därför ta lång tid. För att göra lösningen kvickare kan storleken av bilderna reduceras. Stream Processing Heterogeneous Cluster Big Data Rust Cargo Docker Ström Processor Heterogent Kluster Big Data Rust Cargo Docker Computer and Information Sciences Data- och informationsvetenskap
72	Tiered Storage Architecture for Stream Processing Systems Song, Ao January 2022 State management is the key component of a stateful stream processing system. Normally, a stream processing system supports either an embedded state backend or an external backend. An embedded backend stores the data locally on the computing node. It is efficient to read and write, but the system loses the ability to scale out. An external backend solves that problem, but performance is compromised because operations on the data have to go over the network. This project aims to find an approach that combines the advantages of both embedded and external backends: a tiered storage system is proposed that addresses this issue at the cost of more storage space. The proposed solution consists of three layers: an ephemeral layer, an embedded layer, and an external layer. The ephemeral layer and the embedded layer reside on the local node, while the external layer resides on an external node to make the system reconfigurable. For every kind of operation, the application first retrieves data from the ephemeral layer; if no result is found there, the embedded layer is consulted, and finally the external layer is visited. Based on this design principle, operations on the state are conducted locally as much as possible for efficiency. This is demonstrated in the project by evaluating the performance with different key distributions. The experimental results show that the tiered storage system can provide good performance while retaining the capability of system reconfiguration. / Tillståndshantering är nyckelkomponenten i ett tillståndsbestämt strömbehandlingssystem. Normalt stöder ett strömbehandlingssystem antingen en inbäddad tillståndsbackend eller en extern backend. En inbäddad backend kommer att lagra data lokalt på datornoden. Det är effektivt att läsa och skriva men kommer att förlora förmågan att skala ut systemet. 
En extern backend kan lösa problemet men prestandan kommer att äventyras eftersom operationerna på datan måste gå över nätverket. Detta projekt syftar till att hitta ett tillvägagångssätt som kombinerar fördelarna med både inbäddad och extern backend, ett lagringssystem i nivåer har föreslagits för att lösa detta problem till priset av mer lagringsutrymme. Den föreslagna lösningen består av tre lager, kortvarigt lager, inbäddat lager och externt lager. Det tillfälliga lagret och det inbäddade lagret finns på den lokala noden medan det externa lagret finns på en extern nod för att göra systemet omkonfigurerbart. Applikationen kommer alltid att hämta data från det efemära lagret först för olika typer av operationer, sedan kommer det inbäddade lagret att konsulteras om inget resultat hittas i det efemära lagret och slutligen kommer det externa lagret att besökas. Utifrån denna designprincip kommer verksamheten på staten att bedrivas lokalt så mycket som möjligt för att vara effektiv. Denna poäng har bevisats i projektet genom att utvärdera prestandan med olika nyckelfördelningar. Experimentresultatet visar att det nivåbaserade lagringssystemet kan ge bra prestanda med möjlighet till systemomkonfigurering. State Backend Stream Processing Embedded Backend External Backend Tiered Storage System State Backend Strömbehandling Inbäddad Backend Extern Backend Lagringssystem i Nivåer Computer and Information Sciences Data- och informationsvetenskap
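The lookup order described in record 72 (ephemeral layer first, then the embedded layer, then the external layer) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class name `TieredStateStore`, the use of plain dictionaries as stand-ins for the three backends, and the policy of copying a hit into the ephemeral layer are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Tuple


class TieredStateStore:
    """Toy three-layer state store: ephemeral -> embedded -> external."""

    def __init__(self) -> None:
        # Plain dicts stand in for the real backends (assumption).
        self.ephemeral: Dict[str, Any] = {}
        self.embedded: Dict[str, Any] = {}
        self.external: Dict[str, Any] = {}

    def get(self, key: str) -> Tuple[Optional[Any], Optional[str]]:
        """Return (value, layer_name), or (None, None) if the key is absent."""
        for name, layer in (
            ("ephemeral", self.ephemeral),
            ("embedded", self.embedded),
            ("external", self.external),
        ):
            if key in layer:
                value = layer[key]
                # Assumed policy: copy the hit into the ephemeral layer so
                # that the next read of this key is served locally.
                self.ephemeral[key] = value
                return value, name
        return None, None


store = TieredStateStore()
store.external["user:1"] = 10   # only available on the remote node
store.embedded["user:2"] = 20   # available on the local node

first = store.get("user:1")     # served by the external layer
second = store.get("user:1")    # now served by the ephemeral layer
miss = store.get("user:3")      # present in no layer
print(first, second, miss)
```

Under this assumed promotion policy only the first read of a key pays the network cost, which matches the abstract's stated goal of conducting state operations locally as much as possible.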
73	Real-Time Failure Event Streaming of Continuous Integration Builds / Realtidsströmning av Felhändelser i Kontinuerlig Integration Seifert, Felix January 2022 An application build describes compiling and linking the source code of a developed application to libraries and executables. A Continuous Integration (CI) build executes such a build after the source code has been changed and tries to integrate the changes into the existing application. Such CI builds are executed automatically and include automated software tests, which give the developer the assurance that the changes are technically correct. When the time between the discovery of a test failure and the notification of the developer is too long, the development process is impacted negatively and the beneficial effects of CI decrease. Even though several companies already have CI systems that display all events of a single CI build on a terminal during runtime, bigger applications often involve several CI builds in a single CI pipeline to integrate code changes. Observing the events of these CI builds during runtime might require concurrent monitoring of several different terminals. This thesis overcomes this issue by developing a Proof of Concept (PoC) which streams the test failures of a whole CI pipeline to the developer in real time. To show the feasibility of real-time failure event streaming of CI builds, the PoC is implemented within Spotify’s CI for client-facing applications. The issues highlighted by this initial PoC will help to refine the whole CI practice. Furthermore, the faster feedback cycles realised by this PoC will lead to increased productivity, efficiency and happiness for the involved developers and, eventually, higher quality of the developed software. / Ett applikationsbygge beskriver kompilering och länkning av källkod för en utvecklad applikation till bibliotek och körbara filer. 
Ett Kontinuerlig Integrerings (CI)-bygge kör ett sådant bygge efter att källkoden har ändrats och försöker integrera ändringarna i den befintliga applikationen. Sådana CI-byggen exekveras automatiskt och inkluderar automatiserade mjukvarutester, som ger utvecklaren en försäkran om att ändringarna är tekniskt korrekta. När tiden mellan upptäckten av ett testfel och meddelandet till utvecklaren om det är för lång kommer utvecklingsprocessen att påverkas negativt och de fördelaktiga effekterna av CI minskar. Även om flera företag redan har CI-system som visar alla händelser av ett enskilt CI-bygge i en terminal under körning, involverar större applikationer ofta flera CI-byggen i en och samma CI-pipeline för att integrera kodändringar. Att observera händelserna i dessa CI-byggen under körning kan kräva jämlöpande övervakning av flera olika terminaler. Den här avhandlingen övervinner detta problem genom att utveckla en PoC som strömmar testfelen för en hel CI-pipeline i realtid till utvecklaren. För att visa genomförbarheten av strömning av felhändelser i realtid av CI-byggen implementeras PoC i Spotifys CI för klientvända applikationer. De problem som lyfts fram av denna första PoC kommer att bidra till att förfina hela CI-praxisen. Dessutom kommer de snabbare återkopplingscyklerna som realiseras av denna PoC att leda till ökad produktivitet, effektivitet och glädje för de inblandade utvecklarna och, så småningom, högre kvalitet på den utvecklade mjukvaran. Continuous Integration Build Streaming Stream Processing Systems Realtime systems Developer Productivity Engineering Kontinuerlig Integration Bygge Strömning Strömningssystem Realtidssystem Utvecklarproduktivitet Computer and Information Sciences Data- och informationsvetenskap
74	State Management for Efficient Event Pattern Detection Zhao, Bo 20 May 2022 Event Stream Processing (ESP) Systeme überwachen kontinuierliche Datenströme, um benutzerdefinierte Queries auszuwerten. Die Herausforderung besteht darin, dass die Queryverarbeitung zustandsbehaftet ist und die Anzahl von Teilübereinstimmungen mit der Größe der verarbeiteten Events exponentiell anwächst. Die Dynamik von Streams und die Notwendigkeit, entfernte Daten zu integrieren, erschweren die Zustandsverwaltung. Erstens liefern heterogene Eventquellen Streams mit unvorhersehbaren Eingaberaten und Queryselektivitäten. Während Spitzenzeiten ist eine erschöpfende Verarbeitung unmöglich, und die Systeme müssen auf eine Best-Effort-Verarbeitung zurückgreifen. Zweitens erfordern Queries möglicherweise externe Daten, um ein bestimmtes Event für eine Query auszuwählen. Solche Abhängigkeiten sind problematisch: Das Abrufen der Daten unterbricht die Stream-Verarbeitung. Ohne eine Eventauswahl auf Grundlage externer Daten wird das Wachstum von Teilübereinstimmungen verstärkt. In dieser Dissertation stelle ich Strategien für optimiertes Zustandsmanagement von ESP Systemen vor. Zuerst ermögliche ich eine Best-Effort-Verarbeitung mittels Load Shedding. Dabei werden sowohl Eingabeevents als auch Teilübereinstimmungen systematisch verworfen, um eine Latenzschwelle mit minimalem Qualitätsverlust zu garantieren. Zweitens integriere ich externe Daten, indem ich das Abrufen dieser von der Verwendung in der Queryverarbeitung entkoppele. Mit einem effizienten Caching-Mechanismus vermeide ich Unterbrechungen durch Übertragungslatenzen. Dazu werden externe Daten basierend auf ihrer erwarteten Verwendung vorab abgerufen und mittels Lazy Evaluation bei der Eventauswahl berücksichtigt. Dabei wird ein Kostenmodell verwendet, um zu bestimmen, wann welche externen Daten abgerufen und wie lange sie im Cache aufbewahrt werden sollen. 
Ich habe die Effektivität und Effizienz der vorgeschlagenen Strategien anhand von synthetischen und realen Daten ausgewertet und unter Beweis gestellt. / Event stream processing systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. However, the challenge is that query processing is stateful and it maintains partial matches that grow exponentially in the size of processed events. State management is complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems shall resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: Fetching the remote data interrupts the stream processing. Yet, without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model is used to determine when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data. 
I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency. Datenstromverarbeitung Complex event processing Mustererkennung Datenbankmanagementsystem Data stream processing Complex event processing Pattern detection Database management systems 004 Informatik ST 265 ddc:004
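The load shedding idea in record 74 (discarding partial matches so that processing stays within a bound) can be sketched as below. This is a simplified illustration, not the dissertation's method: a fixed `capacity` stands in for the latency bound, and the `utility` score attached to each partial match is a hypothetical placeholder for the cost model the abstract mentions.

```python
import heapq
from typing import Callable, List, Tuple

# A partial match is modelled as (match_id, estimated_utility); both fields
# are hypothetical and only serve the illustration.
PartialMatch = Tuple[str, float]


def shed_partial_matches(
    matches: List[PartialMatch],
    capacity: int,
    utility: Callable[[PartialMatch], float],
) -> List[PartialMatch]:
    """Keep at most `capacity` partial matches, dropping the least useful."""
    if len(matches) <= capacity:
        return list(matches)
    # heapq.nlargest returns the `capacity` items with the highest utility,
    # ordered from highest to lowest.
    return heapq.nlargest(capacity, matches, key=utility)


state = [("m1", 0.9), ("m2", 0.1), ("m3", 0.5)]
kept = shed_partial_matches(state, capacity=2, utility=lambda m: m[1])
print(kept)
```

A real system would derive the bound from observed latency and estimate utility from how likely a partial match is to complete; here both are supplied by the caller.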
75	A Novel Cloud Broker-based Resource Elasticity Management and Pricing for Big Data Streaming Applications Runsewe, Olubisi A. 28 May 2019 The pervasive availability of streaming data from various sources is driving today’s enterprises to acquire low-latency big data streaming applications (BDSAs) for extracting useful information. In parallel, recent advances in technology have made it easier to collect, process and store these data streams in the cloud. For most enterprises, gaining insights from big data is immensely important for maintaining competitive advantage. However, the majority of enterprises have difficulty managing the multitude of BDSAs and the complex issues cloud technologies present, giving rise to the incorporation of cloud service brokers (CSBs). Generally, the main objective of the CSB is to maintain the heterogeneous quality of service (QoS) of BDSAs while minimizing costs. Although the cloud has many desirable features, it presents CSBs with two major challenges in achieving this goal: resource prediction and resource allocation. First, most stream processing systems allocate a fixed amount of resources at runtime, which can lead to under- or over-provisioning as BDSA demands vary over time. Thus, obtaining an optimal trade-off between QoS violation and cost requires an accurate demand prediction methodology to prevent waste, degradation or shutdown of processing. Second, coordinating resource allocation and pricing decisions for self-interested BDSAs to achieve fairness and efficiency can be complex. This complexity is exacerbated by the recent introduction of containers. 
This dissertation addresses the cloud resource elasticity management issues for CSBs as follows. First, we provide two contributions to the resource prediction challenge: we propose a novel layered multi-dimensional hidden Markov model (LMD-HMM) framework for managing time-bounded BDSAs and a layered multi-dimensional hidden semi-Markov model (LMD-HSMM) to address unbounded BDSAs. Second, we present a container resource allocation mechanism (CRAM) for optimal workload distribution to meet the real-time demands of competing containerized BDSAs. We formulate the problem as an n-player non-cooperative game among a set of heterogeneous containerized BDSAs. Finally, we incorporate a dynamic incentive-compatible pricing scheme that coordinates the decisions of self-interested BDSAs to maximize the CSB’s surplus. Experimental results demonstrate the effectiveness of our approaches. Cloud Computing Big Data Resource Prediction Resource Allocation Stream Processing Game Theory Layered Hidden Markov Model Resource Management Container-Clusters Virtual Machines Streaming Applications Nash Equilibrium Queuing Theory Dynamic Pricing Resource scaling
76	Semantically-enabled stream processing and complex event processing over RDF graph streams / Traitement de flux sémantiquement activé et traitement d'évènements complexes sur des flux de graphe RDF Gillani, Syed 04 November 2016 French abstract not provided by the author. / There is a paradigm shift in the nature and processing means of today’s data: data used to be mostly static, stored in large databases to be queried. Today, with the advent of new applications and means of collecting data, most applications on the Web and in enterprises produce data continuously in the form of streams. Users of these applications therefore expect to process a large volume of data and obtain fresh, low-latency results. This has resulted in the introduction of Data Stream Management Systems (DSMSs) and the Complex Event Processing (CEP) paradigm, each with distinctive aims: DSMSs are mostly employed to process traditional query operators (mostly stateless), while CEP systems focus on temporal pattern matching (stateful operators) to detect changes in the data that can be thought of as events. In the past decade or so, a number of scalable and performance-intensive DSMSs and CEP systems have been proposed. Most of them, however, are based on relational data models, which raises the question of support for heterogeneous data sources, i.e., variety of the data. Work on RDF stream processing (RSP) systems partly addresses the challenge of variety by promoting the RDF data model. Nonetheless, challenges like volume and velocity are overlooked by existing approaches. These challenges require customised optimisations which consider RDF as a first-class citizen and scale the process of continuous graph pattern matching. To gain insights into these problems, this thesis focuses on developing scalable RDF graph stream processing and semantically-enabled CEP systems (i.e., Semantic Complex Event Processing, SCEP). 
In addition to our optimised algorithmic and data structure methodologies, we also contribute to the design of a new query language for SCEP. Our contributions in these two fields are as follows: • RDF Graph Stream Processing. We first propose an RDF graph stream model, where each data item/event within a stream comprises an RDF graph (a set of RDF triples). Second, we implement customised indexing techniques and data structures to continuously process RDF graph streams in an incremental manner. • Semantic Complex Event Processing. We extend the idea of RDF graph stream processing to enable SCEP over such RDF graph streams, i.e., temporal pattern matching. Our first contribution in this context is to provide a new query language that encompasses the RDF graph stream model and employs a set of expressive temporal operators such as sequencing, Kleene-+, negation, optional, conjunction, disjunction and event selection strategies. Based on this, we implement a scalable system that employs a non-deterministic finite automata model to evaluate these operators in an optimised manner. We leverage techniques from diverse fields, such as relational query optimisations, incremental query processing, sensor and social networks in order to solve real-world problems. We have applied our proposed techniques to a wide range of real-world and synthetic datasets to extract the knowledge from RDF structured data in motion. Our experimental evaluations confirm our theoretical insights, and demonstrate the viability of our proposed methods. Traitement de flux Traitement d'évènements complexes Graphes RDF Optimisations de question Ebauche de requête Web sémantique Requêtes top-k Données de graphes Stream processing Complex event processing RDF graphs Query optimisations Query design Semantic web Top-k queries Graph databases
77	Analysis and coordination of mixed-criticality cyber-physical systems Maurer, Simon January 2018 A Cyber-physical System (CPS) can be described as a network of interlinked, concurrent computational components that interact with the physical world. Such a system is usually of a reactive nature and must satisfy strict timing requirements to guarantee correct behaviour. The components can be of mixed criticality, which implies different progress models and communication models, depending on whether the focus of a component lies on predictability or resource efficiency. In this dissertation I present a novel approach that bridges the gap between stream processing models and Labelled Transition Systems (LTSs). The former offer powerful tools to describe concurrent systems of, usually simple, components, while the latter make it possible to describe complex, reactive components and their mutual interaction. To bridge the two domains I introduce the novel LTS Synchronous Interface Automaton (SIA), which makes it possible to model the interaction protocol of a process via its interface and to incrementally compose simple processes into more complex ones while preserving the system properties. Exploiting these properties, I introduce an analysis to identify permanent blocking situations in a network of composed processes. SIAs are wrapped by the novel component-based coordination model Process Network with Synchronous Communication (PNSC), which describes a network of concurrent processes in which multiple communication models and the co-existence and interaction of heterogeneous processes are supported thanks to well-defined interfaces. The work presented in this dissertation follows a holistic approach which spans from the theory of the underlying model to an instantiation of the model as a novel coordination language, called Streamix. The language uses network operators to compose networks of concurrent processes in a structured and hierarchical way. 
The work is validated by a prototype implementation of a compiler and a Run-time System (RTS) that can compile a Streamix program and execute it on a platform with support for ISO C, POSIX threads, and a Linux operating system.
78	Geo-distributed multi-layer stream aggregation Cannalire, Pietro January 2018 Standard processing architectures satisfy many applications by employing existing stream processing frameworks that manage distributed data processing. In some specific cases, geographically distributed data sources require the processing to be distributed even further, over a large area, by employing a geographically distributed architecture. The issue addressed in this work is reducing the movement of data that flows continuously across the network in a geo-distributed architecture, from streaming sources to the processing location and among processing entities within the same distributed cluster. Reducing data movement can be critical for decreasing bandwidth costs, since accessing links placed in the middle of the network can be costly and the cost can grow as the amount of data exchanged increases. In this work we propose a different concept for deploying geographically distributed architectures, relying on Apache Spark Structured Streaming and Apache Kafka. The features an algorithm needs in order to run on a geo-distributed architecture are provided. The algorithms executed on this architecture apply windowing and data synopsis techniques to produce summaries of the input data and to address issues of the geographically distributed architecture. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis contributes a new model for building geographically distributed architectures. The experimental results show that, for the algorithms running on top of the geo-distributed architecture, the computation time is reduced on average by 70% compared to the distributed setup. Similarly, the amount of data exchanged across the network is reduced on average by 99% compared to the distributed setup. 
/ Standardbehandlingsarkitekturer är tillräckligt för uppfylla behoven av många tillämpningar genom användning av befintliga ramverk för flödesbehandling med stöd för distribuerad databehandling. I specifika fall kan geografiskt fördelade datakällor kräva att databehandlingen fördelas över ett stort område med hjälp av en geografiskt distribuerad arkitektur. Problemet som behandlas i detta arbete är minskningen av kontinuerlig dataöverföring i ett nätverk med geo-distribuerad arkitektur. Minskad dataöverföring kan vara avgörande för minskade bandbreddskonstnader då åtkomst av länkar placerade i mitten av ett nätverk kan vara dyrt och öka ytterligare med tilltagande dataöverföring. I det här arbetet vill vi skapa ett nytt koncept för att upprätta geografiskt distribuerade arkitekturer med hjälp av Apache Spark Structured Streaming och Apache Kafka. Funktioner och förutsättningar som behövs för att en algoritm ska kunna köras på en geografisk distribuerad arkitektur tillhandahålls. Algoritmerna som ska köras på denna arkitektur tillämpar “windowing synopsing” och “data synopses”-tekniker för att framställa en sammanfattning av ingående data samt behandla problem beträffande den geografiskt fördelade arkitekturen. Beräkning av medelvärdet och Misra-Gries-algoritmen implementeras för att testa den konstruerade arkitekturen. Denna avhandling bidrar till att förse ny modell för att bygga geografiskt distribuerad arkitektur. Experimentella resultat visar att beräkningstiden reduceras i genomsnitt 70% för de algoritmer som körs ovanför den geo-distribuerade arkitekturen jämfört med den distribuerade konfigurationen. På liknande sätt reduceras mängden data som utväxlas över nätverket med 99% i snitt jämfört med den distribuerade inställningen. 
stream processing geo-distributed architecture algorithms windowing data synopses Apache Spark Structured Streaming Apache Kafka Misra-Gries algorithm flödesbehandling geo-distribuerade arkitekturen algoritmerna windowing data synopses Apache Spark Structured Streaming Apache Kafka Misra-Gries-algoritmen Computer and Information Sciences Data- och informationsvetenskap
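Record 78 names the Misra-Gries algorithm as one of the data synopses used to summarise input streams. A minimal sketch of the textbook algorithm follows; it is not taken from the thesis, and the function name `misra_gries` and the sample stream are illustrative. With parameter `k`, the summary holds at most `k - 1` counters, and each reported count underestimates the true frequency by at most `n / k` for a stream of `n` items.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return at most k - 1 heavy-hitter candidates with lower-bound counts."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop the ones
            # that reach zero. The incoming item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


summary = misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3)
print(summary)  # {'a': 3, 'b': 1}
```

The summary is much smaller than the raw stream, which is why a synopsis of this kind suits a setting where only summaries, rather than all events, need to cross the wide-area network.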
79	Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates Idris, Muhammad 05 March 2019 Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect patterns, trends, and anomalies. These kinds of solutions find applications in Financial Systems, Industrial Control Systems, Business Intelligence and on-line Machine Learning, among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, where the main task then is to maintain query results efficiently under frequent updates. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis where the data is refreshed periodically and in batches, and stream processing solutions process streams of data from transient sources as flows of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems. In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation of queries that are based on the relational incremental view maintenance model and mostly focus on queries that feature equi-joins. 
Streaming systems, in contrast, generally follow automaton based models to evaluate queries under updates, and they generally process queries that mostly feature comparisons of temporal attributes (e.g. timestamp attributes) along with comparisons of non-temporal attributes over streams of bounded sizes. Temporal comparisons constitute inequality constraints while non-temporal comparisons can either be equality or inequality constraints. Hence these systems mostly process inequality joins. As a starting point for our research, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation just like in BI systems in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and the update processing cost. To this end, the existing approaches of dynamic evaluation in both kinds of systems present a trade-off between memory footprint and the update processing cost. More specifically, systems that avoid materialization of query (sub)results incur high update latency and systems that materialize (sub)results incur high memory footprint. We are interested in investigating the possibility to build a model that can address this trade-off. In particular, we overcome this trade-off by investigating the possibility of practical dynamic evaluation algorithm for queries that appear in both kinds of systems and present a main-memory data representation that allows to enumerate query (sub)results without materialization and can be maintained efficiently under updates. 
We call this representation the Dynamic Constant Delay Linear Representation (DCLRs). We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded-delay (and with constant delay for a sub-class of queries), 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only), 3) they take space linear in the size of the database, 4) they can be maintained efficiently under updates. We first study the DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present the dynamic evaluation algorithm called the Dynamic Yannakakis (DYN) algorithm. Then, we present the generalization of the DYN algorithm to the class of acyclic queries featuring multi-way Theta-joins with projections and call it Generalized DYN (GDYN). We devise DCLRs with the above properties for acyclic conjunctive queries, and the working of DYN and GDYN over DCLRs is based on a particular variant of join trees, called the Generalized Join Trees (GJTs), that guarantee the above-described properties of DCLRs. We define GJTs and present algorithms to test a conjunctive query featuring Theta-joins for acyclicity and to generate GJTs for such queries. We extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to testing a conjunctive query featuring multi-way Theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic. GDYN is hence a unified framework based on DCLRs that enables processing of queries that appear in streaming systems as well as in BI systems in a unified main-memory model and addresses the space-time trade-off. We instantiate GDYN to the particular case where all Theta-joins involve only equalities and inequalities and call this instantiation IEDYN. 
We implement DYN and IEDYN as query compilers that generate executable programs in the Scala programming language and provide all the necessary data structures and their maintenance and enumeration methods in a continuous stream processing model. We evaluate DYN and IEDYN against state-of-the-art BI and streaming systems on both industrial and synthetically generated benchmarks. We show that DYN and IEDYN outperform the existing systems by over an order of magnitude in both memory footprint and update processing time. / Doctorate in Engineering Sciences and Technology / info:eu-repo/semantics/nonPublished General computer science Business Intelligence Databases Data Warehouse Query Processing Query Execution Real-time Analytics Stream Processing Complex Event Processing Information Flow Processing Joins Join Trees Main-Memory System Inequality Joins Theta Joins Analytical Processing Query Language Acyclic Joins Join Algorithms Acyclicity
80	Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure / Mécanismes prenant en compte la qualité de service pour la (re)configuration d’applications de traitement de flux de données sur une infrastructure hautement distribuée Da Silva Veith, Alexandre 23 September 2019 (has links) A large part of this big data has more value when it is analysed quickly, as it is generated. In several emerging application scenarios, such as smart cities, operational monitoring of large infrastructures, and the Internet of Things, continuous data streams must be processed under very short delays. In several domains, this processing is needed to detect patterns, identify failures, and guide decision making. Data is therefore often gathered and analysed by software environments designed for processing continuous data streams. These data stream processing environments deploy applications in the form of a directed graph, or dataflow. A dataflow contains one or more sources (i.e., sensors, gateways or actuators); operators that perform transformations on the data (e.g., filtering and aggregation); and sinks (i.e., endpoints that consume the queries or store the data). In this thesis we propose a set of strategies for placing operators on a massively distributed cloud-edge infrastructure, taking into account the characteristics of the resources and the requirements of the applications. In particular, we first decompose the application graph by identifying behaviours such as forks and joins, and then place it dynamically on the infrastructure. 
Simulations and a prototype considering several application parameters demonstrate that our approach can reduce end-to-end latency by more than 50% and also improve other quality of service metrics. The solution search space for operator reconfiguration can be enormous depending on the number of operators, streams, resources and network links. Moreover, it is important to minimise the cost of migration while improving latency. In previous work, Reinforcement Learning (RL) and Monte-Carlo Tree Search (MCTS) have been used to solve problems involving large numbers of actions and search states. We model the application reconfiguration problem as a Markov Decision Process (MDP) and study the use of RL and MCTS algorithms to devise reconfiguration plans that improve several quality of service metrics. / A large part of this big data is most valuable when analysed quickly, as it is generated. Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructure, and Internet of Things (IoT), continuous data streams must be processed under very short delays. In multiple domains, there is a need for processing data streams to detect patterns, identify failures, and gain insights. Data is often gathered and analysed by Data Stream Processing Engines (DSPEs). A DSPE commonly structures an application as a directed graph or dataflow. A dataflow has one or multiple sources (i.e., gateways or actuators); operators that perform transformations on the data (e.g., filtering); and sinks (i.e., queries that consume or store the data). Most complex operator transformations store information about previously received data as new data is streamed in. Also, a dataflow has stateless operators that consider only the current data. 
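The distinction between stateless and stateful operators in a dataflow can be illustrated with a small sketch. It is not taken from the thesis or from any particular DSPE: the operator names (`keep_valid`, `moving_average`), the thresholds and the window size are all hypothetical, and Python generators stand in for the streams between operators.

```python
from collections import deque


def source(readings):
    """Source: emits raw readings one at a time (stand-in for a gateway)."""
    yield from readings


def keep_valid(stream, low, high):
    """Stateless operator: decides using only the current item."""
    return (x for x in stream if low <= x <= high)


def moving_average(stream, size):
    """Stateful operator: remembers the last `size` items it received."""
    window = deque(maxlen=size)
    for x in stream:
        window.append(x)
        yield sum(window) / len(window)


def run_dataflow(readings):
    """Sink: drains the pipeline and stores the results in a list."""
    pipeline = moving_average(keep_valid(source(readings), 0, 100), size=2)
    return list(pipeline)
```

Calling `run_dataflow([10, 20, -5, 40])` drops the out-of-range reading in the stateless filter and then averages each remaining reading with its predecessor, giving `[10.0, 15.0, 30.0]`. Migrating `keep_valid` to another resource needs no extra care, whereas migrating `moving_average` would also require moving its window.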
Traditionally, Data Stream Processing (DSP) applications were conceived to run in clusters of homogeneous resources or on the cloud. In a cloud deployment, the whole application is placed on a single cloud provider to benefit from virtually unlimited resources. This approach allows for elastic DSP applications with the ability to allocate additional resources or release idle capacity on demand during runtime to match the application requirements. We introduce a set of strategies to place operators onto cloud and edge while considering characteristics of resources and meeting the requirements of applications. In particular, we first decompose the application graph by identifying behaviours such as forks and joins, and then dynamically split the dataflow graph across edge and cloud. Comprehensive simulations and a real testbed considering multiple application settings demonstrate that our approach can improve end-to-end latency by over 50% and also improve other QoS metrics. The solution search space for operator reassignment can be enormous depending on the number of operators, streams, resources and network links. Moreover, it is important to minimise the cost of migration while improving latency. Reinforcement Learning (RL) and Monte-Carlo Tree Search (MCTS) have been used to tackle problems with large search spaces and states, performing at human-level or better in games such as Go. We model the application reconfiguration problem as a Markov Decision Process (MDP) and investigate the use of RL and MCTS algorithms to devise reconfiguration plans that improve QoS metrics. 
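To make the MDP view of reconfiguration concrete, here is a minimal sketch under stated assumptions. The state is a placement of operators on resources, an action migrates one operator, and the reward is the latency gained minus a migration penalty. All numbers, the two-operator linear pipeline and every name below are hypothetical, and the one-step greedy choice merely stands in for the RL and MCTS planners studied in the thesis.

```python
# Hypothetical per-operator compute times (ms) on each resource.
COMPUTE_MS = {
    "edge": {"filter": 4.0, "aggregate": 12.0},
    "cloud": {"filter": 1.0, "aggregate": 2.0},
}
# Hypothetical network delay (ms) between resources.
LINK_MS = {
    ("edge", "edge"): 0.0, ("cloud", "cloud"): 0.0,
    ("edge", "cloud"): 30.0, ("cloud", "edge"): 30.0,
}
MIGRATION_COST_MS = 5.0  # assumed flat penalty per migrated operator
PIPELINE = ["filter", "aggregate"]  # linear dataflow fed by an edge source


def latency(placement):
    """End-to-end latency of a placement (one resource per operator)."""
    total, location = 0.0, "edge"  # data originates at the edge
    for op, resource in zip(PIPELINE, placement):
        total += LINK_MS[(location, resource)] + COMPUTE_MS[resource][op]
        location = resource
    return total


def actions(placement):
    """Placements reachable by migrating exactly one operator."""
    for i, current in enumerate(placement):
        for resource in COMPUTE_MS:
            if resource != current:
                yield placement[:i] + (resource,) + placement[i + 1:]


def reward(old, new):
    """Latency gained by the move, minus the migration penalty."""
    return latency(old) - latency(new) - MIGRATION_COST_MS


def best_move(placement):
    """One-step greedy choice; RL or MCTS would plan further ahead."""
    return max(actions(placement), key=lambda new: reward(placement, new))
```

With these made-up numbers, the placement `("cloud", "edge")` pays the network delay twice, and `best_move(("cloud", "edge"))` returns `("edge", "edge")`. A greedy step like this can get stuck when a sequence of moves is needed, which is one motivation for planners that search over several steps.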
Mechanisms Quality of Service (re)configuration Data Stream Processing Applications Highly Distributed Infrastructure Internet of Things Edge-cloud infrastructure Queueing theory Markov Decision Process Reinforcement Learning
