1

Flexible and efficient computation in large data centres

Gog, Ionel Corneliu, January 2018
Increasingly, online computer applications rely on large-scale data analyses to offer personalised and improved products. These large-scale analyses are performed on distributed data processing execution engines that run on thousands of networked machines housed within an individual data centre. These execution engines provide the programmer with the illusion of running data analysis workflows on a single machine, and offer programming interfaces that shield developers from the intricacies of implementing parallel, fault-tolerant computations. Many such execution engines exist, but they embed assumptions about the computations they execute, or only target certain types of computations. Understanding these assumptions involves substantial study and experimentation, so developers find it difficult to determine which execution engine is best; even when they do, they become "locked in", because porting workflows to another engine requires engineering effort.

In this dissertation, I first argue that in order to execute data analysis computations efficiently, and to flexibly choose the best engine, the way we specify data analysis computations should be decoupled from the execution engines that run them. I propose an architecture for such decoupled data processing, together with Musketeer, my proof-of-concept implementation of this architecture. In Musketeer, developers express data analysis computations using their preferred programming interface; these are translated into a common intermediate representation, from which code is generated and executed on the most appropriate execution engine. I show that data analysis computations can be written directly for Musketeer and executed on many execution engines, because Musketeer automatically generates code that is competitive with optimised hand-written implementations.

The diversity of execution engines causes different workflow types to coexist within a data centre, opening up both opportunities for sharing and potential pitfalls of co-location interference. In practice, however, workflows are placed either by high-quality schedulers that avoid co-location interference but choose placements slowly, or by schedulers that choose placements quickly but leave workflow run times unpredictable because of co-location interference. In this dissertation, I show that schedulers can choose high-quality placements with low latency. I develop several techniques that enable Firmament, a high-quality scheduler based on min-cost flow optimisation, to choose placements quickly in large data centres. Finally, I demonstrate that Firmament chooses placements at least as good as those of other sophisticated schedulers, but at the speeds associated with simple schedulers. Together, these contributions enable more efficient and effective use of data centres for large-scale computation than current solutions.
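Two of the ideas in this abstract lend themselves to small illustrations. First, the decoupling argument: if workflows are written against a common intermediate representation (IR) rather than against a specific engine, the same workflow can be lowered to whichever engine is currently best. The Python sketch below uses a hypothetical three-operator IR and two toy code-generating backends; it illustrates the architectural idea only and is not Musketeer's actual IR or code generator.

```python
# A minimal sketch of decoupled data processing: the workflow is built
# against a tiny IR of dataflow operators, and a backend is picked at
# run time to generate engine-specific code. All operator and backend
# names are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Scan:
    """Read an input relation."""
    table: str


@dataclass
class Filter:
    """Keep rows satisfying a predicate expression."""
    child: object
    predicate: str


@dataclass
class Count:
    """Count the surviving rows."""
    child: object


def to_sql(op) -> str:
    """Generate SQL from the IR (one possible backend)."""
    if isinstance(op, Scan):
        return f"SELECT * FROM {op.table}"
    if isinstance(op, Filter):
        return f"SELECT * FROM ({to_sql(op.child)}) t WHERE {op.predicate}"
    if isinstance(op, Count):
        return f"SELECT COUNT(*) FROM ({to_sql(op.child)}) t"
    raise TypeError(op)


def to_spark(op) -> str:
    """Generate PySpark-style code from the same IR (another backend)."""
    if isinstance(op, Scan):
        return f"spark.table('{op.table}')"
    if isinstance(op, Filter):
        return f"{to_spark(op.child)}.filter('{op.predicate}')"
    if isinstance(op, Count):
        return f"{to_spark(op.child)}.count()"
    raise TypeError(op)


# The same workflow, written once, targets whichever engine is best.
workflow = Count(Filter(Scan("requests"), "status = 500"))
print(to_sql(workflow))
print(to_spark(workflow))
```

Second, the scheduling argument: Firmament models placement as a min-cost flow optimisation, in which tasks supply flow that must reach a sink through machine nodes, and arc costs encode placement preferences. The toy below solves such a network with networkx; the cost values are invented for illustration and are far simpler than Firmament's actual cost models.

```python
# A toy version of scheduling as min-cost flow: each task supplies one
# unit of flow, each machine forwards flow to a sink, and arc costs
# encode placement preferences (e.g. data locality or expected
# interference). The made-up costs below are not Firmament's model.
import networkx as nx

G = nx.DiGraph()

# Each task is a source of one unit of flow (demand = -1).
for task in ["task0", "task1"]:
    G.add_node(task, demand=-1)
G.add_node("sink", demand=2)  # Every task must be placed somewhere.

# Task -> machine arcs: lower weight means a better placement.
G.add_edge("task0", "machineA", capacity=1, weight=1)   # local data
G.add_edge("task0", "machineB", capacity=1, weight=10)  # remote data
G.add_edge("task1", "machineA", capacity=1, weight=8)   # would interfere
G.add_edge("task1", "machineB", capacity=1, weight=2)   # idle machine

# Each machine has one free slot in this toy example.
for machine in ["machineA", "machineB"]:
    G.add_edge(machine, "sink", capacity=1, weight=0)

flow = nx.min_cost_flow(G)
for task in ["task0", "task1"]:
    placed = [m for m, f in flow[task].items() if f == 1]
    print(task, "->", placed[0])
# task0 -> machineA, task1 -> machineB (total cost 3)
```

Because one solve places all tasks at once, flow-based schedulers reach high placement quality; the dissertation's contribution is making that solve fast enough for large clusters.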
2

Contributions to large-scale data processing systems

Caneill, Matthieu, 05 February 2018
This thesis covers the topic of large-scale data processing systems, and more precisely three complementary approaches: the design of a system to predict computer failures through the analysis of monitoring data; the routing of data in a real-time system, looking at correlations between message fields to favor locality; and finally a novel framework for designing data transformations using directed graphs of blocks.

Through the lens of the Smart Support Center project, we design a scalable architecture to store time series reported by monitoring engines, which constantly check the health of computer systems. We use this data to perform predictions and to detect potential problems before they arise.

We then dive into routing algorithms for stream processing systems, and develop a layer to route messages more efficiently by avoiding hops between machines. For that purpose, we identify in real time the correlations that appear between the fields of these messages, such as hashtags and their geolocation, for example in the case of tweets. We use these correlations to create routing tables that favor the co-location of the actors handling these messages.

Finally, we present λ-blocks, a novel programming framework for building data processing jobs without writing code, but rather by creating graphs of blocks of code. The framework is fast and comes with batteries included: block libraries, plugins, and APIs to extend it. It is also able to manipulate computation graphs, for optimization, analysis, verification, or any other purpose.
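The correlation-routing idea can be made concrete with a small sketch: a router counts, online, which geolocation each hashtag co-occurs with, and once a hashtag is strongly correlated with one location it pins that hashtag to the machine already responsible for that location, saving a hop. All names and thresholds below are illustrative assumptions, not the dissertation's actual algorithm.

```python
# A sketch of correlation-aware routing for a stream of messages.
# The threshold values and machine names are invented for illustration.
from collections import Counter, defaultdict

MACHINES = ["m0", "m1", "m2"]


def by_location(location: str) -> str:
    """Default routing: hash the geolocation field to a machine."""
    return MACHINES[hash(location) % len(MACHINES)]


# Online co-occurrence counts: hashtag -> locations seen alongside it.
cooccur: dict[str, Counter] = defaultdict(Counter)
routing_table: dict[str, str] = {}  # hashtag -> pinned machine


def route(message: dict) -> str:
    hashtag, location = message["hashtag"], message["location"]
    cooccur[hashtag][location] += 1
    top_location, count = cooccur[hashtag].most_common(1)[0]
    # Once a hashtag is strongly correlated with one location, pin it
    # to the machine handling that location, avoiding an extra hop.
    if count >= 3 and count / sum(cooccur[hashtag].values()) > 0.8:
        routing_table[hashtag] = by_location(top_location)
    return routing_table.get(hashtag, by_location(location))


stream = [{"hashtag": "#tourEiffel", "location": "Paris"}] * 4
for msg in stream:
    print(route(msg))  # settles on one machine once correlated
```

Similarly, the block-graph idea behind λ-blocks can be sketched as a directed graph of named code blocks executed in topological order; because the job is data (a graph) rather than code, it can be inspected, optimized, or verified before running. The sketch below illustrates the concept only and is not λ-blocks' actual API or file format.

```python
# A minimal block-graph engine: blocks are reusable functions, the job
# is a dependency graph over block names, and the engine runs blocks in
# topological order, feeding each the outputs of its predecessors.
from graphlib import TopologicalSorter

# Block library for this sketch (parameter-free, for brevity).
blocks = {
    "read":   lambda: [1, 2, 3, 4, 5],
    "square": lambda xs: [x * x for x in xs],
    "total":  lambda xs: sum(xs),
}

# The computation graph: block name -> names of blocks it consumes.
graph = {"read": set(), "square": {"read"}, "total": {"square"}}


def run(graph, blocks):
    """Execute blocks in dependency order; the graph itself stays
    inspectable for optimization or verification before running."""
    results = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = [results[dep] for dep in sorted(graph[name])]
        results[name] = blocks[name](*inputs)
    return results


print(run(graph, blocks)["total"])  # 55
```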
