Global ETD Search

1	Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers Peiro Sajjad, Hooman January 2016 (has links) In this thesis, our goal is to enable and achieve effective and efficient real-time stream processing in a geo-distributed infrastructure, by combining the power of central data centers and micro data centers. Our research focus is to address the challenges of distributing the stream processing applications and placing them closer to data sources and sinks. We enable applications to run in a geo-distributed setting and provide solutions for the network-aware placement of distributed stream processing applications across geo-distributed infrastructures. First, we evaluate Apache Storm, a widely used open-source distributed stream processing system, in the community network Cloud, as an example of a geo-distributed infrastructure. Our evaluation exposes new requirements for stream processing systems to function in a geo-distributed infrastructure. Second, we propose a solution to facilitate the optimal placement of the stream processing components on geo-distributed infrastructures. We present a novel method for partitioning a geo-distributed infrastructure into a set of computing clusters, each called a micro data center. According to our results, we can increase the minimum available bandwidth in the network and likewise, reduce the average latency to less than 50%. Next, we propose a parallel and distributed graph partitioner, called HoVerCut, for fast partitioning of streaming graphs. Since a lot of data can be presented in the form of graph, graph partitioning can be used to assign the graph elements to different data centers to provide data locality for efficient processing. Last, we provide an approach, called SpanEdge that enables stream processing systems to work on a geo-distributed infrastructure. SpenEdge unifies stream processing over the central and near-the-edge data centers (micro data centers). As a proof of concept, we implement SpanEdge by extending Apache Storm that enables it to run across multiple data centers. / <p>QC 20161005</p> geo-distributed stream processing geo-distributed infrastructure edge computing edge-based analytics Computer and Information Sciences Data- och informationsvetenskap
2	Reducing Long Tail Latencies in Geo-Distributed Systems Bogdanov, Kirill January 2016 (has links) Computing services are highly integrated into modern society. Millions of people rely on these services daily for communication, coordination, trading, and accessing to information. To meet high demands, many popular services are implemented and deployed as geo-distributed applications on top of third party virtualized cloud providers. However, the nature of such deployment provides variable performance characteristics. To deliver high quality of service, such systems strive to adapt to ever-changing conditions by monitoring changes in state and making run-time decisions, such as choosing server peering, replica placement, and quorum selection. In this thesis, we seek to improve the quality of run-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic implemented in these systems, and (3) by providing a clear view into the network and system states which allows these services to perform better-informed decisions. We performed a long-term cross datacenter latency measurement of the Amazon EC2 cloud provider. We used this data to quantify the variability of network conditions and demonstrated its impact on the performance of the systems deployed on top of this cloud provider. Next, we validate an application’s decision logic used in popular storage systems by examining replica selection algorithms. We introduce GeoPerf, a tool that uses symbolic execution and lightweight modeling to perform systematic testing of replica selection algorithms. We applied GeoPerf to test two popular storage systems and we found one bug in each. Then, using traceroute and one-way delay measurements across EC2, we demonstrated persistent correlation between network paths and network latency. We introduce EdgeVar, a tool that decouples routing and congestion based changes in network latency. By providing this additional information, we improved the quality of latency estimation, as well as increased the stability of network path selection. Finally, we introduce Tectonic, a tool that tracks an application’s requests and responses both at the user and kernel levels. In combination with EdgeVar, it provides a complete view of the delays associated with each processing stage of a request and response. Using Tectonic, we analyzed the impact of sharing CPUs in a virtualized environment and can infer the hypervisor’s scheduling policies. We argue for the importance of knowing these policies and propose to use them in applications’ decision making process. / Databehandlingstjänster är en välintegrerad del av det moderna samhället. Miljontals människor förlitar sig dagligen på dessa tjänster för kommunikation, samordning, handel, och åtkomst till information. För att möta höga krav implementeras och placeras många populära tjänster som geo-fördelning applikationer ovanpå tredje parters virtuella molntjänster. Det ligger emellertid i sakens natur att sådana utplaceringar resulterar i varierande prestanda. För att leverera höga servicekvalitetskrav behöver sådana system sträva efter att ständigt anpassa sig efter ändrade förutsättningar genom att övervaka tillståndsändringar och ta realtidsbeslut, som till exempel val av server peering, replika placering, och val av kvorum. Den här avhandlingen avser att förbättra kvaliteten på realtidsbeslut tagna av geo-fördelning system. Detta kan uppnås genom: (1) en bättre förståelse av underliggande utplaceringsvillkor, (2) systematisk och noggrann testning av beslutslogik redan implementerad i dessa system, och (3) en tydlig inblick i nätverket och systemtillstånd som tillåter dessa tjänster att utföra mer informerade beslut. Vi utförde en långsiktig korsa datacenter latensmätning av Amazons EC2 molntjänst. Mätdata användes sedan till att kvantifiera variationen av nätverkstillstånd och demonstrera dess inverkan på prestanda för system placerade ovanpå denna molntjänst. Därnäst validerades en applikations beslutslogik vanlig i populära lagringssystem genom att undersöka replika valalgoritmen. GeoPerf, ett verktyg som tillämpar symbolisk exekvering och lättviktsmodellering för systematisk testning av replika valalgoritmen, användes för att testa två populära lagringssystem och vi hittade en bugg i båda. Genom traceroute och envägslatensmätningar över EC2 demonstrerar vi ihängande korrelation mellan nätverksvägar och nätverkslatens. Vi introducerar också EdgeVar, ett verktyg som frikopplar dirigering och trängsel baserat på förändringar i nätverkslatens. Genom att tillhandahålla denna ytterligare information förbättrade vi kvaliteten på latensuppskattningen och stabiliteten på nätverkets val av väg. Slutligen introducerade vi Tectonic, ett verktyg som följer en applikations begäran och gensvar på både användare-läge och kernel-läge. Tillsammans med EdgeVar förses en komplett bild av fördröjningar associerade med varje beräkningssteg av begäran och gensvar. Med Tectonic kunde vi analysera inverkan av att dela CPUer i en virtuell miljö och kan avslöja hypervisor schemaläggningsprinciper. Vi argumenterar för betydelsen av att känna till dessa principer och föreslå användningen av de i beslutsprocessen. / <p>QC 20161101</p> Cloud Computing Geo-Distributed Systems Replica Selection Algorithms Communication Systems Kommunikationssystem
3	Efficient support for data-intensive scientific workflows on geo-distributed clouds / Support pour l'exécution efficace des workflows scientifiques à traitement intensif de données sur les cloud géo-distribués Pineda Morales, Luis Eduardo 24 May 2017 (has links) D’ici 2020, l’univers numérique atteindra 44 zettaoctets puisqu’il double tous les deux ans. Les données se présentent sous les formes les plus diverses et proviennent de sources géographiquement dispersées. L’explosion de données crée un besoin sans précédent en terme de stockage et de traitement de données, mais aussi en terme de logiciels de traitement de données capables d’exploiter au mieux ces ressources informatiques. Ces applications à grande échelle prennent souvent la forme de workflows qui aident à définir les dépendances de données entre leurs différents composants. De plus en plus de workflows scientifiques sont exécutés sur des clouds car ils constituent une alternative rentable pour le calcul intensif. Parfois, les workflows doivent être répartis sur plusieurs data centers. Soit parce qu’ils dépassent la capacité d’un site unique en raison de leurs énormes besoins de stockage et de calcul, soit car les données qu’ils traitent sont dispersées dans différents endroits. L’exécution de workflows multisite entraîne plusieurs problèmes, pour lesquels peu de solutions ont été développées : il n’existe pas de système de fichiers commun pour le transfert de données, les latences inter-sites sont élevées et la gestion centralisée devient un goulet d’étranglement. Cette thèse présente trois contributions qui visent à réduire l’écart entre les exécutions de workflows sur un seul site ou plusieurs data centers. Tout d’abord, nous présentons plusieurs stratégies pour le soutien efficace de l’exécution des workflows sur des clouds multisite en réduisant le coût des opérations de métadonnées. Ensuite, nous expliquons comment la manipulation sélective des métadonnées, classées par fréquence d’accès, améliore la performance des workflows dans un environnement multisite. Enfin, nous examinons une approche différente pour optimiser l’exécution de workflows sur le cloud en étudiant les paramètres d’exécution pour modéliser le passage élastique à l’échelle. / By 2020, the digital universe is expected to reach 44 zettabytes, as it is doubling every two years. Data come in the most diverse shapes and from the most geographically dispersed sources ever. The data explosion calls for applications capable of highlyscalable, distributed computation, and for infrastructures with massive storage and processing power to support them. These large-scale applications are often expressed as workflows that help defining data dependencies between their different components. More and more scientific workflows are executed on clouds, for they are a cost-effective alternative for intensive computing. Sometimes, workflows must be executed across multiple geodistributed cloud datacenters. It is either because these workflows exceed a single site capacity due to their huge storage and computation requirements, or because the data they process is scattered in different locations. Multisite workflow execution brings about several issues, for which little support has been developed: there is no common ile system for data transfer, inter-site latencies are high, and centralized management becomes a bottleneck. This thesis consists of three contributions towards bridging the gap between single- and multisite workflow execution. First, we present several design strategies to eficiently support the execution of workflow engines across multisite clouds, by reducing the cost of metadata operations. Then, we take one step further and explain how selective handling of metadata, classified by frequency of access, improves workflows performance in a multisite environment. Finally, we look into a different approach to optimize cloud workflow execution by studying some parameters to model and steer elastic scaling. Elastic scaling Données massives Traitement des données Scientific workflow Multisite clouds Metadata management Geo-distributed applications Elastic scaling 004.5
4	Geo-distributed multi-layer stream aggregation Cannalire, Pietro January 2018 (has links) The standard processing architectures are enough to satisfy a lot of applications by employing already existing stream processing frameworks which are able to manage distributed data processing. In some specific cases, having geographically distributed data sources requires to distribute even more the processing over a large area by employing a geographically distributed architecture.‌ The issue addressed in this work is the reduction of data movement across the network which is continuously flowing in a geo-distributed architecture from streaming sources to the processing location and among processing entities within the same distributed cluster. Reduction of data movement can be critical for decreasing bandwidth costs since accessing links placed in the middle of the network can be costly and can increase as the amount of data exchanges increase. In this work we want to create a different concept to deploy geographically distributed architectures by relying on Apache Spark Structured Streaming and Apache Kafka. The features needed for an algorithm to run on a geo-distributed architecture are provided. The algorithms to be executed on this architecture apply the windowing and the data synopses techniques to produce a summaries of the input data and to address issues of the geographically distributed architecture. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis work contributes in providing a new model of building geographically distributed architecture. The experimental results show that, for the algorithms running on top of the geo distributed architecture, the computation time is reduced on average by 70% compared to the distributed setup. Similarly, and the amount of data exchanged across the network is reduced on average by 99%, compared to the distributed setup. / Standardbehandlingsarkitekturer är tillräckligt för uppfylla behoven av många tillämpningar genom användning av befintliga ramverk för flödesbehandling med stöd för distribuerad databehandling. I specifika fall kan geografiskt fördelade datakällor kräva att databehandlingen fördelas över ett stort område med hjälp av en geografiskt distribuerad arkitektur. Problemet som behandlas i detta arbete är minskningen av kontinuerlig dataöverföring i ett nätverk med geo-distribuerad arkitektur. Minskad dataöverföring kan vara avgörande för minskade bandbreddskonstnader då åtkomst av länkar placerade i mitten av ett nätverk kan vara dyrt och öka ytterligare med tilltagande dataöverföring. I det här arbetet vill vi skapa ett nytt koncept för att upprätta geografiskt distribuerade arkitekturer med hjälp av Apache Spark Structured Streaming och Apache Kafka. Funktioner och förutsättningar som behövs för att en algoritm ska kunna köras på en geografisk distribuerad arkitektur tillhandahålls. Algoritmerna som ska köras på denna arkitektur tillämpar “windowing synopsing” och “data synopses”-tekniker för att framställa en sammanfattning av ingående data samt behandla problem beträffande den geografiskt fördelade arkitekturen. Beräkning av medelvärdet och Misra-Gries-algoritmen implementeras för att testa den konstruerade arkitekturen. Denna avhandling bidrar till att förse ny modell för att bygga geografiskt distribuerad arkitektur. Experimentella resultat visar att beräkningstiden reduceras i genomsnitt 70% för de algoritmer som körs ovanför den geo-distribuerade arkitekturen jämfört med den distribuerade konfigurationen. På liknande sätt reduceras mängden data som utväxlas över nätverket med 99% i snitt jämfört med den distribuerade inställningen. stream processing geo-distributed architecture algorithms windowing data synopses Apache Spark Structured Streaming Apache Kafka Misra-Gries algorithm flödesbehandling geo-distribuerade arkitekturen algoritmerna windowing data synopses Apache Spark Structured Streaming Apache Kafka Misra-Gries-algoritmen Computer and Information Sciences Data- och informationsvetenskap

1

Page generated in 0.0529 seconds