Global ETD Search

321	Automating telemetry- and trace-based analytics on large-scale distributed systems Ateş, Emre 28 September 2020 Large-scale distributed systems, such as supercomputers, cloud computing platforms, and distributed applications, routinely suffer from slowdowns and crashes due to software and hardware problems, resulting in reduced efficiency and wasted resources. These large-scale systems typically deploy monitoring or tracing systems that gather a variety of statistics about the state of the hardware and the software. State-of-the-art methods either analyze this data manually, or design unique automated methods for each specific problem. This thesis builds on the vision that generalized automated analytics methods on the data sets collected from these complex computing systems provide critical information about the causes of the problems, and this analysis can then enable proactive management to improve performance, resilience, efficiency, or security significantly beyond current limits. This thesis seeks to design scalable, automated analytics methods and frameworks for large-scale distributed systems that minimize dependency on expert knowledge, automate parts of the solution process, and help make systems more resilient. In addition to analyzing data that is already collected from systems, our frameworks also identify what to collect from where in the system, such that the collected data would be concise and useful for manual analytics. We focus on two data sources for conducting analytics: numeric telemetry data, which is typically collected from operating system or hardware counters, and end-to-end traces collected from distributed applications.
This thesis makes the following contributions in large-scale distributed systems: (1) Designing a framework for accurately diagnosing previously encountered performance variations, (2) designing a technique for detecting (unwanted) applications running on the systems, (3) developing a suite for reproducing performance variations that can be used to systematically develop analytics methods, (4) designing a method to explain predictions of black-box machine learning frameworks, and (5) constructing an end-to-end tracing framework that can dynamically adjust instrumentation for effective diagnosis of performance problems. / 2021-09-28T00:00:00Z Computer engineering Distributed systems Explainability High performance computing Machine learning Monitoring Tracing
322	Developing for Resilience: Introducing a Chaos Engineering tool Monge Solano, Ignacio, Matók, Enikő January 2020 Software complexity continues to accelerate, as new tools, frameworks, and technologies become available. This, in turn, increases its fragility and liability. Despite the amount of investment to test and harden their systems, companies still pay the price of failure. To withstand this fast-paced development environment and ensure software availability, large-scale systems must be built with resilience in mind. Chaos Engineering is a new practice that aims to assess some of these challenges. In this thesis, the methodology, requirements, and iterations of the system design and architecture for a chaos engineering tool are presented. In a matter of only a couple of months and the working hours of two engineers, it was possible to build a tool that is able to shed light on the attributes that make the targeted system resilient as well as the weaknesses in its failure handling mechanisms. This tool greatly reduces the otherwise manual testing labor and allows software engineering teams to find potentially costly failures. These results prove the benefits that many companies could experience in their return on investment by adopting the practice of Chaos Engineering. chaos engineering fault injection resilience testing distributed systems Engineering and Technology
323	Restoring Consistency after Network Partitions Asplund, Mikael January 2007 The software industry is facing a great challenge. While systems get more complex and distributed across the world, users are becoming more dependent on their availability. As systems increase in size and complexity so does the risk that some part will fail. Unfortunately, it has proven hard to tackle faults in distributed systems without a rigorous approach. Therefore, it is crucial that the scientific community can provide answers to how distributed computer systems can continue functioning despite faults. Our contribution in this thesis concerns a special class of faults that occurs when network links fail in such a way that parts of the network become isolated; such faults are termed network partitions. We consider the problem of how systems that have integrity constraints on data can continue operating in the presence of a network partition. Such a system must act optimistically while the network is split and then perform some kind of reconciliation to restore consistency afterwards. We have formally described four reconciliation algorithms and proven them correct. The novelty of these algorithms lies in the fact that they can restore consistency after network partitions in a system with integrity constraints and that one of the protocols allows the system to provide service during the reconciliation. We have implemented and evaluated the algorithms using simulation and as part of a partition-tolerant CORBA middleware. The results indicate that it pays off to act optimistically and that it is worthwhile to provide service during reconciliation. distributed systems fault tolerance network partitions dependability integrity constraints Computer Sciences
324	Cohérence dans les systèmes de stockage distribués : fondements théoriques avec applications au cloud storage / Consistency in distributed storage systems: theoretical foundations with applications to cloud storage Viotti, Paolo 06 April 2017 Engineering distributed systems is an onerous task: the design goals of performance, correctness and reliability are intertwined in complex tradeoffs, which have been outlined by multiple theoretical results. These tradeoffs have become increasingly important as computing and storage have shifted towards distributed architectures. Additionally, the general lack of systematic approaches to tackle distribution in modern programming tools has worsened these issues, especially as nowadays most programmers have to take on the challenges of distribution. As a result, there exists an evident divide between programming abstractions, application requirements and storage semantics, which hinders the work of designers and developers. This thesis presents a set of contributions towards the overarching goal of designing reliable distributed storage systems, by examining these issues through the prism of consistency. We begin by providing a uniform, declarative framework to formally define consistency semantics. We use this framework to describe and compare over fifty non-transactional consistency semantics proposed in previous literature. The declarative and composable nature of this framework allows us to build a partial order of consistency models according to their semantic strength.
We show the practical benefits of composability by designing and implementing Hybris, a storage system that leverages different models and semantics to improve over the weak consistency generally offered by public cloud storage platforms. We demonstrate Hybris’ efficiency and show that it can tolerate arbitrary faults of cloud stores at the cost of tolerating outages. Finally, we propose a novel technique to verify the consistency guarantees offered by real-world storage systems. This technique leverages our declarative approach to consistency: we consider consistency semantics as invariants over graph representations of storage system executions. A preliminary implementation proves this approach practical and useful in improving over the state of the art in consistency verification. Storage Consistency Scalability Distributed systems
325	A Method and Tool for Finding Concurrency Bugs Involving Multiple Variables with Application to Modern Distributed Systems Sun, Zhuo 05 November 2018 Concurrency bugs are extremely hard to detect due to the huge interleaving space. They are happening in the real world more often because of the prevalence of multi-threaded programs taking advantage of multi-core hardware, and of microservice-based distributed systems moving more and more applications to the cloud. As the most common non-deadlock concurrency bugs, atomicity violations are studied in many recent works; however, those methods apply only to single-variable atomicity violations and do not consider the specific challenges of distributed systems that use both pessimistic and optimistic concurrency control. This dissertation presents a tool that uses model checking to predict atomicity-violation concurrency bugs involving two shared variables or shared resources. We developed a unique method for inferring correlations between shared variables in multi-threaded programs and shared resources in microservice-based distributed systems; it is based on dynamic analysis and is able to detect correlations that would be missed by static analysis. For multi-threaded programs, we use a binary instrumentation tool to capture runtime information about shared variables and synchronization events, and for microservice-based distributed systems, we use a web proxy to capture HTTP-based traffic about API calls and the shared resources they access, including distributed locks. Based on the detected correlations and the runtime trace, the tool can explore a vast interleaving space of a multi-threaded program or a microservice-based distributed system given a small set of captured test runs.
It is applicable to large real-world systems and can predict atomicity violations in multi-threaded programs that were missed by other related works, as well as a couple of previously unknown atomicity violations in real-world open-source microservice-based systems. A limitation is that redundant model checking may be performed if two recorded interleaved traces yield the same partial order model. atomicity violation model checking multi-threaded programs distributed systems multiple variable correlations Software Engineering Systems Architecture
326	Efficient data and metadata processing in large-scale distributed systems Shi, Rong, Shi January 2018 No description available. Computer Science Computer Engineering
327	Hardware Utilisation Techniques for Data Stream Processing Meldrum, Max January 2019 Recent years have seen an increase in use of the stream processing architecture to compose continuous analytics applications. This thesis presents the design of a Rust-based stream processor that adopts two separate techniques to tackle existing weaknesses in modern production-grade stream processors. The first technique employs a data analytics language on top of the streaming runtime, in order to provide both dataflow and low-level compiler optimisations. This technique is motivated by an analysis of the impact that the lack of compiler integration may have on the end-to-end performance of streaming pipelines in Apache Flink. In the second technique, streaming operators are scheduled using a task-parallel approach to boost performance for skewed data distributions. The experimental results for data-parallel streaming pipelines in this thesis demonstrate that the scheduling model of the prototype achieves performance improvements in skewed scenarios without exhibiting any significant losses in performance during uniform distributions. Data Analytics Data Stream Processing Distributed Systems Computer and Information Sciences
328	Collaborative Web-Based Mapping of Real-Time Sensor Data Gadea, Cristian January 2011 The distribution of real-time GIS (Geographic Information System) data among users is now more important than ever as it becomes increasingly affordable and important for scientific and government agencies to monitor environmental phenomena in real-time. A growing number of sensor networks are being deployed all over the world, but there is a lack of solutions for their effective monitoring. Increasingly, GIS users need access to real-time sensor data from a variety of sources, and the data must be represented in a visually-pleasing way and be easily accessible. In addition, users need to be able to collaborate with each other to share and discuss specific sensor data. The real-time acquisition, analysis, and sharing of sensor data from a large variety of heterogeneous sensor sources is currently difficult due to the lack of a standard architecture to properly represent the dynamic properties of the data and make it readily accessible for collaboration between users. This thesis will present a JEE-based publisher/subscriber architecture that allows real-time sensor data to be displayed collaboratively on the web, requiring users to have nothing more than a web browser and Internet connectivity to gain access to that data. The proposed architecture is evaluated by showing how an AJAX-based and a Flash-based web application are able to represent the real-time sensor data within novel collaborative environments. By using the latest web-based technology and relevant open standards, this thesis shows how map data and GIS data can be made more accessible, more collaborative and generally more useful. Real-Time Web Collaboration Distributed Systems Geographic Information Systems Web Collaboration
329	Reducing Long Tail Latencies in Geo-Distributed Systems Bogdanov, Kirill January 2016 Computing services are highly integrated into modern society. Millions of people rely on these services daily for communication, coordination, trading, and access to information. To meet high demands, many popular services are implemented and deployed as geo-distributed applications on top of third-party virtualized cloud providers. However, the nature of such deployment provides variable performance characteristics. To deliver high quality of service, such systems strive to adapt to ever-changing conditions by monitoring changes in state and making run-time decisions, such as choosing server peering, replica placement, and quorum selection. In this thesis, we seek to improve the quality of run-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic implemented in these systems, and (3) a clear view into the network and system states, which allows these services to make better-informed decisions. We performed a long-term cross-datacenter latency measurement of the Amazon EC2 cloud provider. We used this data to quantify the variability of network conditions and demonstrated its impact on the performance of the systems deployed on top of this cloud provider. Next, we validated the decision logic used in popular storage systems by examining replica selection algorithms. We introduce GeoPerf, a tool that uses symbolic execution and lightweight modeling to perform systematic testing of replica selection algorithms. We applied GeoPerf to test two popular storage systems and we found one bug in each. Then, using traceroute and one-way delay measurements across EC2, we demonstrated persistent correlation between network paths and network latency.
We introduce EdgeVar, a tool that decouples routing-based and congestion-based changes in network latency. By providing this additional information, we improved the quality of latency estimation, as well as increased the stability of network path selection. Finally, we introduce Tectonic, a tool that tracks an application’s requests and responses both at the user and kernel levels. In combination with EdgeVar, it provides a complete view of the delays associated with each processing stage of a request and response. Using Tectonic, we analyzed the impact of sharing CPUs in a virtualized environment and can infer the hypervisor’s scheduling policies. We argue for the importance of knowing these policies and propose to use them in applications’ decision-making process. Cloud Computing Geo-Distributed Systems Replica Selection Algorithms Communication Systems
330	Ablation Programming for Machine Learning Sheikholeslami, Sina January 2019 As machine learning systems are being used in an increasing number of applications, from analysis of satellite sensory data and health-care analytics to smart virtual assistants and self-driving cars, they are also becoming more and more complex. This means that more time and computing resources are needed in order to train the models, and the number of design choices and hyperparameters will increase as well. Due to this complexity, it is usually hard to explain the effect of each design choice or component of the machine learning system on its performance. A simple approach for addressing this problem is to perform an ablation study, a scientific examination of a machine learning system in order to gain insight on the effects of its building blocks on its overall performance. However, ablation studies are currently not part of the standard machine learning practice. One of the key reasons for this is the fact that currently, performing an ablation study requires major modifications in the code as well as extra compute and time resources. On the other hand, experimentation with a machine learning system is an iterative process that consists of several trials. A popular approach for execution is to run these trials in parallel, on an Apache Spark cluster. Since Apache Spark follows the Bulk Synchronous Parallel model, parallel execution of trials includes several stages, between which there will be barriers. This means that in order to execute a new set of trials, all trials from the previous stage must be finished. As a result, we usually end up wasting a lot of time and computing resources on unpromising trials that could have been stopped soon after their start. We have attempted to address these challenges by introducing MAGGY, an open-source framework for asynchronous and parallel hyperparameter optimization and ablation studies with Apache Spark and TensorFlow.
This framework allows for better resource utilization as well as ablation studies and hyperparameter optimization in a unified and extendable API. Distributed Machine Learning Distributed Systems Ablation Studies Apache Spark Keras Hopsworks Computer and Information Sciences

Search results