1

Popcorn Linux: A Compiler and Runtime for Execution Migration Between Heterogeneous-ISA Architectures

Lyerly, Robert Frantz 25 April 2019 (has links)
In recent years there has been a proliferation of parallel and heterogeneous architectures. As chip designers have hit fundamental limits in traditional processor scaling, they have begun rethinking processor architecture from the ground up. In addition to creating new classes of processors, chip designers have revisited CPU microarchitecture in order to target different computing contexts. CPUs have been optimized for low-power smartphones and extended for high-performance computing in order to achieve better performance and energy efficiency for heavy computational tasks. Although heterogeneity adds significant complexity to both hardware and software, recent works have shown tremendous power and performance benefits obtainable through specialization. It is clear that emerging systems will be increasingly heterogeneous. Many of these emerging systems couple together cores of different instruction set architectures (ISAs), due to both market forces and the potential performance and power benefits of optimizing application execution. However, unlike on symmetric multiprocessors or even asymmetric single-ISA multiprocessors, natively compiled applications cannot freely migrate between heterogeneous-ISA processors, because applications are compiled to an ISA-specific format that is incompatible with other instruction set architectures. This creates serious limitations: execution migration is a fundamental mechanism used by schedulers to reach performance or fairness goals, allows applications to migrate between heterogeneous-ISA CPUs to accelerate parallel work, and can even leverage ISA heterogeneity for security benefits. This dissertation describes system software for automatically migrating natively compiled applications across heterogeneous-ISA processors. It describes the implementation and evaluation of a complete software stack on commodity-scale heterogeneous-ISA CPUs, emulating datacenters with heterogeneous-ISA systems or future systems that tightly integrate heterogeneous-ISA CPUs via point-to-point interconnect. The dissertation describes a compiler which builds applications for heterogeneous-ISA execution migration. The compiler generates machine code for every architecture in the system and lays out the application's code and data in a common format. In addition, the compiler generates metadata used by a state transformation runtime to dynamically transform thread execution state between ISA-specific formats, allowing application threads to migrate between different ISAs. The compiler and runtime are evaluated in conjunction with a replicated-kernel operating system, which provides thread migration and distributed shared virtual memory across heterogeneous-ISA processors. This redesigned software stack is evaluated on a setup containing an ARM and an x86 processor connected point-to-point over PCIe. This dissertation shows that sub-millisecond state transformation is achievable. Additionally, it shows that for a datacenter-like workload using benchmarks from the NAS Parallel Benchmarks suite, the system can trade some performance for up to a 66% reduction in energy and up to an 11% reduction in energy-delay product. This dissertation then describes an exploration into using hardware transactional memory (HTM) to maximize scheduling flexibility.
Because applications can only migrate between ISAs at program locations with specific properties, there may be a significant delay between when the scheduler wishes to migrate an application and when the application can respond to the migration request. In order to reduce this migration response time, this dissertation describes compiler instrumentation which uses HTM to allow the scheduler to force applications to roll back to the most recently encountered program location suitable for migration. This is evaluated both in terms of overhead and responsiveness to migration requests. In addition to showing the viability of the infrastructure for optimizing workload placement in a heterogeneous-ISA datacenter, this dissertation also demonstrates using the infrastructure to accelerate multithreaded applications. It describes a new OpenMP runtime named libopenpop that is optimized for executing applications in heterogeneous-ISA systems with distributed shared virtual memory. The runtime utilizes synchronization primitives that enable scale-out execution across rack-scale systems and new work distribution mechanisms that predict the best partitioning of parallel work across CPUs with diverse architectural characteristics. libopenpop demonstrates sizable improvements over a naïve OpenMP implementation – a 38x improvement in multi-server barrier latency, a 5.4x improvement in multi-server data reductions, and a geometric mean speedup of 4.04x for scalable applications in an 8-node x86-64 cluster. For a heterogeneous system composed of a highly-clocked x86 server and a highly-parallel ARM server, libopenpop delivers up to a 4.7x speedup and a geometric mean speedup of 41% across benchmarks from several benchmark suites versus the best single-node homogeneous execution. Finally, this dissertation describes leveraging the compiler and state transformation runtime to provide enhanced security for applications. Because the compiler provides detailed information about the stack layout of applications, it can be leveraged to defend against exploits such as stack-smashing and return-oriented programming attacks. This dissertation describes Chameleon, a runtime which uses the compiler and state transformation infrastructure to continuously re-randomize the stack layout and code of vulnerable applications to thwart attackers. Chameleon attaches to applications using existing operating system interfaces and periodically switches the application to new randomized stack layouts and code by rewriting the stack. Chameleon enhances security with little overhead – it disrupts a geometric mean of 76.32% of code gadgets in benchmark binaries, randomizes stack element locations with a geometric mean of 3 potential randomized locations, and incurs 1.1% overhead when re-randomizing every 50 milliseconds, making it extremely difficult for attackers to exploit target applications. / Doctor of Philosophy / Computer processors have experienced unprecedented performance improvements over the past 50 years. However, due to physical limitations of how processors execute, this performance growth has started to slow in recent years. In order to continue scaling performance, chip designers have begun diversifying processor designs to meet different performance and power consumption targets. Processors specialized for different contexts use various instruction set architectures (ISAs), the sets of operations the hardware makes available to software.
Programs built for one instruction set architecture are not compatible with others, requiring developers to manually bridge the gap with complex application code. This leads to brittle applications and prevents the system software managing the processors from adapting workloads to match processor characteristics. This dissertation presents the Popcorn Linux system software, which provides transparent support for running applications across computers composed of processors of multiple ISAs. Popcorn Linux provides the ability to migrate applications between these processors without requiring developers to add any application instrumentation – the system software manages all the details of building and migrating applications. An evaluation of Popcorn Linux shows that transparently migrating applications between diverse processors provides power and performance benefits in a variety of scenarios. Additionally, this dissertation describes leveraging the Popcorn Linux software infrastructure to harden applications against attackers seeking to hijack applications for malicious purposes.
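
A minimal sketch (in C) of the equivalence-point idea described above: the compiler emits metadata for program points where a thread's live state can be mapped between ISA-specific formats, and inserts checks so that a migration request takes effect at the next such point. The runtime hooks and the target-ISA id below are hypothetical stand-ins, not Popcorn Linux's actual API.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative stand-ins for the migration runtime (assumed names,
     * not the actual Popcorn Linux interface). */
    static bool migration_requested(void) { return false; /* scheduler flag */ }
    static void transform_and_migrate(int target_isa) {
        printf("transforming state; migrating to ISA %d\n", target_isa);
    }

    /* The compiler emits per-ISA metadata for points like the one below,
     * where live state maps cleanly between ISA-specific formats. */
    static void parallel_loop(double *data, int n, int chunk) {
        for (int off = 0; off < n; off += chunk) {
            int len = (chunk < n - off) ? chunk : n - off;
            for (int i = 0; i < len; i++)   /* the actual computation */
                data[off + i] *= 2.0;
            /* Equivalence point: migration is legal here. */
            if (migration_requested())
                transform_and_migrate(1 /* hypothetical target ISA id */);
        }
    }

    int main(void) {
        double data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        parallel_loop(data, 8, 3);
        printf("data[7] = %.1f\n", data[7]);
        return 0;
    }

The HTM instrumentation described above tightens this loop: rather than waiting for the next equivalence point, the thread runs between points inside a hardware transaction that can be aborted and rolled back to the previous point when a migration request arrives.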
2

Measuring, modeling, and optimizing counterintuitive performance phenomena in power-scalable, parallel systems

Chang, Hung-Ching 09 April 2015 (has links)
The demands of exascale computing systems and applications have pushed for a rapid, continual design paradigm coupled with increasing design complexity from the interaction between the application, the middleware, and the underlying system hardware, which forms a breeding ground for inefficiency. This work seeks to improve system efficiency by exposing the root causes of unexpected performance slowdowns (e.g., lower performance at higher processor speeds) that occur more frequently in power-scalable systems where raw processor speed varies. More precisely, we perform an exhaustive empirical study that conclusively shows that increasing processor speed often reduces performance and wastes energy. Our experimental work shows that the frequency of occurrence and the magnitude of slowdowns grow with clock frequency and parallelism, indicating that such slowdowns will increasingly be observed with trends in processor and system design. Performance speedups at lower frequencies (or slowdowns at higher frequencies) have been anecdotally observed in the literature since 2004, but no research has explained or exploited this phenomenon. This work conclusively demonstrates that performance slowdowns during processor speedup phases can exceed 47% in common I/O workloads. Our hypothesis challenges (and ultimately debunks) a fundamental assumption in computer systems: that faster processor speeds result in the same or better performance. In this work, with the use of code and kernel instrumentation, exhaustive experiments, and deep insight into the inner workings of the Linux I/O subsystem, I overcome the aforementioned challenges of variance, complexity, and nondeterminism and identify I/O resource contention as the root cause of the slowdowns during processor speedup. Specifically, such contention arises in the Linux kernel when the journaling block device (JBD) interacts with the ext3/4 file system, introducing file-write and file-synchronization delays. To fully explain how such I/O contention causes these performance anomalies, I propose analytical models of resource contention among I/O threads that describe the root cause of the observed I/O slowdowns when processors speed up. To this end, I introduce LUC, a runtime system that limits the unintended consequences of power scaling, and demonstrate its effectiveness for two critical parallel, transaction-oriented workloads: a mail server (varmail) and online transaction processing (oltp). / Ph. D.
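
A minimal sketch of the kind of experiment behind these findings: pin a core to a frequency through the standard Linux cpufreq sysfs interface (requires root and the "userspace" scaling governor), then time an fsync-heavy write loop at each step to look for slowdowns at higher speeds. The file path, iteration count, and frequency values are illustrative, not the dissertation's actual setup.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* Pin cpu0 to a frequency (kHz) via the standard cpufreq sysfs file. */
    static void set_freq_khz(const char *khz) {
        int fd = open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                      O_WRONLY);
        if (fd >= 0) {
            write(fd, khz, strlen(khz));
            close(fd);
        }
    }

    /* Time an fsync-heavy write loop: the pattern in which higher clock
     * speeds were observed to hurt I/O performance. */
    static double timed_sync_writes(const char *path, int iters) {
        char buf[4096];
        memset(buf, 'x', sizeof buf);
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < iters; i++) {
            write(fd, buf, sizeof buf);
            fsync(fd);               /* forces journal (JBD) interaction */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        close(fd);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        const char *freqs[] = { "1200000", "2400000" };  /* example steps */
        for (int f = 0; f < 2; f++) {
            set_freq_khz(freqs[f]);
            printf("%7s kHz: %.3f s\n", freqs[f],
                   timed_sync_writes("/tmp/io_probe.dat", 1000));
        }
        return 0;
    }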
3

Replication of Concurrent Applications in a Shared Memory Multikernel

Wen, Yuzhong 19 July 2016 (has links)
State Machine Replication (SMR) has become the de facto methodology for building replication-based fault-tolerant systems. Current SMR systems usually involve multiple machines, each acting as a replica of the others. However, having multiple machines increases infrastructure cost, in both hardware and power consumption. For tolerating non-critical CPU and memory failures that do not crash the entire machine, there is no need for extra machines; intra-machine replication is a good fit for this scenario. However, current intra-machine replication approaches do not provide strong isolation among the replicas, which allows faults to propagate from one replica to another. In order to provide an intra-machine replication technique with strong isolation, in this thesis we present an SMR system on a multikernel OS. We implemented a replication system that is capable of replicating concurrent applications on different kernel instances of a multikernel OS. Modern concurrent applications can be deployed on our system with minimal code modification. Additionally, our system provides two different replication modes that allow the user to switch freely according to the application type. With an evaluation of multiple real-world applications, we show that those applications can be easily deployed on our system with 0 to 60 lines of changes to the source code. From the performance perspective, our system introduces only 0.23% to 63.39% overhead compared to non-replicated execution. / Master of Science
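
The SMR invariant the thesis builds on can be shown in a few lines: if every replica applies the same totally ordered log of deterministic operations, all replicas converge to the same state. The standalone C sketch below only illustrates the model; the actual system replicates whole concurrent applications across kernel instances of a multikernel OS.

    #include <stdio.h>

    typedef enum { OP_DEPOSIT, OP_WITHDRAW } op_kind;
    typedef struct { op_kind kind; int amount; } op;

    /* Deterministic state machine: same log -> same final state. */
    static int apply_log(const op *log, int n) {
        int balance = 0;
        for (int i = 0; i < n; i++)
            balance += (log[i].kind == OP_DEPOSIT) ? log[i].amount
                                                   : -log[i].amount;
        return balance;
    }

    int main(void) {
        const op agreed_log[] = {        /* order fixed by agreement layer */
            { OP_DEPOSIT, 100 }, { OP_WITHDRAW, 30 }, { OP_DEPOSIT, 5 },
        };
        /* Two "replicas" (kernel instances in the real system) apply it. */
        int r1 = apply_log(agreed_log, 3);
        int r2 = apply_log(agreed_log, 3);
        printf("replica states: %d == %d\n", r1, r2);
        return 0;
    }

The hard part for concurrent applications, and the focus of the thesis, is making the operations deterministic in the first place: thread interleavings must be captured and replayed identically on every replica.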
4

Automatic Scheduling of Compute Kernels Across Heterogeneous Architectures

Lyerly, Robert Frantz 24 June 2014 (has links)
The world of high-performance computing has shifted from increasing single-core performance to extracting performance from heterogeneous multi- and many-core processors due to the power, memory and instruction-level parallelism walls. All trends point towards increased processor heterogeneity as a means for increasing application performance, from smartphones to servers. These various architectures are designed for different types of applications — traditional "big" CPUs (like the Intel Xeon) are optimized for low latency, while other architectures (such as the NVidia Tesla K20x) are optimized for high throughput. These architectures have different tradeoffs and different performance profiles, which can mean fantastic performance gains for the right types of applications. However, applications that are ill-suited for a given architecture may experience significant slowdown; it is therefore imperative that applications are scheduled onto the correct processor. In order to perform this scheduling, applications must be analyzed to determine their execution characteristics. Traditionally this application-to-hardware mapping was determined statically by the programmer. However, this requires intimate knowledge of the application and the underlying architecture, and it precludes load balancing by the system. We demonstrate and empirically evaluate a system for automatically scheduling compute kernels by extracting program characteristics and applying machine learning techniques. We develop a machine learning process that is system-agnostic and works for a variety of contexts (e.g. embedded, desktop/workstation, server). Finally, we perform scheduling in a workload-aware and workload-adaptive manner for these compute kernels. / Master of Science
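
A minimal sketch of the approach: static features are extracted from a compute kernel and fed to a model trained offline, which predicts the best device for the kernel. The features and the decision rule below are illustrative placeholders, not the thesis's actual feature set or learned model.

    #include <stdio.h>

    typedef struct {
        double mem_ops_per_flop;   /* memory intensity       */
        double branch_fraction;    /* control divergence     */
        double parallel_iters;     /* available parallelism  */
    } kernel_features;

    typedef enum { DEV_CPU, DEV_GPU } device;

    /* Stand-in for a model learned offline from profiled training kernels. */
    static device predict_device(kernel_features f) {
        if (f.parallel_iters > 1e5 && f.branch_fraction < 0.1)
            return DEV_GPU;        /* throughput-friendly: wide, regular  */
        return DEV_CPU;            /* latency-friendly: branchy or narrow */
    }

    int main(void) {
        kernel_features matmul = { 0.3, 0.02, 1e6 };
        kernel_features parser = { 1.5, 0.40, 2e3 };
        printf("matmul -> %s\n", predict_device(matmul) == DEV_GPU ? "GPU" : "CPU");
        printf("parser -> %s\n", predict_device(parser) == DEV_GPU ? "GPU" : "CPU");
        return 0;
    }

In the full system the model is trained on profiled kernels, and the scheduler additionally weighs the current load on each device: the workload-aware, workload-adaptive component the abstract mentions.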
5

Hardening Software Against Memory Errors and Attacks

Novark, Albert Eugene 01 February 2011 (has links)
Programs written in C and C++ are susceptible to a number of memory errors, including buffer overflows and dangling pointers. At best, these errors cause crashes or performance degradation. At worst, they enable security vulnerabilities, allowing denial-of-service or remote code execution. Existing runtime systems provide little protection against these errors. They allow minor errors to cause crashes and allow attackers to consistently exploit vulnerabilities. In this thesis, we introduce a series of runtime systems that protect deployed applications from memory errors. To guide the design of our systems, we analyze how errors interact with memory allocators to allow consistent exploitation of vulnerabilities. Our systems improve software in two ways: first, they tolerate memory errors, allowing programs to continue proper execution. Second, they decrease the probability of successfully exploiting security vulnerabilities caused by memory errors. Our first system, Archipelago, protects exceptionally sensitive server applications against severe errors using an object-per-page randomized allocator. It provides near-100% protection against most buffer overflows. Our second system, DieHarder, combines ideas from Archipelago, DieHard, and other systems to enable maximal protection against attacks while incurring minimal runtime and memory overhead. Our final system, Exterminator, automatically corrects heap-based buffer overflows and dangling pointers without requiring programmer intervention. Exterminator relies on both a low-overhead randomized allocator and statistical inference techniques to automatically isolate and correct errors in deployed applications.
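
The object-per-page idea behind Archipelago can be sketched briefly: back each allocation with its own page(s) from mmap and follow it with an inaccessible guard page, so a buffer overflow faults immediately instead of silently corrupting a neighboring object. This is a simplified illustration of the isolation mechanism only; the real allocator adds randomized placement across a large address space.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *page_alloc(size_t size) {
        long page = sysconf(_SC_PAGESIZE);
        size_t span = ((size + page - 1) / page + 1) * page; /* +1 guard page */
        char *base = mmap(NULL, span, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        /* Make the trailing page inaccessible. */
        mprotect(base + span - page, page, PROT_NONE);
        /* Place the object flush against the guard page (alignment is
         * ignored here for simplicity) so even a 1-byte overflow traps. */
        return base + (span - page) - size;
    }

    int main(void) {
        char *buf = page_alloc(100);
        strcpy(buf, "fits within bounds");
        printf("%s\n", buf);
        /* buf[100] = 'x'; */  /* would SIGSEGV on the guard page */
        return 0;
    }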
6

Design and implementation of a solver for fluid dynamics problems on many-core architectures / Design of generic modular solutions for PDE solvers for modern architectures

Genet, Damien 12 December 2014 (has links)
Numerical simulation is an integral part of the analysis process. Whether one wants to design the profile of a vehicle or to predict the outcome of oil drilling, numerical simulation has become a complementary tool to theory and experimentation. This tool must produce accurate results in as little time as possible; to that end, we have at our disposal precise numerical methods and powerful computing machines. The tool must be generic with respect to meshes, solution order and numerical methods, and must maintain its performance on modern computing machines with their complex hierarchies of computing units. In this thesis we present the mathematical background of two classes of numerical schemes, the continuous and discontinuous finite element methods. We then present the challenges of designing a platform that takes all of these constraints into account. Next, we focus on the sub-problem of assembly on top of a runtime system; the assembly operation appears in linear algebra in multi-frontal methods and in simulation applications that assemble a linear system. Finally, we conclude with an assessment of the AeroSol platform and outline possible directions for its evolution. / Numerical simulation is nowadays an essential part of engineering analysis, be it to design a new plane or to detect underground oil reservoirs. Numerical simulations have indeed become an important complement to theoretical and experimental investigation, allowing one to reduce the cost of engineering design processes. In order to achieve a high level of precision, one needs to increase the resolution of the computational domain. To keep getting results in reasonable time, one must find a way to speed up computations: we use high-performance computing (HPC) to exploit the complex architecture of modern supercomputers. Under these two constraints, and others such as the genericity of finite elements and the mesh dimension, we developed a new platform, AeroSol. In this thesis, we present the mathematical background and the two types of schemes that are implemented in the platform, the continuous finite element method and the discontinuous one. We then present the design choices made in the platform and study a sub-problem, the assembly operation, which can be found in linear algebra multi-frontal methods.
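
The assembly operation mentioned above is easy to illustrate: local element matrices are scatter-added into a global system matrix through a connectivity map. The 1-D example below uses dense global storage for brevity; a platform like AeroSol assembles into sparse structures.

    #include <stdio.h>

    #define NODES 4   /* global degrees of freedom          */
    #define ELEMS 3   /* 1-D mesh: elements (0,1)(1,2)(2,3) */

    int main(void) {
        double K[NODES][NODES] = { 0 };
        /* Element stiffness of a 1-D linear element (unit coefficient). */
        const double ke[2][2] = { { 1.0, -1.0 }, { -1.0, 1.0 } };
        const int conn[ELEMS][2] = { { 0, 1 }, { 1, 2 }, { 2, 3 } };

        for (int e = 0; e < ELEMS; e++)        /* loop over elements    */
            for (int i = 0; i < 2; i++)        /* scatter-add ke into K */
                for (int j = 0; j < 2; j++)
                    K[conn[e][i]][conn[e][j]] += ke[i][j];

        for (int i = 0; i < NODES; i++) {      /* print assembled matrix */
            for (int j = 0; j < NODES; j++)
                printf("%6.1f", K[i][j]);
            printf("\n");
        }
        return 0;
    }

Running this prints the familiar tridiagonal stiffness matrix, with interior diagonal entries of 2.0 where the contributions of two neighboring elements overlap; those concurrent `+=` updates to shared entries are exactly what makes parallel assembly over a runtime system interesting.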
7

Static guarantees for coordinated components : a statically typed composition model for stream-processing networks

Penczek, Frank January 2012 (has links)
Does your program do what it is supposed to be doing? Without running the program, providing an answer to this question is much harder if the language does not support static type checking. Of course, even if compile-time checks are in place only certain errors will be detected: compilers can only second-guess the programmer's intention. But type-based techniques go a long way in assisting programmers to detect errors in their computations earlier on. The question of whether a program behaves correctly is even harder to answer if the program consists of several parts that execute concurrently and need to communicate with each other. Compilers of standard programming languages are typically unable to infer information about how the parts of a concurrent program interact with each other, especially where explicit threading or message-passing techniques are used. Hence, correctness guarantees are often conspicuously absent. Concurrency management in an application is a complex problem. However, it is largely orthogonal to the actual computational functionality that a program realises. Because of this orthogonality, the problem can be considered in isolation. The largest possible separation between concurrency and functionality is achieved if a dedicated language is used for concurrency management, i.e. an additional program manages the concurrent execution and interaction of the computational tasks of the original program. Such an approach not only helps programmers to focus on the core functionality and on the exploitation of concurrency independently, it also allows for a specialised analysis mechanism geared towards concurrency-related properties. This dissertation shows how an approach that completely decouples coordination from computation is a very supportive substrate for inferring static guarantees of the correctness of concurrent programs. Programs are described as streaming networks connecting independent components that implement the computations of the program, where the network describes the dependencies and interactions between components. A coordination program only requires an abstract notion of computation inside the components and may therefore be used as a generic and reusable design pattern for coordination. A type-based inference and checking mechanism analyses such streaming networks and provides comprehensive guarantees of the consistency and behaviour of coordination programs. Concrete implementations of components are deliberately left out of the scope of coordination programs: components may be implemented in an external language, for example C, to provide the desired computational functionality. Based on this separation, a concise semantic framework allows for step-wise interpretation of coordination programs without requiring concrete implementations of their components. The framework also provides clear guidance for the implementation of the language. One such implementation is presented, and hands-on examples demonstrate how the language is used in practice.
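
The flavor of the guarantee can be suggested with a small sketch: when components expose only typed inputs and outputs, a composition with mismatched types is rejected before the program ever runs. C's function types below give only a crude approximation of the stream-record type inference the dissertation describes.

    #include <stdio.h>

    typedef struct { int id; double payload; } record_a;
    typedef struct { double result; } record_b;

    /* Components expose typed input/output only; bodies are opaque to
     * the coordination layer. */
    static record_b scale(record_a in) { return (record_b){ in.payload * 2 }; }
    static void sink(record_b in) { printf("result: %f\n", in.result); }

    /* Composition: types must line up or the compiler rejects the program. */
    int main(void) {
        record_a input = { 1, 21.0 };
        sink(scale(input));   /* ok: record_a -> record_b -> void       */
        /* sink(input); */    /* would be a compile-time type error     */
        return 0;
    }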
8

Safe Concurrent Programming and Execution

Pyla, Hari Krishna 05 March 2013 (has links)
The increasing prevalence of multi- and many-core processors has brought the issues of concurrency and parallelism to the forefront of everyday computing. Even for applications amenable to traditional parallelization techniques, the subtleties of concurrent programming are known to introduce concurrency bugs. Because of the potential for concurrency bugs, programmers find it hard to write correct concurrent code. To take full advantage of parallel shared-memory platforms, application programmers need safe and efficient mechanisms that can support a wide range of parallel applications. In addition, a large body of applications are inherently hard to parallelize; their data and control dependencies impose execution-order constraints that preclude the use of traditional parallelization techniques. Sensitive to their input data, a substantial number of applications fail to scale well, leaving cores idle. To improve the performance of such applications, application programmers need effective mechanisms that can fully leverage multi- and many-core architectures. These challenges stand in the way of realizing the true potential of emerging many-core platforms. The techniques described in this dissertation address these challenges. Specifically, this dissertation contributes techniques to transparently detect and eliminate several classes of concurrency bugs, including deadlocks, asymmetric write-write data races, priority inversion, livelocks, order violations, and bugs that stem from the presence of asynchronous signaling and locks. A second major contribution of this dissertation is a programming framework that exploits coarse-grain speculative parallelism to improve the performance of otherwise hard-to-parallelize applications. / Ph. D.
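
One classic transparent tactic against deadlocks (a sketch of the general idea, not necessarily this dissertation's mechanism) is to interpose on lock acquisition and impose a canonical ordering, which removes the circular-wait condition.

    #include <pthread.h>
    #include <stdio.h>

    /* Acquire two locks in a canonical (address) order, regardless of the
     * order the caller names them: no circular wait, hence no deadlock. */
    static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
        if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    }

    static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
    static int shared = 0;

    static void *worker(void *arg) {
        pthread_mutex_t **locks = arg;   /* each thread names its own order */
        for (int i = 0; i < 100000; i++) {
            lock_pair(locks[0], locks[1]);
            shared++;
            pthread_mutex_unlock(&m2);
            pthread_mutex_unlock(&m1);
        }
        return NULL;
    }

    int main(void) {
        pthread_mutex_t *fwd[2] = { &m1, &m2 }, *rev[2] = { &m2, &m1 };
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, fwd);
        pthread_create(&t2, NULL, worker, rev);  /* opposite order: still safe */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);
        return 0;
    }

A transparent runtime cannot rewrite source like this; instead it interposes on the locking API at load time and applies such policies (or containment and rollback) underneath unmodified binaries.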
9

Scalable Task Parallel Programming in the Partitioned Global Address Space

Dinan, James S. 02 September 2010 (has links)
No description available.
10

Programming High-Performance Clusters with Heterogeneous Computing Devices

Aji, Ashwin M. 19 May 2015 (has links)
Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by the application developers. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization. / Ph. D.
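
The data-movement burden MPI-ACC removes is visible in the manual staging that stock MPI forces on GPU-resident data: copy to the host, then send. The sketch below shows that two-step pattern using standard CUDA and MPI calls (error handling omitted for brevity); with MPI-ACC or a CUDA-aware MPI, the two steps collapse into a single MPI_Send on the device pointer, letting the runtime choose and pipeline the optimal transfer path.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Manual staging: what application code must do without MPI-ACC. */
    void send_gpu_buffer(const double *dev_buf, int n, int dest) {
        double *host_buf = malloc(n * sizeof *host_buf);
        /* Step 1: stage device data into host memory. */
        cudaMemcpy(host_buf, dev_buf, n * sizeof *host_buf,
                   cudaMemcpyDeviceToHost);
        /* Step 2: send the staged copy over MPI. */
        MPI_Send(host_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        free(host_buf);
        /* With MPI-ACC, the application instead passes dev_buf directly
         * to the send call and the runtime handles (and pipelines) the
         * device-to-host-to-network movement internally. */
    }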
