Global ETD Search

1	Compiler-Directed Error Resilience for Reliable Computing Liu, Qingrui 08 August 2018 (has links) Error resilience has become as important as power and performance in modern computing architecture. There are various sources of errors that can paralyze real-world computing systems. Of particular interest to this dissertation are single-event errors. They can be the results of energetic particle strike or abrupt power outage that corrupts the program states leading to system failures. Specifically, energetic particle strike is the major cause of soft error while abrupt power outage can result in memory inconsistency in the nonvolatile memory systems. Unfortunately, existing techniques to handle those single-event errors are either resource consuming (e.g., hardware approaches) or heavy-weight (e.g., software approaches). To address this problem, this dissertation identifies idempotent processing as an alternative recovery technique to handle the system failures in an efficient and low-cost manner. Then, this dissertation first proposes to design and develop a compiler-directed lightweight methodology which leverages idempotent processing and the state-of-the-art sensor-based detection to achieve soft error resilience at low-cost. This dissertation also introduces a lightweight soft error tolerant hardware design that redefines idempotent processing where the idempotent regions can be created, verified and recovered from the processor's point of view. Furthermore, this dissertation proposes a series of compiler optimizations that significantly reduce the hardware and runtime overhead of the idempotent processing. Lastly, this dissertation proposes a failure-atomic system integrated with idempotent processing to resolve another type of single-event error, i.e., failure-induced memory inconsistency in the nonvolatile memory systems. / Ph. D. / Our computing systems are vulnerable to different kinds of errors. All these errors can potentially crash real-world computing systems. This dissertation specifically addresses the challenges of single-event errors. Single-event errors can be caused by energetic particle strikes or abrupt power outage that can corrupt the program states leading to system failures. Unfortunately, existing techniques to handle those single-event errors are expensive in terms of hardware/software. To address this problem, this dissertation leverages an interesting property called idempotence in the program. A region of code is idempotent if and only if it always generates the same output whenever the program jumps back to the region entry from any execution point within the region. Thus, we can leverage the idempotent property as a low-cost recovery technique to recover the system failures by jumping back to the beginning of the region where the errors occur. This dissertation proposes solutions to incorporate the idempotent property for resilience against those single-event errors. Furthermore, this dissertation introduces a series of optimization techniques with compiler and hardware support to improve the efficiency and overheads for error resilience. We believe that our proposed techniques in this dissertation can inspire researchers for future error resilience research. Reliability Compiler Optimization Computer Architecture
2	An empirical study of the influence of compiler optimizations on symbolic execution Dong, Shiyu 18 September 2014 (has links) Compiler optimizations in the context of traditional program execution is a well-studied research area, and modern compilers typically offer a suite of optimization options. This thesis reports the first study (to our knowledge) on how standard compiler optimizations influence symbolic execution. We study 33 optimization flags of the LLVM compiler infrastructure, which are used by the KLEE symbolic execution engine. Specifically, we study (1) how different optimizations influence the performance of KLEE for Unix Coreutils, (2) how the influence varies across two different program classes, and (3) how the influence varies across three different back-end constraint solvers. Some of our findings surprised us. For example, KLEE's setting for applying the 33 optimizations in a pre-defined order provides sub-optimal performance for a majority of the Coreutils when using the basic depth-first search; moreover, in our experimental setup, applying no optimization performs better for many of the Coreutils. / text Compiler optimization Symbolic execution KLEE LLVM
3	Unconventional Applications of Compiler Analysis Selby, Jason W. A. January 2011 (has links) Previously, compiler transformations have primarily focused on minimizing program execution time. This thesis explores some examples of applying compiler technology outside of its original scope. Specifically, we apply compiler analysis to the field of software maintenance and evolution by examining the use of global data throughout the lifetimes of many open source projects. Also, we investigate the effects of compiler optimizations on the power consumption of small battery powered devices. Finally, in an area closer to traditional compiler research we examine automatic program parallelization in the form of thread-level speculation. Compiler Analysis Compiler Optimization Computer Science
4	Unconventional Applications of Compiler Analysis Selby, Jason W. A. January 2011 (has links) Previously, compiler transformations have primarily focused on minimizing program execution time. This thesis explores some examples of applying compiler technology outside of its original scope. Specifically, we apply compiler analysis to the field of software maintenance and evolution by examining the use of global data throughout the lifetimes of many open source projects. Also, we investigate the effects of compiler optimizations on the power consumption of small battery powered devices. Finally, in an area closer to traditional compiler research we examine automatic program parallelization in the form of thread-level speculation. Compiler Analysis Compiler Optimization Computer Science
5	Impacts of Compiler Optimizations on Address Bus Energy: An Empirical Study TOMIYAMA, Hiroyuki 01 October 2004 (has links) No description available. bus encoding low energy embedded systems compiler optimization
6	Constraint Programming Techniques for Optimal Instruction Scheduling Malik, Abid 03 1900 (has links) Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Basic blocks and superblocks are commonly used regions of code in a program for instruction scheduling. Instruction scheduling coupled with register allocation is also a well studied problem to produce better machine code. Scheduling basic blocks and superblocks optimally with or with out register allocation is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this thesis, I present a constraint programming approach to the superblock and basic block instruction scheduling problems for both idealized and realistic architectures. Basic block scheduling with register allocation with no spilling allowed is also considered. My models for both basic block and superblock scheduling are optimal and fast enough to be incorporated into production compilers. I experimentally evaluated my optimal schedulers on the SPEC 2000 integer and floating point benchmarks. On this benchmark suite, the optimal schedulers were very robust and scaled to the largest basic blocks and superblocks. Depending on the architectural model, between 99.991\% to 99.999\% of all basic blocks and superblocks were solved to optimality. The schedulers were able to routinely solve the largest blocks, including blocks with up to 2600 instructions. My results compare favorably to the best previous optimal approaches, which are based on integer programming and enumeration. My approach for basic block scheduling without allowing spilling was good enough to solve 97.496\% of all basic blocks in the SPEC 2000 benchmark. The approach was able to solve basic blocks as large as 50 instructions for both idealized and realistic architectures within reasonable time limits. Again, my results compare favorably to recent work on optimal integrated code generation, which is based on integer programming. Constraint Programming Compiler Optimization Instruction Scheduling Register Allocation Computer Science
7	Constraint Programming Techniques for Optimal Instruction Scheduling Malik, Abid 03 1900 (has links) Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Basic blocks and superblocks are commonly used regions of code in a program for instruction scheduling. Instruction scheduling coupled with register allocation is also a well studied problem to produce better machine code. Scheduling basic blocks and superblocks optimally with or with out register allocation is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this thesis, I present a constraint programming approach to the superblock and basic block instruction scheduling problems for both idealized and realistic architectures. Basic block scheduling with register allocation with no spilling allowed is also considered. My models for both basic block and superblock scheduling are optimal and fast enough to be incorporated into production compilers. I experimentally evaluated my optimal schedulers on the SPEC 2000 integer and floating point benchmarks. On this benchmark suite, the optimal schedulers were very robust and scaled to the largest basic blocks and superblocks. Depending on the architectural model, between 99.991\% to 99.999\% of all basic blocks and superblocks were solved to optimality. The schedulers were able to routinely solve the largest blocks, including blocks with up to 2600 instructions. My results compare favorably to the best previous optimal approaches, which are based on integer programming and enumeration. My approach for basic block scheduling without allowing spilling was good enough to solve 97.496\% of all basic blocks in the SPEC 2000 benchmark. The approach was able to solve basic blocks as large as 50 instructions for both idealized and realistic architectures within reasonable time limits. Again, my results compare favorably to recent work on optimal integrated code generation, which is based on integer programming. Constraint Programming Compiler Optimization Instruction Scheduling Register Allocation Computer Science
8	An Environment for Automatic Generation of Code Optimizers Paleri, Vineeth Kumar 07 1900 (has links) Code optimization or code transformation is a complex function of a compiler involving analyses and modifications with the entire program as its scope. In spite of its complexity, hardly any tools exist to support this function of the compiler. This thesis presents the development of a code transformation system, specifically for scalar transformations, which can be used either as a tool to assist the generation of code transformers or as an environment for experimentation with code transformations. The development of the code transformation system involves the formal specification of code transformations using dependence relations. We have written formal specifications for the whole class of traditional scalar transformations, including induction variable elimination - a complex transformation - for which no formal specifications are available in the literature. All transformations considered in this thesis are global. Most of the specifications given here, for which specifications are already available in the literature, are improved versions, in terms of conservativeness.The study of algorithms for code transformations, in the context of their formal specification, lead us to the development of a new algorithm for partial redundancy elimination. The basic idea behind the algorithm is the new concepts of safe partial availability and safe partial anticipability. Our algorithm is computationally and lifetime optimal. It works on flow graphs whose nodes are basic blocks, which makes it practical.In comparison with existing algorithms the new algorithm also requires four unidirectional analyses, but saves some preprocessing time. The main advantage of the algorithm is its conceptual simplicity. The code transformation system provides an environment in which one can specify a transformation using dependence relations (in the specification language we have designed), generate code for a transformer from its specification,and experiment with the generated transformers on real-world programs. The system takes a program to be transformed, in C or FORTRAN, as input,translates it into intermediate code, interacts with the user to decide the transformation to be performed, computes the necessary dependence relations using the dependence analyzer, applies the specified transformer on the intermediate code, and converts the transformed intermediate code back to high-level. The system is unique of its kind,providing a complete environment for the generation of code transformers, and allowing experimentations with them using real-world programs. Computer and Information Science Compiler optimization Automatic Generation Code transformation
9	Multilevel tiling for non-rectangular interation spaces Jiménez Castells, Marta 28 May 1999 (has links) La motivación principal de esta tesis es el desarrollo de nuevas técnicas de compilación dirigidas a conseguir mayor rendimiento encódigos numéricos complejos que definen es pacios de iteraciones no rectangulares. En particular, nos centramos en la trasformación de "loop tiling" (también conocida como "blocking") y nuestro propósito es mejorar la transformación de loop tiling cuando se aplica a códigos numéricos complejos. Nuestro objetivo es conseguir, a través de la transformación de loop tiling, los mismos o mejores rendimientos que las librerías numéricas proporcionadas por el fabricante que están optimizadas manualmente.En la tesis se muestra que la razón principal por la que los compiladores comerciales actuales consiguen bajos rendimiento en este tipo de aplicaciones es que no son capaces de aplicar loop tiling a nivel de registros. En su lugar, para mejorar la localidad de los datos y el ILP, los compiladores actuales usan y combinan otras transformaciones que no explotan el nivel de registros tan bien como loop tiling. Previamente no se ha considerado aplicar loop tiling a nivel de registro porque en códigos numéricos complejos no es trivial debido a la naturaleza irregular de los espacios de iteraciones. La primera contribución de esta tesis es un algoritmo general de loop tiling a nivel de registros que es aplicable a cualquier tipo de espacio de iteraciones y no sólo a los espacios rectangulares. Nuestro método incluye una heurística muy sencilla para decidir los parámetros de los cortes a nivel de registros. A primera vista parece que loop tiling a nivel de registros (a partir de ahora, register tiling) se tiene que aplicar de tal manera que el bucle que ofrece más reuso temporal de los datos no debe de ser partido. De esta manera maximizamos la reutilización de los registros y minimizamos el número total de load/stores ejecutados. Sin embargo, mostraremos que en espacios de iteraciones no rectangulares, si solamente tenemos en cuenta las direcciones del reuso y no la forma del espacio de iteraciones, los códigos pueden sufrir una degradación en rendimiento. Nuestra segunda contribución es la propuesta de una heurística muy sencilla que determina los parámetros del tiling a nivel de registros considerando no sólo el reuso temporal sino también la forma del espacio de iteraciones. Además, la heurística es suficientemente sencilla para que pueda ser implementada en un compilador comercial.Sin embargo, para conseguir rendimientos similares que códigos optimizados a mano, no es suficiente con aplicar loop tiling a nivel de registros. Con las arquitecturas de hoy en día que disponen de jerarquías de memoria complejas y múltiples procesadores, es necesario que el compilador aplique loop tiling en cuatro o más niveles (paralelismo, cache L2, cache L1 y registros) para conseguir altos rendimientos. Por lo tanto, en las arquitecturas actuales es crucial tener un algoritmo eficiente para aplicar loop tiling en varios niveles de la jerarquía de memoria (tiling multinivel). Además, como mostramos en esta tesis, la transformación de tiling multinivel siempre tendrá que incluir el nivel de registro porque este es el nivel de la jerarquía de memoria que ofrece mayor rendimiento cuando es tratado correctamente.Cuando tiling multinivel incluye el nivel de registros, es necesario que los límites de los bucles sean exactos y que no haya límites redundantes. La razón es que la complejidad y la cantidad de código que se genera con nuestra técnica de register tiling depende polinómicamente del número de límites de los bucles.Sin embargo, hasta ahora, el problema de calcular límites exactos y eliminar límites redundantes es que todas las técnicas conocidas son muy caras en términos de tiempo de compilación y, por ello, difícil de integrar en un compilador comercial. La tercera contribución de esta tesis es una nueva implementación de tiling multinivel que calcula límites exactos y es mucho menos costosa que técnicas tradicionales. Mostraremos que la complejidad de nuestra implementación es proporcional a la complejidad de aplicar una permutación de bucles en el código original (antes de aplicar loop tiling), mientras que las técnicas tradicionales tienen complejidades más altas. Además, nuestra implementación genera menos límites redundantes y permite eliminar los límites redundantes que quedan a menor coste. En conjunto, la eficiencia de nuestra implementación hace posible que pueda ser implementada dentro de un compilador comercial sin tener que preocuparse por los tiempos de compilación.La última parte de esta tesis está dedicada al estudio del rendimiento de tiling multinivel. Se muestran los efectos de tiling en los diferentes niveles de memoria y presentamos datos que comparan los beneficios de tiling a nivel de registros, tiling a nivel de cache y tiling a los dos niveles, cache y registros, simultáneamente. Finalmente, comparamos el rendimiento de códigos optimizados automáticamente con códigos optimizados manualmente (librerías numéricas que ofrecen los fabricantes) sobre dos arquitecturas diferentes (ALPHA 21164 and MIPS R10000) para concluir que actualmente la tecnología de los compiladores hace posible que códigos numéricos complejos consigan el mismo rendimiento que códigos optimizados manualmente. / The main motivation of this thesis is to develop new compilation techniques that address the lack of performance of complex numerical codes consisting of loop nests defining non-rectangular iteration spaces. Specifically, we focus on the loop tiling transformation (also known as blocking) and our purpose is the improvement of the loop tiling transformation when dealing with complex numerical codes. Our goal is to achieve via the loop tiling transformation the same or better performance as hand-optimized vendor-supplied numerical libraries. We will observe that the main reason why current commercial compilers perform poorly when dealing with this type of codes is that they do not apply tiling for the register level. Instead, to enhance locality at this level and to improve ILP, they use/combine other transformations that do not exploit the register level as well as loop tiling. Tiling for the register level has not generally been considered because, in complex numerical codes, it is far from being trivial due to the irregular nature of the iteration space. Our first contribution in this thesis will be a general compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes and not only simple rectangular shapes.Our method includes a very simple heuristic to make the tile decisions for the register level. At first sight, register tiling should be performed so that whichever loop carries the most temporal reuse is not tiled. This way, register reuse is maximized and the number of load/store instructions executed is minimized. However, we will show that, for complex loop nests, if we only consider reuse directions and do not take into account the iteration space shape, the tiled loop nest can suffer performance degradation. Our second contribution will be a proposal of a very simple heuristic to determine the tiling parameters for the register level, that considers not only temporal reuse, but also the iteration space shape. Moreover, the heuristic is simple enough to be suitable for automatic implementation by compilers.However, to be able to achieve similar performance to hand-optimized codes, it is not enough by tiling only for the register level. With today's architectures having complex memory hierarchies and multiple processors, it is quite common that the compiler has to perform tiling at four or more levels (parallelism, L2-cache, L1-cache and registers) in order to achieve high performance. Therefore, in today's architectures it is crucial to have an efficient algorithm that can perform multilevel tiling at multiple levels of the memory hierarchy. Moreover, as we will see in this thesis, multilevel tiling should always include the register level, as this is the memory hierarchy level that yields most performance when properly tiled.When multilevel tiling includes the register level, it is critical to compute exact loop bounds and to avoid the generation of redundant bounds. The reason is that the complexity and the amount of code generated by our register tiling technique both depend polynomially on the number of loop bounds. However, to date, the drawback of generating exact loop bounds and eliminating redundant bounds has been that all techniques known were extremely expensive in terms of compilation time and, thus, difficult to integrate in a production compiler. Our third contribution in this thesis will be a new implementation of multilevel tiling that computes exact loop bounds at a much lower complexity than traditional techniques. In fact, we will show that the complexity of our implementation is proportional to the complexity of performing a loop permutation in the original loop nest (before tiling), while traditional techniques have much larger complexities. Moreover, our implementation generates less redundant bounds in the multilevel tiled code and allows removing the remaining redundant bounds at a lower cost. Overall, the efficiency of our implementation makes it possible to integrate multilevel tiling including the register level in a production compiler without having to worry about compilation time.The last part of this thesis is dedicated to studying the performance of multilevel tiling. We will discuss the effects of tiling for different memory levels and present quantitative data comparing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Finally, we will compare automatically-optimized codes against hand-optimized vendor-supplied numerical libraries, on two different architectures (ALPHA 21164 and MIPS R10000), to conclude that compiler technology can make it possible for complex numerical codes to achieve the same performance as hand-optimized codes on modern microprocessors. register tiling memory hierarchy compiler optimization multilevel tiling
10	Automatic Tuning of Scientific Applications Qasem, Apan January 2007 (has links) Over the last several decades we have witnessed tremendous change in the landscape of computer architecture. New architectures have emerged at a rapid pace with computing capabilities that have often exceeded our expectations. However, the rapid rate of architectural innovations has also been a source of major concern for the high-performance computing community. Each new architecture or even a new model of a given architecture has brought with it new features that have added to the complexity of the target platform. As a result, it has become increasingly difficult to exploit the full potential of modern architectures for complex scientific applications. The gap between the theoretical peak and the actual achievable performance has increased with every step of architectural innovation. As multi-core platforms become more pervasive, this performance gap is likely to increase. To deal with the changing nature of computer architecture and its ever increasing complexity, application developers laboriously retarget code, by hand, which often costs many person-months even for a single application. To address this problem, we developed a software-based strategy that can automatically tune applications to different architectures to deliver portable high-performance. This dissertation describes our automatic tuning strategy. Our strategy combines architecture-aware cost models with heuristic search to find the most suitable optimization parameters for the target platform. The key contribution of this work is a novel strategy for pruning the search space of transformation parameters. By focusing on architecture-dependent model parameters instead of transformation parameters themselves, we show that we can dramatically reduce the size of the search space and yet still achieve most of the benefits of the best tuning possible with exhaustive search. We present an evaluation of our strategy on a set of scientific applications and kernels on several different platforms. The experimental results presented in this dissertation suggest that our approach can produce significant performance improvement on a range of architectures at a cost that is not overly demanding. loop fusion architecture-aware compiler optimization automatic tuning

Search results