121 |
Compiler Transformations For Improving The Performance Of Software Transactional Memory. Mannarswamy, Sandya S. 11 1900 (has links) (PDF)
Expressing synchronization using traditional lock-based primitives has been found to be both error-prone and restrictive. Hence there has been considerable research into scalable and programmer-friendly alternatives to lock-based synchronization. Atomic sections have been proposed as a programming idiom for expressing synchronization at a higher level of abstraction than locks.
One way of supporting atomic sections in software is to rely on an underlying Software Transactional Memory (STM) implementation. While STM promises a programming paradigm that is less error-prone and more programmer-friendly than traditional lock-based synchronization, it also needs to be competitive in performance in order to be adopted in mainstream software. Unfortunately, current STMs do not meet these performance goals and are known to incur excessive overheads.
Prior work by other researchers, together with our own performance analysis of STM applications, shows that conflicts and the resulting aborts are a major performance bottleneck for STM applications. Second, we find that supporting fine-grained optimistic concurrency can significantly affect the cache behavior of applications running on STM and hence can adversely affect STM performance. Our systematic quantitative analysis of the cache behavior of STM applications, as well as prior qualitative analyses of STM overheads, shows that cache overheads constitute another major performance bottleneck. Hence in this thesis, we focus on addressing these two major STM performance bottlenecks.
Current STM implementations are typically application-unaware: they do not analyze the application or use such knowledge to improve its performance on STM. Closer integration of transactions with programming languages opens up the possibility of using the compiler to analyze STM applications and of applying that knowledge to transform application code, improving its performance on STM automatically and transparently to the programmer. This motivated us to address the two major STM performance bottlenecks, namely poor cache performance and the performance penalty due to aborts, through compiler transformations.
In order to pinpoint the cache bottlenecks of STM, we perform a detailed experimental evaluation of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We propose a set of compiler transformations targeted to address the cache performance bottlenecks identified by our analysis. Next we turn our attention to compiler analysis and transformations targeted at reducing the performance overheads due to transactional aborts, effectively utilizing the compiler’s knowledge of the data access patterns of the application. Since not all applications are designed with optimistic concurrency in mind, real world applications typically contain certain atomic sections which are not amenable to STM’s optimistic concurrency control and hence suffer from excessive transactional abort overheads. We propose two compiler techniques for handling such atomic sections.
Another major cause of transactional conflicts leading to unnecessary aborts is the uniform-granularity access tracking scheme employed by STM implementations: using a single uniform tracking granularity leads to poor lock assignment by the STM. We propose techniques that use the compiler's knowledge of an application to improve this application-unaware lock assignment. Last, since transactional abort overheads adversely impact STM performance, we propose a compiler-based approach that reduces them by reconciling certain kinds of transactions instead of aborting them and performing a complete re-execution. We show that our combined set of compiler transformations is effective in improving the performance of a set of STAMP benchmarks, reducing execution time by 7.48% to 54.82%, aborts by 8.98% to 56.61%, and average D-cache miss latency by up to 33.51%.
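The version-based optimistic concurrency control whose aborts this thesis targets can be sketched in Python (a minimal illustration; the class and function names are ours, not the thesis's, and a real STM also buffers reads per transaction and acquires per-location locks at commit):

```python
class Ref:
    """A transactional memory location: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(txn_body, refs):
    """Run txn_body optimistically, validate read versions, retry on conflict.
    txn_body returns a {ref: new_value} write set computed from current values."""
    while True:
        read_log = {r: r.version for r in refs}  # versions observed at start
        writes = txn_body()                      # tentative (buffered) writes
        # Validation: if any location changed since we read it, abort and
        # re-execute -- exactly the overhead the thesis attacks.
        if all(r.version == v for r, v in read_log.items()):
            for r, new_value in writes.items():  # commit phase
                r.value = new_value
                r.version += 1
            return

# Usage: an atomic transfer between two accounts.
a, b = Ref(100), Ref(50)
atomically(lambda: {a: a.value - 30, b: b.value + 30}, [a, b])
```

Under contention, another thread bumping a version between the read and the validation forces a full re-execution, which is why reconciling instead of aborting pays off.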
|
122 |
Adaptation automatique et semi-automatique des optimisations de programmes / Automatic and Semi-Automatic Adaptation of Program Optimizations. Bagnères, Lénaïc. 30 September 2016 (has links)
Compilers usually offer a good trade-off between productivity and single-thread performance thanks to a wide range of automatic optimizations. However, they remain fragile when addressing the computation-intensive parts of applications on the parallel architectures with deep memory hierarchies that are now omnipresent, and the shift to multicore desktop and embedded systems, along with the emergence of cloud computing, raises the problem of the execution context's impact on performance: many parameters are unknown at compile time and become available only at run time. Building on polyhedral compilation techniques, this thesis proposes two complementary contributions. Firstly, we present a static-dynamic compiler optimization technique that generates loop-based programs with dynamic auto-tuning capabilities at very low overhead. Our strategy introduces switchable scheduling, a family of program transformations that allows switching between optimized versions while always performing useful computation: the compiler explores many versions and selects the most relevant ones at compile time, and the version to execute is chosen dynamically. We present both the technique to generate such self-adaptive programs and experimental evidence of their ability to sustain high performance in a dynamic environment. Secondly, we propose a novel approach that aims at opening up the polyhedral compilation engines powering the loop-level optimization and parallelization frameworks of several modern compilers. Building on the state-of-the-art polyhedral representation of programs, we present ways to translate comprehensible syntactic transformation sequences to and from the internal polyhedral abstractions: any transformation affecting the scheduling of loop iterations is rendered as an equivalent sequence of syntactic transformations that programmers can understand and modify. This new way to interact with high-level optimization frameworks gives programmers invaluable feedback and the ability to design, replay, or refine loop-level compiler optimizations, opening up one of the compiler's black boxes: the polyhedral compilation platform.
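What a schedule-changing transformation looks like at the syntactic level can be sketched with loop interchange, one of the simplest polyhedral schedule changes (an illustration of the general idea, not code from the thesis). The optimizer changes only the order in which iterations run, so both versions compute the same result:

```python
def compute_ij(src, n):
    """Original schedule: i outermost, j innermost."""
    dst = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dst[i][j] = src[i][j] * 2 + i + j
    return dst

def compute_ji(src, n):
    """Interchanged schedule: j outermost. Legal here because no iteration
    depends on another -- the iteration *order* changes, the iteration *set*
    and the values computed do not."""
    dst = [[0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            dst[i][j] = src[i][j] * 2 + i + j
    return dst
```

Switchable scheduling generalizes this: the generated program can jump between such semantically equivalent versions at run time while continuing to do useful work.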
|
123 |
Modifying Instruction Sets In The Gem5 Simulator To Support Fault Tolerant Designs. Zhang, Chuan. 23 November 2015 (has links)
Traditional fault-tolerant techniques such as hardware or time redundancy incur high overhead and are inefficient for checking arithmetic operations. Our objective is to study an alternative approach: adding new instructions that check arithmetic operations. These checking instructions either rely on an error-detecting code or calculate approximate results, and consequently consume much less execution time. To evaluate the effectiveness of this approach, we wish to modify several benchmarks to use checking instructions and run simulation experiments to determine their execution time and memory usage. However, the checking instructions are not part of the instruction set and are therefore not supported by current architecture simulators. Hence, another objective of this thesis is to develop a method for inserting new instructions into the Gem5 simulator and cross-compiler. The insertion process is integrated into a software tool called Gtool, which can add an error-checking capability to C programs using the new instructions.
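One classic low-cost check of the error-detecting-code flavor described above is a residue code: recompute the result modulo a small check base and compare (a sketch of the general technique; the thesis's actual checking instructions may use a different code):

```python
def checked_add(a, b, m=3):
    """Add a and b, then verify the sum with a mod-m residue code.
    The check costs a few cheap operations, unlike full duplication.
    Returns (sum, ok); errors that are multiples of m go undetected."""
    s = a + b                              # result from the (possibly faulty) ALU
    ok = s % m == (a % m + b % m) % m      # independent residue check
    return s, ok

def checked_add_faulty(a, b, m=3):
    """Same check with an injected single-bit error, to show detection."""
    s = (a + b) ^ 1                        # flip the low bit of the result
    ok = s % m == (a % m + b % m) % m
    return s, ok
```

In hardware, the residue computation runs in parallel with the main adder, which is why such checking instructions cost far less than executing the operation twice.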
|
124 |
Compacting Loads and Stores for Code Size Reduction. Asay, Isaac. 01 March 2014 (has links)
It is important for compilers to generate executable code that is as small as possible, particularly when generating code for embedded systems. One method of reducing code size is to use instruction set architectures (ISAs) that support combining multiple operations into single operations. The ARM ISA allows for combining multiple memory operations to contiguous memory addresses into a single operation. The LLVM compiler contains a specific memory optimization to perform this combining of memory operations, called ARMLoadStoreOpt. This optimization, however, relies on another optimization (ARMPreAllocLoadStoreOpt) to move eligible memory operations into proximity in order to perform properly. This mover optimization occurs before register allocation, while ARMLoadStoreOpt occurs after register allocation. This thesis implements a similar mover optimization (called MagnetPass) after register allocation is performed, and compares this implementation with the existing optimization. While in most cases the two optimizations provide comparable results, our implementation in its current state requires some improvements before it will be a viable alternative to the existing optimization. Specifically, the algorithm will need to be modified to reduce computational complexity, and our implementation will need to take care not to interfere with other LLVM optimizations.
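The combining that ARMLoadStoreOpt performs can be sketched as a peephole pass over a toy instruction list (the data shapes and names here are illustrative, not LLVM's internal representation):

```python
def compact_loads(insns, word=4):
    """Merge runs of loads from contiguous word addresses into one
    multi-word load, mimicking ARM's LDM/LDRD combining. An instruction is
    ('load', dest_reg, addr) or ('other', None, None); the merged form is
    ('load_multi', [regs], start_addr)."""
    out, i = [], 0
    while i < len(insns):
        op, reg, addr = insns[i]
        if op == 'load':
            regs, j = [reg], i + 1
            # Extend the run while the next load targets the next word.
            while (j < len(insns) and insns[j][0] == 'load'
                   and insns[j][2] == addr + word * len(regs)):
                regs.append(insns[j][1])
                j += 1
            out.append(('load_multi', regs, addr) if len(regs) > 1
                       else insns[i])
            i = j
        else:
            out.append(insns[i])
            i += 1
    return out
```

The sketch assumes eligible loads are already adjacent; the mover optimizations discussed in the thesis (ARMPreAllocLoadStoreOpt, MagnetPass) exist precisely to create that adjacency, and a post-register-allocation mover must additionally respect the registers already assigned.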
|
125 |
Port OS Linux pro signalový procesor DaVinci / Linux port for DaVinci DSP family. Šujak, Marek. January 2009 (has links)
The main subject of this thesis is the implementation of an operating system based on the Linux kernel. The thesis briefly describes the ARM architecture and its programming model. It then explains the initialization of the most significant peripherals, including serial interfaces and memory controllers, and describes the booting process from NAND memory, NOR memory, and the UART interface. Finally, it presents the compilation procedures necessary for building U-Boot and the kernel.
|
126 |
Překladač podmnožiny jazyka Python / A Compiler of Language Python Subset. Falhar, Radek. January 2014 (has links)
Python is a dynamically typed, interpreted programming language. Because of its dynamic type system, it is difficult to compile into statically typed source code, i.e. code in which the existing types and their structure are specified exactly. Multiple approaches exist to achieve this, and one of the primary ones is type inference, which attempts to infer the type structure from the source code. For Python, this approach is difficult because the resulting type system is quite complex and the language itself is not designed for type inference. In this work, I have focused on identifying a subset of the language for which type inference is possible while keeping the natural way the language is used. I then implemented a compiler that compiles this subset into a statically typed language, which can in turn be translated into native code.
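The flavor of type inference involved can be sketched on a tiny expression language (a hedged illustration only; the thesis's actual subset and typing rules are considerably richer):

```python
def infer(expr, env):
    """Infer the static type of a tiny expression: an int literal, a
    variable ('var', name) looked up in env, or an addition ('+', l, r).
    Raises TypeError for programs the static subset cannot type."""
    if isinstance(expr, int):
        return 'int'
    if isinstance(expr, tuple) and expr[0] == 'var':
        return env[expr[1]]                # type recorded at the binding site
    if isinstance(expr, tuple) and expr[0] == '+':
        lt, rt = infer(expr[1], env), infer(expr[2], env)
        if lt == rt:                       # '+' keeps int+int or str+str, as Python does
            return lt
        raise TypeError(f'cannot add {lt} and {rt}')
    raise TypeError(f'outside the typed subset: {expr!r}')
```

Restricting the subset is what makes such judgments decidable: once dynamic features like arbitrary attribute injection are excluded, every expression gets exactly one inferable type, which is what a statically typed backend requires.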
|
127 |
Compilation and Generation of Multi-Processor on a Chip Real-Time Embedded Systems. Klingler, Randall S. 10 July 2007 (has links) (PDF)
Current FPGA technology has advanced to the point that useful embedded System-on-Programmable-Chip (SoPC) designs are now practical. The Real Time Processor (RTP) project leverages these advances with a system architecture that is customizable to specific real-time applications. The design and implementation of the framework for architecting such a system from ANSI-C code is presented. The Small Device C Compiler (SDCC) was retargeted to the RTP architecture and extended to produce a generator directive file. The RTPGen hardware generator was created to consume the directive file and produce a highly customized top-level structural VHDL file that can be synthesized and programmed onto an FPGA such as the Xilinx Spartan-3. Thus, an application-specific multiprocessor real-time embedded system is realized from ANSI-C code.
|
128 |
SRAM Compiler for Automated Memory Layout Supporting Multiple Transistor Process Technologies. Hilgers, Brandon. 01 July 2015 (has links) (PDF)
This research details the design of an SRAM compiler for quickly creating SRAM blocks for Cal Poly integrated circuit (IC) designs. The compiler generates memory for two process technologies (IBM 180nm cmrf7sf and ON Semiconductor 600nm SCMOS) and requires a minimum number of specifications from the user for ease of use, while still offering the option to customize the performance for speed or area of the generated SRAM cell. By automatically creating SRAM arrays, the compiler saves the user time from having to layout and test memory and allows for quick updates and changes to a design. Memory compilers with various features already exist, but they have several disadvantages. Most memory compilers are expensive, usually only generate memory for one process technology, and don’t allow for user-defined custom SRAM cell optimizations. This free design makes it available for students and institutions that would not be able to afford an industry-made compiler. A compiler that offers multiple process technologies allows for more freedom to design in other processes if needed or desired. An attempt was made for this design to be modular for different process technologies so new processes could be added with ease; however, different process technologies have different DRC rules, making that option very difficult to attain. A customizable SRAM cell based on transistor sizing ratios allows for optimized designs in speed, area, or power, and for academic research. Even for an experienced designer, the layout of a single SRAM cell (1 bit) can take an hour. This command-line-based tool can draw a 1Kb SRAM block in seconds and a 1Mb SRAM block in about 15 minutes. In addition, this compiler also adds a manually laid out precharge circuit to each of the SRAM columns for an enhanced read operation by ensuring the bit lines have valid logic output values. Finally, an analysis on SRAM cell stability is done for creating a robust cell as the default design for the compiler. 
The default cell design is verified for stability during read and write operations, and has an area of 14.067 µm2 for the cmrf7sf process and 246.42 µm2 for the SCMOS process. All factors considered, this SRAM compiler design overcomes several of the drawbacks of other existing memory compilers.
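For a rough sense of scale, the reported per-cell areas multiply out to lower-bound array areas (this back-of-the-envelope estimate is ours and deliberately ignores the decoders, sense amplifiers, and precharge columns the compiler also places):

```python
def array_area_um2(cell_area_um2, bits):
    """Lower-bound SRAM array area: number of bits times per-cell area.
    Peripheral circuitry (decoders, sense amps, precharge) is excluded."""
    return cell_area_um2 * bits

KIB = 1024  # the 1 Kb block size mentioned in the text
area_180nm = array_area_um2(14.067, KIB)   # cmrf7sf 180 nm cell
area_600nm = array_area_um2(246.42, KIB)   # SCMOS 600 nm cell
```

The roughly 17x area gap between the two processes follows directly from the cell areas, which is why offering both technologies in one compiler is useful for area-constrained academic designs.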
|
129 |
An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs. Liu, Lifeng. 31 May 2017 (has links)
No description available.
|
130 |
A Just in Time Register Allocation and Code Optimization Framework for Embedded Systems. Thammanur, Sathyanarayan. 11 October 2001 (has links)
No description available.
|