1 |
Speeding up dynamic compilation: concurrent and parallel dynamic compilation. Bohm, Igor. January 2013.
The main challenge faced by a dynamic compilation system is to detect and translate frequently executed program regions into highly efficient native code as fast as possible. To reduce dynamic compilation latency, a dynamic compilation system must improve its workload throughput, i.e. compile more application hotspots per unit of time. Because compilation time adds to overall execution time, the dynamic compiler is often decoupled from the main execution loop and runs in a separate thread to hide this overhead. This thesis proposes innovative techniques aimed at effectively speeding up dynamic compilation. The first contribution is a generalised region recording scheme optimised for program representations that require dynamic code discovery (e.g. binary program representations). The second contribution reduces dynamic compilation cost by incrementally compiling several hot regions in a concurrent and parallel task farm. Altogether, the combination of generalised light-weight code discovery, large translation units, dynamic work scheduling, and concurrent and parallel dynamic compilation ensures timely and efficient processing of compilation workloads. Compared to state-of-the-art dynamic compilation approaches, speedups of up to 2.08x are demonstrated on industry-standard benchmarks such as BioPerf, SPEC CPU 2006, and EEMBC.

Next, innovative applications of the proposed dynamic compilation scheme to speed up architectural and micro-architectural performance modelling are demonstrated. The main contribution in this context is to exploit runtime information to dynamically generate optimised code that accurately models architectural and micro-architectural components. Consequently, compilation units are larger and more complex, resulting in increased compilation latencies. Such large and complex compilation units present an ideal use case for our concurrent and parallel dynamic compilation infrastructure. We demonstrate that our novel micro-architectural performance modelling is faster than state-of-the-art FPGA-based simulation, whilst providing the same level of accuracy.
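To make the task-farm idea concrete, here is a minimal C++ sketch of decoupled, parallel compilation workers draining a shared queue of hot regions. It is our illustration, not code from the thesis: `Region`, `compileRegion`, and the pool size are hypothetical stand-ins for the real region representation and JIT back end.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Region { int id; };                 // stand-in for a hot code region

std::queue<Region> workQueue;              // hot regions awaiting compilation
std::mutex qMutex;
std::condition_variable qCv;
bool shuttingDown = false;

void compileRegion(const Region& r) {      // placeholder for the real JIT
    std::printf("compiled region %d\n", r.id);
}

void compilerWorker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(qMutex);
        qCv.wait(lock, [] { return !workQueue.empty() || shuttingDown; });
        if (workQueue.empty()) return;     // shutdown requested, queue drained
        Region r = workQueue.front();
        workQueue.pop();
        lock.unlock();                     // compile outside the lock
        compileRegion(r);
    }
}

int main() {
    std::vector<std::thread> farm;
    for (int i = 0; i < 4; ++i) farm.emplace_back(compilerWorker);

    // The main execution loop would enqueue hot regions as it finds them.
    for (int id = 0; id < 16; ++id) {
        { std::lock_guard<std::mutex> g(qMutex); workQueue.push(Region{id}); }
        qCv.notify_one();
    }

    { std::lock_guard<std::mutex> g(qMutex); shuttingDown = true; }
    qCv.notify_all();
    for (auto& t : farm) t.join();
}
```

Compiling outside the lock is what lets several hot regions be translated in parallel while the main execution loop keeps enqueueing new work.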
|
2 |
A Just in Time Register Allocation and Code Optimization Framework for Embedded Systems. Thammanur, Sathyanarayan. 11 October 2001.
No description available.
|
3 |
Multi-time code generation and multi-objective code optimisation. Lomüller, Victor. 12 November 2014.
Compilation is an essential step in creating efficient applications. It allows the use of high-level, target-independent languages while maintaining good performance. However, many obstacles prevent compilers from fully optimizing applications. For static compilers, the major obstacle is poor knowledge of the execution context, particularly of the architecture and the data being processed. This knowledge becomes available progressively over the application's life cycle. To exploit it, compilers have progressively integrated dynamic code generation techniques; however, those techniques usually focus on improving the use of hardware capabilities and take little account of the data. In this thesis, we investigate the use of data in the optimization process for applications running on Nvidia GPUs. We propose a method that uses different moments in the application life cycle to create adaptive libraries able to take data size into account. These libraries can then provide the compute kernels best adapted to the context. On the GEMM algorithm, the method yields gains of up to 100% while avoiding an explosion in code size. The thesis also examines the gains and costs of runtime code generation in terms of execution speed, memory footprint, and energy consumption. We propose and study two lightweight runtime code generation approaches that enable code specialization with low overhead. We show that both approaches achieve speed and energy gains comparable to, and sometimes better than, LLVM, but at a lower cost.
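As an illustration of the size-adaptive dispatch described above, the following C++ sketch picks among precompiled kernel variants by input size. The variant names and thresholds are invented for the example; in the thesis the variants would be GPU kernels generated at different moments of the application life cycle.

```cpp
#include <cstddef>
#include <cstdio>

// Three hypothetical GEMM variants specialized for different problem sizes.
// Plain stubs here; the real work would dispatch to specialized GPU kernels.
void gemm_small(std::size_t n)  { std::printf("small kernel,  n=%zu\n", n); }
void gemm_tiled(std::size_t n)  { std::printf("tiled kernel,  n=%zu\n", n); }
void gemm_stream(std::size_t n) { std::printf("stream kernel, n=%zu\n", n); }

// The adaptive library entry point: pick the variant whose size range
// matches the actual input, instead of shipping one kernel for all sizes.
void gemm(std::size_t n) {
    if (n < 128)       gemm_small(n);
    else if (n < 4096) gemm_tiled(n);
    else               gemm_stream(n);
}

int main() {
    gemm(64);
    gemm(1024);
    gemm(100000);
}
```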
|
4 |
Runtime specialization for heterogeneous CPU-GPU platforms. Farooqui, Naila. 27 May 2016.
Heterogeneous parallel architectures like those composed of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
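One way to picture the profile-driven resource management described here is a throughput-proportional split of work between devices. The C++ sketch below is a simplification under assumed numbers (the `Device` struct and throughput values are hypothetical); the actual runtime uses dynamic instrumentation and work stealing rather than a one-shot static split.

```cpp
#include <cstdio>

// Hypothetical per-device throughput, e.g. obtained from the kind of
// dynamic instrumentation the abstract describes (items per millisecond).
struct Device { const char* name; double throughput; };

int main() {
    Device cpu{"CPU", 120.0};
    Device gpu{"GPU", 480.0};
    long totalItems = 1'000'000;

    // Split work in proportion to observed throughput, so both devices
    // finish at roughly the same time (a crude form of load balance).
    double total = cpu.throughput + gpu.throughput;
    long cpuShare = static_cast<long>(totalItems * (cpu.throughput / total));
    long gpuShare = totalItems - cpuShare;

    std::printf("%s gets %ld items, %s gets %ld items\n",
                cpu.name, cpuShare, gpu.name, gpuShare);
}
```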
|
5 |
A model of dynamic compilation for heterogeneous compute platforms. Kerr, Andrew. 10 December 2012.
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software faces obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include:

- characterization of data-parallel workloads
- machine-independent application metrics
- a framework for performance modeling and prediction
- execution model translation for vector processors
- region-based compilation and scheduling

We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show that data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling that attempts to maximize control locality.
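As a rough illustration of execution-model translation, the following C++ sketch retargets a GPU-style (blockIdx, threadIdx) kernel to multicore CPU threads by mapping each thread block to one CPU thread and serializing the block's threads inside it. This is our simplified picture, not GPU Ocelot's actual translation scheme.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// A GPU-style kernel written against (blockIdx, threadIdx) coordinates.
void kernel(int blockIdx, int threadIdx, int blockDim, std::vector<int>& out) {
    int gid = blockIdx * blockDim + threadIdx;  // global thread id
    if (gid < static_cast<int>(out.size())) out[gid] = gid * 2;
}

int main() {
    const int blockDim = 64, gridDim = 4;
    std::vector<int> out(200, 0);

    // Map each GPU thread block to one CPU thread and serialize the
    // block's threads inside it -- one simple way to retarget the
    // data-parallel execution model to a multicore CPU.
    std::vector<std::thread> pool;
    for (int b = 0; b < gridDim; ++b)
        pool.emplace_back([&, b] {
            for (int t = 0; t < blockDim; ++t) kernel(b, t, blockDim, out);
        });
    for (auto& th : pool) th.join();

    std::printf("out[199] = %d\n", out[199]);   // expect 398
}
```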
|
6 |
Beyond the realm of the polyhedral model: combining speculative program parallelization with polyhedral compilation. Sukumaran Rajam, Aravind. 05 November 2015.
In this thesis, we present our contributions to APOLLO (Automatic speculative POLyhedral Loop Optimizer), an automated compiler combining Thread-Level Speculation (TLS) and the polyhedral model to optimize codes on the fly. By performing partial instrumentation at runtime and subjecting it to interpolation, Apollo is able to construct a speculative polyhedral model dynamically. The speculative model is then passed to Pluto, a static polyhedral scheduler. Apollo then selects one of the statically generated code optimization skeletons and instantiates it. The runtime continuously monitors the code for any dependence violation in a decentralized manner. Another important contribution of this thesis is our extension of the polyhedral model to codes exhibiting non-linear behavior. Thanks to the dynamic and speculative context offered by Apollo, non-linear behaviors are modeled either using linear regression hyperplanes forming tubes, or using ranges of reached values. Our approach enables the application of polyhedral transformations to non-linear codes thanks to a hybrid centralized-decentralized speculation verification system.
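The interpolation-plus-verification loop can be sketched in C++ as follows: fit a linear model of the memory access pattern from a few instrumented iterations, then check the prediction on every speculative iteration and fall back on a violation. The access function and sizes are invented for the example; Apollo's real mechanism builds regression hyperplanes and instantiates polyhedral code skeletons.

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1000);
    auto index = [](long i) { return 3 * i + 7; };  // access pattern under test

    // "Partial instrumentation": observe two early iterations and
    // interpolate a linear model idx = a*i + b (heavily simplified).
    long i0 = 0, i1 = 1;
    long a = index(i1) - index(i0);
    long b = index(i0);

    bool violated = false;
    for (long i = 0; i < 300; ++i) {
        long predicted = a * i + b;
        if (predicted != index(i)) { violated = true; break; }  // mispredict
        data[predicted] += 1.0;    // safe to execute speculatively
    }
    std::puts(violated ? "rollback to sequential code"
                       : "speculation held for all iterations");
}
```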
|
7 |
An Adaptive Recompilation Framework For Rotor And Architectural Support For Online Program Instrumentation. Vaswani, Kapil. 08 1900.
Microsoft Research

Although runtime systems and the dynamic compilation model have revolutionized the process of application development and deployment, the associated performance overheads continue to be a cause for concern and much research. In the first part of this thesis, we describe the design and implementation of an adaptive recompilation framework for Rotor, a shared-source implementation of the Common Language Infrastructure (CLI), that can increase program performance through intelligent recompilation decisions and optimizations based on the program's past behavior. Our extensions to Rotor include a low-overhead runtime-stack-based sampling profiler that identifies program hotspots. A recompilation controller oversees the recompilation process and generates recompilation requests. At the first level of a multi-level optimizing compiler, code in the intermediate language is converted to an internal intermediate representation and optimized using a set of simple transformations. The compiler uses a fast yet effective linear scan algorithm for register allocation. Hot methods can be instrumented in order to collect basic-block, edge, and call-graph profile information. Profile-guided optimizations driven by online profile information are used to further optimize heavily executed methods at the second level of recompilation.

An evaluation of the framework using a set of test programs shows that performance can improve by a maximum of 42.3% and by 9% on average. Our results also show that the overheads of collecting accurate profile information through instrumentation to some extent outweigh the benefits of profile-guided optimizations in our implementation, suggesting the need for techniques that can reduce such overheads. A flexible and extensible framework design implies that additional profiling and optimization techniques can be easily incorporated to further improve performance.
As previously stated, fine-grained and accurate profile information must be available at low cost for advanced profile-guided optimizations to be effective in online environments. In the second part of this thesis, we propose a generic framework that makes it possible for instrumentation-based profilers to collect profile data efficiently, a task that has traditionally been associated with high overheads. The essence of the scheme is to make the underlying hardware aware of instrumentation using a special set of profile instructions and a tuned microarchitecture. This not only allows the hardware to provide the runtime with mechanisms to control the profiling activity, but also makes it possible for the hardware itself to optimize the process of profiling in a manner transparent to the runtime.
We propose selective instruction dispatch as one possible controlling mechanism that the runtime can use to manage the execution of profile instructions and keep profiling overheads in check. We also propose profile flag prediction, a hardware optimization that complements the selective dispatch mechanism by not fetching profile instructions when the runtime has turned profiling off. The framework is lightweight and flexible: it eliminates the need for expensive book-keeping, recompilation, or code duplication. Our simulations with benchmarks from the SPEC CPU2000 suite show that the overheads of call-graph and basic-block profiling can be reduced by 72.7% and 52.4% respectively, with a negligible loss in accuracy.
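For the software side of hotspot detection, a counter-based trigger is the classic mechanism; the C++ sketch below promotes a method to second-level recompilation once its sample count crosses a threshold. Class and method names and the threshold are illustrative, not taken from the thesis.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>

// Sketch of counter-based hotspot detection: once a method's sample count
// crosses a threshold, hand it to the optimizing (second-level) compiler.
struct Recompiler {
    std::unordered_map<std::string, int> samples;
    static constexpr int kHotThreshold = 10;

    void onSample(const std::string& method) {   // called by the profiler
        if (++samples[method] == kHotThreshold)
            std::printf("recompiling hot method: %s\n", method.c_str());
    }
};

int main() {
    Recompiler rc;
    for (int i = 0; i < 25; ++i) rc.onSample("Fib::compute");
    rc.onSample("Main::init");   // cold: never recompiled
}
```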
|