291 |
Runtime specialization for heterogeneous CPU-GPU platformsFarooqui, Naila 27 May 2016 (has links)
Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drives the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
|
292 |
On the fly type specialization without type analysisChevalier-Boisvert, Maxime 12 1900 (has links)
Les langages de programmation typés dynamiquement tels que JavaScript et Python repoussent la vérification de typage jusqu’au moment de l’exécution. Afin d’optimiser la performance de ces langages, les implémentations de machines virtuelles pour langages dynamiques doivent tenter d’éliminer les tests de typage dynamiques redondants. Cela se fait habituellement en utilisant une analyse d’inférence de types. Cependant, les analyses de ce genre sont souvent coûteuses et impliquent des compromis entre le temps de compilation et la précision des résultats obtenus. Ceci a conduit à la conception d’architectures de VM de plus en plus complexes.
Nous proposons le versionnement paresseux de blocs de base, une technique de compilation à la volée simple qui élimine efficacement les tests de typage dynamiques redondants sur les chemins d’exécution critiques. Cette nouvelle approche génère paresseusement des versions spécialisées des blocs de base tout en propageant de l’information de typage contextualisée. Notre technique ne nécessite pas l’utilisation d’analyses de programme coûteuses, n’est pas contrainte par les limitations de précision des analyses d’inférence de types traditionnelles et évite la complexité des techniques d’optimisation spéculatives.
Trois extensions sont apportées au versionnement de blocs de base afin de lui donner des capacités d’optimisation interprocédurale. Une première extension lui donne la possibilité de joindre des informations de typage aux propriétés des objets et aux variables globales. Puis, la spécialisation de points d’entrée lui permet de passer de l’information de typage des fonctions appellantes aux fonctions appellées. Finalement, la spécialisation des continuations d’appels permet de transmettre le type des valeurs de retour des fonctions appellées aux appellants sans coût dynamique. Nous démontrons empiriquement que ces extensions permettent au versionnement de blocs de base d’éliminer plus de tests de typage dynamiques que toute analyse d’inférence de typage statique. / Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual
machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses
are often costly and involve tradeoffs between compilation time and resulting precision. This has lead to the creation of increasingly complex multi-tiered VM architectures.
We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This
novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. This does not require the use of costly
program analyses, is not restricted by the precision limitations of traditional type analyses and avoids the implementation complexity of speculative optimization techniques.
Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to
attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation
specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable
basic block versioning to exceed the capabilities of static whole-program type analyses.
|
293 |
Compilation efficace d'applications de traitement d'images pour processeurs manycore / Efficient Compilation of Image Processing Applications for Manycore ProcessorsGuillou, Pierre 30 November 2016 (has links)
Nous assistons à une explosion du nombre d’appareils mobiles équipés de capteurs optiques : smartphones, tablettes, drones... préfigurent un Internet des objets imminent. De nouvelles applications de traitement d’images (filtres, compression, réalité augmentée) exploitent ces capteurs mais doivent répondre à des contraintes fortes de vitesse et d’efficacité énergétique. Les architectures modernes — processeurs manycore, GPUs,... — offrent un potentiel de performance, avec cependant une hausse sensible de la complexité de programmation.L’ambition de cette thèse est de vérifier l’adéquation entre le domaine du traitement d’images et ces architectures modernes : concilier programmabilité, portabilité et performance reste encore aujourd’hui un défi. Le domaine du traitement d’images présente un fort parallélisme intrinsèque, qui peut potentiellement être exploité par les différents niveaux de parallélisme offerts par les architectures actuelles. Nous nous focalisons ici sur le domaine du traitement d’images par morphologie mathématique, et validons notre approche avec l’architecture manycore du processeur MPPA de la société Kalray.Nous prouvons d’abord la faisabilité de chaînes de compilation intégrées, composées de compilateurs, bibliothèques et d’environnements d’exécution, qui à partir de langages de haut niveau tirent parti de différents accélérateurs matériels. Nous nous concentrons plus particulièrement sur les processeurs manycore, suivant les différents modèles de programmation : OpenMP ; langage flot de données ; OpenCL ; passage de messages. Trois chaînes de compilation sur quatre ont été réalisées, et sont accessibles à des applications écrites dans des langages spécifiques au domaine du traitement d’images intégrés à Python ou C. Elles améliorent grandement la portabilité de ces applications, désormais exécutables sur un plus large panel d’architectures cibles.Ces chaînes de compilation nous ont ensuite permis de réaliser des expériences comparatives sur un jeu de sept applications de traitement d’images. Nous montrons que le processeur MPPA est en moyenne plus efficace énergétiquement qu’un ensemble d’accélérateurs matériels concurrents, et ceci particulièrement avec le modèle de programmation flot de données. Nous montrons que la compilation d’un langage spécifique intégré à Python vers un langage spécifique intégré à C permet d’augmenter la portabilité et d’améliorer les performances des applications écrites en Python.Nos chaînes de compilation forment enfin un environnement logiciel complet dédié au développement d’applications de traitement d’images par morphologie mathématique, capable de cibler efficacement différentes architectures matérielles, dont le processeur MPPA, et proposant des interfaces dans des langages de haut niveau. / Many mobile devices now integrate optic sensors; smartphones, tablets, drones... are foreshadowing an impending Internet of Things (IoT). New image processing applications (filters, compression, augmented reality) are taking advantage of these sensors under strong constraints of speed and energy efficiency. Modern architectures, such as manycore processors or GPUs, offer good performance, but are hard to program.This thesis aims at checking the adequacy between the image processing domain and these modern architectures: conciliating programmability, portability and performance is still a challenge today. Typical image processing applications feature strong, inherent parallelism, which can potentially be exploited by the various levels of hardware parallelism inside current architectures. We focus here on image processing based on mathematical morphology, and validate our approach using the manycore architecture of the Kalray MPPA processor.We first prove that integrated compilation chains, composed of compilers, libraries and run-time systems, allow to take advantage of various hardware accelerators from high-level languages. We especially focus on manycore processors, through various programming models: OpenMP, data-flow language, OpenCL, and message passing. Three out of four compilation chains have been developed, and are available to applications written in domain-specific languages (DSL) embedded in C or Python. They greatly improve the portability of applications, which can now be executed on a large panel of target architectures.Then, these compilation chains have allowed us to perform comparative experiments on a set of seven image processing applications. We show that the MPPA processor is on average more energy-efficient than competing hardware accelerators, especially with the data-flow programming model. We show that compiling a DSL embedded in Python to a DSL embedded in C increases both the portability and the performance of Python-written applications.Thus, our compilation chains form a complete software environment dedicated to image processing application development. This environment is able to efficiently target several hardware architectures, among them the MPPA processor, and offers interfaces in high-level languages.
|
294 |
Automatização do processo de seleção de transformações para otimização do tempo de execução por meio de aprendizado de máquina no arcabouço da LLVM. / Transformation selection process automation for execution time optimization through machine learning on LLVM framework.Sabaliauskas, Jorge Augusto 28 April 2015 (has links)
A rápida evolução do hardware demanda uma evolução contínua dos compiladores. Um processo de ajuste deve ser realizado pelos projetistas de compiladores para garantir que o código gerado pelo compilador mantenha uma determinada qualidade, seja em termos de tempo de processamento ou outra característica pré-definida. Este trabalho visou automatizar o processo de ajuste de compiladores por meio de técnicas de aprendizado de máquina. Como resultado os planos de compilação obtidos usando aprendizado de máquina com as características propostas produziram código para programas cujos valores para os tempos de execução se aproximaram daqueles seguindo o plano padrão utilizado pela LLVM. / The fast evolution of hardware demands a continue evolution of the compilers. Compiler designers must perform a tuning process to ensure that the code generated by the compiler maintain a certain quality, both in terms of processing time or another preset feature. This work aims to automate compiler adjustment process through machine learning techniques. As a result the compiler plans obtained using machine learning with the proposed features had produced code for programs whose values for the execution times approached those following the standard plan used by LLVM.
|
295 |
Pattern matching in compilers / Pattern matching in compilersBílka, Ondřej January 2012 (has links)
Title: Pattern matching in compilers Author: Ondřej Bílka Department: Department of Applied Mathematics Supervisor: Jan Hubička, Department of Applied Mathematics Abstract: In this thesis we develop tools for effective and flexible pattern matching. We introduce a new pattern matching system called amethyst. Amethyst is not only a generator of parsers of programming languages, but can also serve as an alternative to tools for matching regular expressions. Our framework also produces dynamic parsers. Its intended use is in the context of IDE (accurate syntax highlighting and error detection on the fly). Amethyst offers pattern matching of general data structures. This makes it a useful tool for implement- ing compiler optimizations such as constant folding, instruction scheduling, and dataflow analysis in general. The parsers produced are essentially top-down parsers. Linear time complexity is obtained by introducing the novel notion of structured grammars and reg- ularized regular expressions. Amethyst uses techniques known from compiler optimizations to produce effective parsers. Keywords: Packrat parsing, dynamic parsing, structured grammars, functional programming 1
|
296 |
Automatização do processo de seleção de transformações para otimização do tempo de execução por meio de aprendizado de máquina no arcabouço da LLVM. / Transformation selection process automation for execution time optimization through machine learning on LLVM framework.Jorge Augusto Sabaliauskas 28 April 2015 (has links)
A rápida evolução do hardware demanda uma evolução contínua dos compiladores. Um processo de ajuste deve ser realizado pelos projetistas de compiladores para garantir que o código gerado pelo compilador mantenha uma determinada qualidade, seja em termos de tempo de processamento ou outra característica pré-definida. Este trabalho visou automatizar o processo de ajuste de compiladores por meio de técnicas de aprendizado de máquina. Como resultado os planos de compilação obtidos usando aprendizado de máquina com as características propostas produziram código para programas cujos valores para os tempos de execução se aproximaram daqueles seguindo o plano padrão utilizado pela LLVM. / The fast evolution of hardware demands a continue evolution of the compilers. Compiler designers must perform a tuning process to ensure that the code generated by the compiler maintain a certain quality, both in terms of processing time or another preset feature. This work aims to automate compiler adjustment process through machine learning techniques. As a result the compiler plans obtained using machine learning with the proposed features had produced code for programs whose values for the execution times approached those following the standard plan used by LLVM.
|
297 |
Exekveringsmiljö för Plex-C på JVM / Run-time environment for Plex-C on JVMMöller, Johan January 2002 (has links)
<p>The Ericsson AXE-based systems are programmed using an internally developed language called Plex-C. Plex-C is normally compiled to execute on an Ericsson internal processor architecture. A transition to standard processors is currently in progress. This makes it interesting to examine if Plex-C can be compiled to execute on the JVM, which would make it processor independent. </p><p>The purpose of the thesis is to examine if parts of the run-time environment of Plex-C can be translated to Java and if this can be done so that sufficient performance is obtained. It includes how language constructions in Plex-C can be translated to Java. </p><p>The thesis describes how a limited part of the Plex-C run-time environment is implemented in Java. Optimizations are an important part of the implementation. </p><p>It is also described how the JVM system was tested with a benchmark test. </p><p>The test results indicate that the implemented system is a few times faster than the Ericsson internal processor architecture. But this performance is still not sufficient for the JVM system to be an interesting replacement for the currently used processor architecture. It might still be useful as a processor independent test platform.</p>
|
298 |
Heuristisk profilbaserad optimering av instruktionscache i en online Just-In-Time kompilator / Heuristic Online Profile Based Instruction Cache Optimisation in a Just-In-Time CompilerEng, Stefan January 2004 (has links)
<p>This master’s thesis examines the possibility to heuristically optimise instruction cache performance in a Just-In-Time (JIT) compiler. </p><p>Programs that do not fit inside the cache all at once may suffer from cache misses as a result of frequently executed code segments competing for the same cache lines. A new heuristic algorithm LHCPA was created to place frequently executed code segments to avoid cache conflicts between them, reducing the overall cache misses and reducing the performance bottlenecks. Set-associative caches are taken into consideration and not only direct mapped caches. </p><p>In Ahead-Of-Time compilers (AOT), the problem with frequent cache misses is often avoided by using call graphs derived from profiling and more or less complex algorithms to estimate the performance for different placements approaches. This often results in heavy computation during compilation which is not accepted in a JIT compiler. </p><p>A case study is presented on an Alpha processor and an at Ericsson developed JIT Compiler. The results of the case study shows that cache performance can be improved using this technique but also that a lot of other factors influence the result of the cache performance. Such examples are whether the cache is set-associative or not; and especially the size of the cache highly influence the cache performance.</p>
|
299 |
Compiler Optimizations for Multithreaded Multicore Network ProcessorsZhuang, Xiaotong 07 July 2006 (has links)
Network processors are new types of multithreaded multicore processors geared towards achieving both fast processing speed and flexibility of programming. The architecture of network processors considers many special properties for packet processing, including multiple threads, multiple processor cores on the same chip, special functional units, simplified ISA and simplified pipeline, etc. The architectural peculiarities of network processors raise new challenges for compiler design and optimization.
Due to very high clocking speeds, the CPU memory gap on such processors is huge, making registers extremely precious. Moreover, the register file is split into two banks, and for any ALU instruction, the two source operands must come from different banks. We present and compare three different approaches to do register allocation and bank assignment. We also address the problem of sharing registers across threads in order to maximize the utilization of hardware resources. The context switches on the IXP network processor only happen when long latency operations are encountered. As a result, context switches are highly frequent. Therefore, the designer of the IXP network processor decided to make context switches extremely lightweight, i.e. only the program counter(PC) is stored together with the context. Since registers are not saved and restored during context switches, it becomes difficult to share registers across threads. For a conventional processor, each thread can assume that it can use the entire register file, because registers are always part of the context. However, with lightweight context switch, each thread must take a separate piece of the register file, making register usage inefficient.
Programs executing on network processors typically have runtime constraints. Scheduling of multiple programs sharing a CPU must be orchestrated by the OS and the hardware using certain sharing policies. Real time applications demand a real time aware OS kernel to meet their specified deadlines. However, due to stringent performance requirements on network processors, neither OS nor hardware mechanisms is typically feasible. In this work, we demonstrate that a compiler approach could achieve some of the OS scheduling and real time scheduling functionalities without introducing a hefty overhead.
|
300 |
RADAR: compiler and architecture supported intrusion prevention, detection, analysis and recoveryZhang, Tao 25 August 2006 (has links)
In this dissertation, we propose RADAR - compileR and micro-Architecture supported intrusion prevention, Detection, Analysis and Recovery. RADAR is an infrastructure to help prevent, detect and even recover from attacks to critical software. Our approach emphasizes collaborations between compiler and micro-architecture to avoid the problems of purely software or hardware based approaches.
With hardware support for cryptographic operations, our infrastructure can achieve strong process isolation to prevent attacks from other processes and to prevent certain types of hardware attacks. Moreover, we show that an unprotected system address bus leaks critical control flow information of the protected software but has never been carefully addressed previously. To enhance intrusion prevention capability of our infrastructure further, we present a scheme with both innovative hardware modification and extensive compiler support to eliminate most of the information leakage on system address bus.
However, no security system is able to prevent all attacks. In general, we have to assume that certain attacks will get through our intrusion prevention mechanisms. To protect software from those attacks, we build a second line of defense consisted of intrusion detection and intrusion recovery mechanisms. Our intrusion detection mechanisms are based on anomaly detection. In this dissertation, we propose three anomaly detection schemes. We demonstrate the effectiveness of our anomaly detection schemes thus the great potential of what compiler and micro-architecture can do for software security.
The ability to recover from an attack is very important for systems providing critical services. Thus, intrusion recoverability is an important goal of our infrastructure. We focus on recovery of memory state in this dissertation, since most attacks break into a system by memory tampering. We propose two schemes for intrusion analysis. The execution logging based scheme incurs little performance overhead but has higher demand for storage and memory bandwidth. The external input points tagging based scheme is much more space and memory bandwidth efficient, but leads to significant performance degradation. After intrusion analysis is done and tampered memory state is identified, tampered memory state can be easily recovered through memory updates logging or memory state checkpointing.
|
Page generated in 0.0497 seconds