Global ETD Search

11	Integrated Software Pipelining Eriksson, Mattias January 2009 (has links) <p>In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation and spilling.</p><p>As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms.</p><p>We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge this is the first time anybody has optimally solved the modulo scheduling problem for clustered architectures with instruction selection and cluster assignment integrated.</p><p>We also show that optimal spilling is closely related to optimal register allocation when the register files are clustered. In fact, optimal spilling is as simple as adding an additional virtual register file representing the memory and have transfer instructions to and from this register file corresponding to stores and loads.</p><p>Our algorithm for modulo scheduling iteratively considers schedules with increasing number of schedule slots. A problem with such an iterative method is that if the initiation interval is not equal to the lower bound there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer free, we can set an upper bound on the schedule length. I.e., we can prove when a found modulo schedule with initiation interval larger than the lower bound is optimal.</p><p>Experiments have been conducted to show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms.<em></em></p><p><em>This work has been supported by The Swedish national graduate school in computer science (CUGS) and Vetenskapsrådet (VR).</em></p> Code generation compilers instruction scheduling register allocation spill code generation modulo scheduling integer linear programming genetic programming. Computer science Datavetenskap
12	Decoupled approaches to register and software controlled memory allocations Diouf, Boubacar 15 December 2011 (has links) (PDF) Despite the benefit of the memory hierarchy, it is still essential, in order to reduce accesses to higher levels of memory, to have an efficient usage of registers and local memories (also called scratchpad memories) present in most embedded processors, graphical processors (GPUs) and network processors. During the compilation, from a source language to an executable code, there are two optimizations that are of utmost importance: the register allocation and the local memory allocation. In this thesis's report we are interested in decoupled approaches, solving separately the allocation and assignment problems, that helps to improve the quality of the register and local memory allocations. In the first part of this thesis we are interested in two aspects of the register allocation problem: the improvements of the just-in-time (JIT) register allocation and the spill minimization problem. We introduce the split register allocation which leverages the decoupled approach to improve register allocation in the context of JIT compilation. We experimentally validate the effectiveness of split register allocation and its portability with respect to register count variations, relying on annotations whose impact on the bytecode size is negligible. We introduce a new decoupled approach, called iterated-optimal allocation, which focus on the spill minimization problem. The iterated-optimal allocation algorithm achieves results close to optimal while offering pseudo-polynomial guarantees for SSA programs and fast allocations on general programs. In the second part of this thesis, we study how a decoupled local memory allocation can be proposed in light of recent progresses in register allocation. We first validate our intuition for decoupled approach to local memory allocation. Then, we study the local memory allocation in a more theoretical way setting the junction between local memory allocation for linearized programs and weighted interval graph coloring. We design and analyze a new variant of the ship-building problem called the submarine-building problem. We show that this problem is NP-complete on interval graphs, while it is solvable in linear time for proper interval graphs, equivalent to unit interval graphs. The submarine-building problem is the first problem that is known to be NP-complete on interval graphs, while it is solvable in linear time for unit interval graphs. In the third part of this thesis, we propose a heuristic-based solution, the clustering allocator, which decouples the local memory allocation problem and aims to minimize the allocation cost. The clustering allocator while devised for local memory allocation, it appears to be a very good solution to the register allocation problem. After many years of separation, this new algorithm seems to be a bridge to reconcile the local memory allocation and the register allocation problems. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Compilation Register allocation Memory allocations Scratchpad Weighted graphs coloring Submarine-building problem
13	Compiler Optimizations for Multithreaded Multicore Network Processors Zhuang, Xiaotong 07 July 2006 (has links) Network processors are new types of multithreaded multicore processors geared towards achieving both fast processing speed and flexibility of programming. The architecture of network processors considers many special properties for packet processing, including multiple threads, multiple processor cores on the same chip, special functional units, simplified ISA and simplified pipeline, etc. The architectural peculiarities of network processors raise new challenges for compiler design and optimization. Due to very high clocking speeds, the CPU memory gap on such processors is huge, making registers extremely precious. Moreover, the register file is split into two banks, and for any ALU instruction, the two source operands must come from different banks. We present and compare three different approaches to do register allocation and bank assignment. We also address the problem of sharing registers across threads in order to maximize the utilization of hardware resources. The context switches on the IXP network processor only happen when long latency operations are encountered. As a result, context switches are highly frequent. Therefore, the designer of the IXP network processor decided to make context switches extremely lightweight, i.e. only the program counter(PC) is stored together with the context. Since registers are not saved and restored during context switches, it becomes difficult to share registers across threads. For a conventional processor, each thread can assume that it can use the entire register file, because registers are always part of the context. However, with lightweight context switch, each thread must take a separate piece of the register file, making register usage inefficient. Programs executing on network processors typically have runtime constraints. Scheduling of multiple programs sharing a CPU must be orchestrated by the OS and the hardware using certain sharing policies. Real time applications demand a real time aware OS kernel to meet their specified deadlines. However, due to stringent performance requirements on network processors, neither OS nor hardware mechanisms is typically feasible. In this work, we demonstrate that a compiler approach could achieve some of the OS scheduling and real time scheduling functionalities without introducing a hefty overhead. Runtime constraints Register allocation Compiler optimization Multitheading Multicore Network processors Registers (Computers) Compilers (Computer programs) High performance processors Packet switching (Data transmission)
14	Integrating A New Cluster Assignment And Scheduling Algorithm Into An Experimental Retargetable Code Generation Framework Vasanta Lakshmi, Kommineni 05 1900 (has links) This thesis presents a new uniﬁed algorithm for cluster assignment and acyclic region scheduling in a partitioned architecture, and preliminary results on its integration into an experimental retargetable code generation framework. The object of this work is twofold. Firstly, to validate for the ﬁrst time, and evaluate the framework which is almost automatic, so as to gain insights into possibilities for improvement. This was done by using as a baseline for comparison, highly optimized code generated by the handcrafted compiler of Texas Instruments, the TI Code Composer Studio V2. The second objective is to compare the integrated scheduling algorithm with another well known algorithm which performs scheduling and cluster allocation in the same phase, the Uniﬁed Assign and Schedule (UAS) algorithm. The computational complexity of the two algorithms is comparable. The components of the framework experimented with here are (a) a tree transformer generator, which takes as input, a description of the instruction set of the target architecture in the form of a regular tree grammar augmented with actions and attributes, and outputs a data dependency directed acyclic graph, (b) the well known public domain IMPACT front end for C, (c)a microarchitecture description module which uses a modiﬁcation of the HMDES architecture description language of the TRIMARAN project, to include cluster information, and (d) a combined cluster allocator and acyclic region scheduler and a register allocator designed and implemented by us. Experiments have been carried out on creating the proper interfaces for all the modules to work together, and the targeting of the tool to the Texas Instruments TMS320c62x architecture to establish the feasibility of this approach. We present the results of our implementation on a set of benchmarks and some sorting programs and compare them with those obtained from the state-of-the-art TI compiler. The performance without software pipelining shows that our executables take on the average 1.4 times the execution time as that of those generated by the TI compiler. The integrated scheduling algorithm proposed in this thesis performs at least as well as the UAS algorithm and sometimes better by as much as 9 % in terms of the parallelism obtained. Computer Architecture HMDES Architecture Tree Transformer Generator Register Allocation Algorithm Integrated Scheduling Algorithm TI Code Retargetable Code Generation Framework Target Processor Computer Science
15	Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling Techniques Nagarakatte, Santosh G 08 1900 (has links) Instruction scheduling and Software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as very Long Instruction Word and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem is for general purpose applications and architectures, while the second is for media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining which is an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure and hence it may be required to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from Perfect Club and SPEC benchmarks. The second major contribution of the thesis deals with the code size aware scheduling of stream programs. Large scale synchronous dataflow graphs (SDF’s) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution time buffer requirements of the data channel and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-off and performs better than the earlier approaches. Compilers Parallel Processing Instruction Scheduling Software Pipelining Spill Code Scheduling Stream Programs - Scheduling Genetic Algorithms Integer Linear Programming Software Pipelined Loops Register Allocation Computer Science
16	Decoupled (SSA-based) register allocators : from theory to practice, coping with just-in-time compilation and embedded processors constraints Colombet, Quentin 07 December 2012 (has links) (PDF) My thesis deals with register allocation. During this phase, the compiler has to assign variables of the source program, in an arbitrary big number, to actual registers of the processor, in a limited number k. Recent works, for instance the thesis of F. Bouchez and S. Hack, have shown that it is possible to split in two different decoupled step this phase: the spill - store the variables into memory to release registers - followed by the registers assignment. These works demonstrate the feasibility of this decoupling relying on a theoretic framework and some assumptions. In particular, it is sufficient to ensure after the spill step that the number of variables simultaneously live is below k.My thesis follows these works by showing how to apply this kind of approach when real-world constraints come in play: instructions encoding, ABI (application binary interface), register aliasing. Different approaches are proposed. They allow either to ignore these problems or to directly tackle them into the theoretic framework. The hypothesis of the models and the proposed solutions are evaluated and validated using a thorough experimental study with the compiler of STMicroelectronics. Finally, all these works have been done with the constraints of modern compilers in mind, the JIT (just-in-time) compilation, where the compilation time et the memory footprint of the compiler are key factors. We strive to offer solutions that cope with these criteria or improve the result until a given budget is reached. We, in particular, used the SSA (static single assignment) form to define algorithm like tree scan that generalizes linear scan based approaches proposed for JIT compilation. Decoupled register allocation JIT SSA Register aliasing Precoloring Spilling
17	High Performance by Exploiting Information Locality through Reverse Computing Bahi, Mouad 21 December 2011 (has links) (PDF) The main resources for computation are time, space and energy. Reducing them is the main challenge in the field of processor performance.In this thesis, we are interested in a fourth factor which is information. Information has an important and direct impact on these three resources. We show how it contributes to performance optimization. Landauer has suggested that independently on the hardware where computation is run information erasure generates dissipated energy. This is a fundamental result of thermodynamics in physics. Therefore, under this hypothesis, only reversible computations where no information is ever lost, are likely to be thermodynamically adiabatic and do not dissipate power. Reversibility means that data can always be retrieved from any point of the program. Information may be carried not only by the data but also by the process and input data that generate it. When a computation is reversible, information can also be retrieved from other already computed data and reverse computation. Hence reversible computing improves information locality.This thesis develops these ideas in two directions. In the first part, we address the issue of making a computation DAG (directed acyclic graph) reversible in terms of spatial complexity. We define energetic garbage as the additional number of registers needed for the reversible computation with respect to the original computation. We propose a reversible register allocator and we show empirically that the garbage size is never more than 50% of the DAG size. In the second part, we apply this approach to the trade-off between recomputing (direct or reverse) and storage in the context of supercomputers such as the recent vector and parallel coprocessors, graphical processing units (GPUs), IBM Cell processor, etc., where the gap between processor cycle time and memory access time is increasing. We show that recomputing in general and reverse computing in particular helps reduce register requirements and memory pressure. This approach of reverse rematerialization also contributes to the increase of instruction-level parallelism (Cell) and thread-level parallelism in multicore processors with shared register/memory file (GPU). On the latter architecture, the number of registers required by the kernel limits the number of running threads and affects performance. Reverse rematerialization generates additional instructions but their cost can be hidden by the parallelism gain. Experiments on the highly memory demanding Lattice QCD simulation code on Nvidia GPU show a performance gain up to 11%. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Reversible computing Performance optimization Information locality Rematerialization Register allocation Spill code Instruction-level parallelism Thread-l
18	Decoupled (SSA-based) register allocators : from theory to practice, coping with just-in-time compilation and embedded processors constraints / Allocation de registres découplée (basée sur la formulation SSA) : De la théorie à la pratique, faire face aux contraintes liées à la compilation juste à temps et aux processeurs embarqués Colombet, Quentin 07 December 2012 (has links) Ma thèse porte sur l’allocation de registres. Durant cette étape, le compilateur doit assigner les variables du code source, en nombre arbitrairement grand, aux registres physiques du processeur, en nombre limité k. Des travaux récents, notamment ceux des thèses de F. Bouchez et S. Hack, ont montré qu’il était possible de séparer de manière complètement découplée cette étape en deux phases : le vidage (spill) – stockage de variables en mémoire pour libérer des registres – suivi de l’assignation aux registres proprement dite. Ces travaux démontraient la faisabilité de ce découpage en s’appuyant sur un cadre théorique et certaines hypothèses simplificatrices. En particulier, il est suffisant de s’assurer qu’après le spill, le nombre de variables simultanément en vie est inférieur à k.Ma thèse fait suite à ces travaux en montrant comment appliquer ce type d’approche dans un cadre réaliste, en prenant en compte les contraintes liées à l’encodage des instructions, à l’ABI (application binary interface), aux bancs de registres avec aliasing. Différentes approches sont proposées qui permettent soit de s’affranchir des problèmes précités, soit de les prendre en compte directement dans le modèle théorique. Les hypothèses des modèles et les solutions proposées sont évaluées et validées par une étude expérimentale poussée dans le compilateur de STMicroelectronics. Enfin, l’ensemble de ces travaux a été réalisé avec, en ligne de mire, les contraintes de la compilation moderne, la compilation JIT (just-in-time), où rapidité et consommation mémoire du compilateur sont des facteurs déterminants. Nous nous sommes efforcés d’offrir des solutions satisfaisant ces critères ou améliorant les résultats attendus tant qu’un certain budget n’a pas été dépassé, exploitant en particulier la forme SSA (static single assignment) pour définir des algorithmes de type tree scan qui généralisent les approches de type linear scan, proposées pour le JIT. / My thesis deals with register allocation. During this phase, the compiler has to assign variables of the source program, in an arbitrary big number, to actual registers of the processor, in a limited number k. Recent works, for instance the thesis of F. Bouchez and S. Hack, have shown that it is possible to split in two different decoupled step this phase: the spill - store the variables into memory to release registers - followed by the registers assignment. These works demonstrate the feasibility of this decoupling relying on a theoretic framework and some assumptions. In particular, it is sufficient to ensure after the spill step that the number of variables simultaneously live is below k.My thesis follows these works by showing how to apply this kind of approach when real-world constraints come in play: instructions encoding, ABI (application binary interface), register aliasing. Different approaches are proposed. They allow either to ignore these problems or to directly tackle them into the theoretic framework. The hypothesis of the models and the proposed solutions are evaluated and validated using a thorough experimental study with the compiler of STMicroelectronics. Finally, all these works have been done with the constraints of modern compilers in mind, the JIT (just-in-time) compilation, where the compilation time et the memory footprint of the compiler are key factors. We strive to offer solutions that cope with these criteria or improve the result until a given budget is reached. We, in particular, used the SSA (static single assignment) form to define algorithm like tree scan that generalizes linear scan based approaches proposed for JIT compilation. Allocation de registres découplée JIT SSA Aliasing de registres Contraintes de registres (precoloring) Vidage en mémoire (spilling) Decoupled register allocation JIT SSA Register aliasing Precoloring Spilling
19	Decoupled approaches to register and software controlled memory allocations / Approches découplées aux problèmes d'allocations de registres et de mémoires locales Diouf, Boubacar 15 December 2011 (has links) Malgré la hiérarchie mémoire utilisée dans les ordinateurs modernes, il convient toujours d'optimiser l'utilisation des registres du processeur et des mémoires locales gérées de manières logicielles (mémoires locales) présentes dans beaucoup de systèmes embarqués, de processeurs graphiques (GPUs) et de multiprocesseurs. Lors de la compilation, d'un code source vers un langage machine, deux optimisations de la mémoire revêtent une importance capitale : l'allocation de registres et l'allocation de mémoires locales. Dans ce manuscrit de thèse nous nous intéressons à des approches découplées, qui traitent séparément les problèmes d'allocation et d'assignation, permettant d'améliorer les allocations de registres et de mémoires locales. Dans la première partie de la thèse, nous nous penchons sur le problème de l'allocation de registres. Tout d'abord, nous proposons dans le contexte des compilateurs-juste-à-temps, une allocation de registres fractionnées (split register allocation). Avec cette approche l'allocation de registres est effectuée en deux étapes: une faite durant la phase de compilation statique et l'autre pendant la phase de compilation dynamique. Ce qui permet de réduire le temps d'exécution des programmes avec un impact négligeable sur le temps de compilation. Ensuite Nous introduisons une allocation de registres incrémentale qui permet de résoudre d'une manière quasi-optimale le problème d'allocation. Cette méthode est pseudo-polynomiale alors que le problème d'allocation est NP-complet même à l'intérieur d'un « basic block ». Dans la deuxième partie de la thèse nous nous intéressons au problème de l'allocation de mémoires locales. Au vu des dernières avancées dans le domaine de l'allocation de registres, nous étudions dans quelle mesure le problème d'allocation pourrait être séparé de celui de l'assignation dans le contexte des mémoires locales. Dans un premier temps nous validons expérimentalement que les problèmes d'allocation et d'assignation peuvent être résolus séparément. Ensuite, nous procédons à une étude plus théorique d'une approche découplée de l'allocation de mémoires locales. Cela permet d'introduire de nouveaux résultats sur le « submarine-building problem », une variante du « ship-building problem », que nous avons défini. L'un de ces résultats met en évidence pour la première fois une différence de complexité (P vs. NP-complet) entre les graphes d'intervalles et les graphes d'intervalles unitaires. Dans la troisième partie de la thèse nous proposons une nouvelle heuristique, appelée « clustering allocator » fondée sur la construction de sous-graphes stables d'un graphe d'interférence, permettant de découpler aussi bien le problème d'allocation pour les registres que pour les mémoires locales. Cette nouvelle heuristique se veut le pont qui permettra de réconcilier les problèmes d'allocations de registres et de mémoires locales. / Despite the benefit of the memory hierarchy, it is still essential, in order to reduce accesses to higher levels of memory, to have an efficient usage of registers and local memories (also called scratchpad memories) present in most embedded processors, graphical processors (GPUs) and network processors. During the compilation, from a source language to an executable code, there are two optimizations that are of utmost importance: the register allocation and the local memory allocation. In this thesis's report we are interested in decoupled approaches, solving separately the allocation and assignment problems, that helps to improve the quality of the register and local memory allocations. In the first part of this thesis we are interested in two aspects of the register allocation problem: the improvements of the just-in-time (JIT) register allocation and the spill minimization problem. We introduce the split register allocation which leverages the decoupled approach to improve register allocation in the context of JIT compilation. We experimentally validate the effectiveness of split register allocation and its portability with respect to register count variations, relying on annotations whose impact on the bytecode size is negligible. We introduce a new decoupled approach, called iterated-optimal allocation, which focus on the spill minimization problem. The iterated-optimal allocation algorithm achieves results close to optimal while offering pseudo-polynomial guarantees for SSA programs and fast allocations on general programs. In the second part of this thesis, we study how a decoupled local memory allocation can be proposed in light of recent progresses in register allocation. We first validate our intuition for decoupled approach to local memory allocation. Then, we study the local memory allocation in a more theoretical way setting the junction between local memory allocation for linearized programs and weighted interval graph coloring. We design and analyze a new variant of the ship-building problem called the submarine-building problem. We show that this problem is NP-complete on interval graphs, while it is solvable in linear time for proper interval graphs, equivalent to unit interval graphs. The submarine-building problem is the first problem that is known to be NP-complete on interval graphs, while it is solvable in linear time for unit interval graphs. In the third part of this thesis, we propose a heuristic-based solution, the clustering allocator, which decouples the local memory allocation problem and aims to minimize the allocation cost. The clustering allocator while devised for local memory allocation, it appears to be a very good solution to the register allocation problem. After many years of separation, this new algorithm seems to be a bridge to reconcile the local memory allocation and the register allocation problems. Compilation Allocation de registres Allocation de mémoire Scratchpad Coloration de graphes pondérés Problème de submarine-building Compilation Register allocation Memory allocations Scratchpad Weighted graphs coloring Submarine-building problem
20	Integrated Software Pipelining Eriksson, Mattias January 2009 (has links) In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation and spilling. As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms. We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge this is the first time anybody has optimally solved the modulo scheduling problem for clustered architectures with instruction selection and cluster assignment integrated. We also show that optimal spilling is closely related to optimal register allocation when the register files are clustered. In fact, optimal spilling is as simple as adding an additional virtual register file representing the memory and have transfer instructions to and from this register file corresponding to stores and loads. Our algorithm for modulo scheduling iteratively considers schedules with increasing number of schedule slots. A problem with such an iterative method is that if the initiation interval is not equal to the lower bound there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer free, we can set an upper bound on the schedule length. I.e., we can prove when a found modulo schedule with initiation interval larger than the lower bound is optimal. Experiments have been conducted to show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms. This work has been supported by The Swedish national graduate school in computer science (CUGS) and Vetenskapsrådet (VR). Code generation compilers instruction scheduling register allocation spill code generation modulo scheduling integer linear programming genetic programming. Computer Sciences Datavetenskap (datalogi)

Search results