11

Hardware Acceleration of Dynamic Binary Translation

Rokicki, Simon 17 December 2018 (has links)
This thesis focuses on the hardware acceleration of processors based on Dynamic Binary Translation (DBT). Such architectures execute binaries by translating and optimizing each instruction at run-time, thanks to a DBT toolchain embedded in the system. This process leads to better resource utilization but also induces execution-time overheads, which affect overall performance. In this thesis we show that using hardware components to accelerate critical parts of the DBT process (first translation, generation of an intermediate representation, and instruction scheduling) drastically reduces the compilation time (around 6 cycles on average to schedule one instruction, versus several hundred for a fully software DBT). We also demonstrate that the proposed approach enables a continuous optimization flow, which offers better energy/performance trade-offs on certain kinds of compute kernels. Finally, the DBT toolchain developed during this work is open-source and available online.
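To make the mechanism concrete, here is a minimal sketch of a translate-and-cache DBT dispatch loop; every name in it is an illustrative assumption, not taken from the thesis toolchain. The two steps the thesis moves into hardware are the IR construction and the instruction scheduling.

```python
# Hypothetical DBT dispatch loop: guest blocks are translated on first
# execution and cached; build_ir and schedule are the steps the thesis
# accelerates in hardware. All names here are illustrative assumptions.

def run_dbt(entry_pc, fetch_block, build_ir, schedule, emit):
    translation_cache = {}          # guest PC -> callable host code
    guest_pc = entry_pc
    while guest_pc is not None:
        if guest_pc not in translation_cache:
            block = fetch_block(guest_pc)      # raw guest instructions
            ir = build_ir(block)               # HW-accelerated in the thesis
            ordered = schedule(ir)             # HW-accelerated in the thesis
            translation_cache[guest_pc] = emit(ordered)
        guest_pc = translation_cache[guest_pc]()   # execute, get next PC
```

The per-instruction recompilation cost quoted in the abstract is spent in `build_ir` and `schedule`, which is why accelerating exactly those steps pays off on every cold block.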
12

Integrated Software Pipelining

Eriksson, Mattias January 2009 (has links)
In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation and spilling.

As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms.

We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge this is the first time anybody has optimally solved the modulo scheduling problem for clustered architectures with instruction selection and cluster assignment integrated.

We also show that optimal spilling is closely related to optimal register allocation when the register files are clustered. In fact, optimal spilling is as simple as adding an additional virtual register file representing the memory and having transfer instructions to and from this register file corresponding to stores and loads.

Our algorithm for modulo scheduling iteratively considers schedules with an increasing number of schedule slots. A problem with such an iterative method is that if the initiation interval is not equal to the lower bound there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer free, we can set an upper bound on the schedule length, i.e., we can prove when a found modulo schedule with an initiation interval larger than the lower bound is optimal.

Experiments have been conducted to show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms.

This work has been supported by The Swedish national graduate school in computer science (CUGS) and Vetenskapsrådet (VR).
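As orientation for readers unfamiliar with ILP-based modulo scheduling, the textbook conditions such a model builds on (the generic formulation, not necessarily the exact constraints of the thesis, which integrates further phases on top) can be written as follows: for a dependence from instruction i to j with latency l and loop-carried distance d, the issue times t must respect the initiation interval II, and no modulo slot may oversubscribe a resource.

```latex
% Generic modulo-scheduling conditions (standard form, for orientation):
\begin{align}
  t_j - t_i &\ \ge\ \ell_{ij} - d_{ij}\cdot \mathit{II}
    && \text{for each dependence edge } (i,j),\\[2pt]
  \big|\{\, i \mid t_i \bmod \mathit{II} = s,\ i \text{ uses resource } k \,\}\big|
    &\ \le\ R_k
    && \text{for each resource } k,\ s \in \{0,\dots,\mathit{II}-1\}.
\end{align}
```

Because II appears in the timing inequality, the problem is solved for one candidate II at a time, which is exactly why the iterative search over schedule slots described in the abstract needs the thesis' upper-bound result to certify optimality.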
13

Integrated Optimal Code Generation for Digital Signal Processors

Bednarski, Andrzej January 2006 (has links)
In this thesis we address the problem of optimal code generation for irregular architectures such as Digital Signal Processors (DSPs).

Code generation consists mainly of three interrelated optimization tasks: instruction selection (with resource allocation), instruction scheduling and register allocation. These tasks have been shown to be NP-hard for most architectures and most situations. A common approach to code generation consists in solving each task separately, i.e. in a decoupled manner, which is easier from a software engineering point of view. Phase-decoupled compilers produce good code quality for regular architectures, but if applied to DSPs the resulting code is of significantly lower performance due to strong interdependences between the different tasks.

We developed a novel method for fully integrated code generation at the basic block level, based on dynamic programming. It handles the most important tasks of code generation in a single optimization step and produces an optimal code sequence. Our dynamic programming algorithm is applicable to small, yet not trivial problem instances with up to 50 instructions per basic block if data locality is not an issue, and up to 20 instructions if we take data locality with optimal scheduling of data transfers on irregular processor architectures into account. For larger problem instances we have developed heuristic relaxations.

In order to obtain a retargetable framework we developed a structured architecture specification language, xADML, which is based on XML. We implemented such a framework, called OPTIMIST, which is parameterized by an xADML architecture specification.

The thesis further provides an Integer Linear Programming formulation of fully integrated optimal code generation for VLIW architectures with a homogeneous register file. Where it terminates successfully, the ILP-based optimizer mostly works faster than the dynamic programming approach; on the other hand, it fails for several larger examples where dynamic programming still provides a solution. Hence, the two approaches complement each other. In particular, we show how the dynamic programming approach can be used to precondition the ILP formulation.

To the best of our knowledge, this is the first time that the main tasks of code generation are solved optimally in a single and fully integrated optimization step that additionally considers data placement in register sets and optimal scheduling of data transfers between different register sets.
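The dynamic-programming idea can be illustrated in miniature: treat the set of already-emitted IR nodes as the state and keep only the cheapest partial schedule per state. The sketch below is a drastic simplification under assumed interfaces (a `cost` callback and a `preds` map), not the OPTIMIST algorithm itself, whose state space additionally tracks time, resources, and register placement.

```python
# Minimal DP over basic-block schedules: state = frozenset of emitted
# IR nodes; only the cheapest schedule reaching each state is kept.
# OPTIMIST's real state space is far richer; this shows the skeleton.

def cheapest_schedule(nodes, preds, cost):
    # nodes: set of IR nodes; preds[n]: set of nodes that must precede n
    # cost(n, seq): assumed cost of emitting n after partial schedule seq
    best = {frozenset(): (0, [])}
    for _ in range(len(nodes)):
        nxt = {}
        for done, (c, seq) in best.items():
            for n in nodes - done:
                if preds[n] <= done:                 # n is ready
                    state = done | {n}
                    cand = (c + cost(n, seq), seq + [n])
                    if state not in nxt or cand[0] < nxt[state][0]:
                        nxt[state] = cand
        best = nxt
    return best[frozenset(nodes)]                    # (cost, schedule)
```

For example, `cheapest_schedule({'a', 'b'}, {'a': set(), 'b': {'a'}}, lambda n, seq: 1)` returns `(2, ['a', 'b'])`. The exponential number of states is what limits exact solutions to the instance sizes quoted in the abstract.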
15

Integrated Scheduling For Clustered VLIW Processors

Nagpal, Rahul 12 1900 (has links)
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance.

Earlier proposals for cluster scheduling fall into two main categories, viz., phase-decoupled scheduling and phase-coupled scheduling, and they focus on clustered architectures which provide inter-cluster communication by an explicit inter-cluster copy operation. However, modern commercial clustered architectures provide snooping capabilities (apart from the support for inter-cluster communication using an explicit MV operation) by allowing some of the functional units to read operands from the register file of some of the other clusters without any extra delay. The phase-decoupled approach to scheduling suffers from the well-known phase-ordering problem, which becomes severe for such a machine model (with snooping) because communication and resource constraints are tightly coupled and thus are exposed only during scheduling. Tight integration of communication and resource constraints further requires taking into account the resource and communication requirements of other instructions ready to be scheduled in the current cycle while binding an instruction, in order to carry out effective binding. However, earlier proposals on integrated scheduling consider instructions and clusters for binding in a fixed order, and thus they show widely varying performance characteristics in terms of execution time and code size. Other shortcomings of earlier integrated algorithms (that lead to suboptimal cluster scheduling decisions) are due to non-consideration of the future communication that may arise due to a binding, and of functional unit binding.

In this thesis, we propose a pragmatic scheme and also a generic graph-matching-based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters, as well as future resource and communication requirements known only at schedule time, to attain significant performance improvement without code size penalty over earlier algorithms. The proposed graph-matching-based framework for cluster scheduling resolves the phase-ordering and fixed-ordering problems associated with scheduling on clustered VLIW architectures. The framework provides a mechanism to exploit the slack of instructions by dynamically varying the freedom available in scheduling an instruction, and hence the cost of scheduling an instruction, using different alternatives to reduce the inter-cluster communication. An experimental evaluation of the proposed framework and some of the earlier proposals is presented in the context of a state-of-the-art commercial clustered architecture.
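A hedged sketch of the kind of integrated binding step the abstract describes: each cycle, ready instructions are considered together, and each is bound to the cluster minimizing an estimated cost that weighs current communication against free functional units and projected future transfers. The interfaces here (`priority`, the cost callbacks) are assumptions for illustration, not the thesis' actual scheme.

```python
# Illustrative integrated cluster-binding step (not the thesis algorithm):
# cluster choice weighs the communication a binding causes now against
# the cross-cluster transfers it is projected to cause later.

def bind_cycle(ready, clusters, fu_free, comm_cost, future_cost):
    bindings = []
    for instr in sorted(ready, key=lambda i: i.priority, reverse=True):
        best_cluster, best_cost = None, None
        for c in clusters:
            if not fu_free(c, instr, bindings):     # a unit must be free
                continue
            cost = comm_cost(instr, c) + future_cost(instr, c, bindings)
            if best_cost is None or cost < best_cost:
                best_cluster, best_cost = c, cost
        if best_cluster is not None:
            bindings.append((instr, best_cluster))  # deferred otherwise
    return bindings
```

Passing the partial `bindings` into the cost callbacks is the point of integration: the cost of binding one instruction depends on what the other ready instructions in the same cycle will need.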
16

Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling Techniques

Nagarakatte, Santosh G 08 1900 (has links)
Instruction scheduling and software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction-level parallelism. They are essential for enhancing instruction-level parallelism in architectures such as Very Long Instruction Word (VLIW) and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem concerns general-purpose applications and architectures, while the second concerns media and graphics applications for tiled and multi-core architectures.

The first problem deals with software pipelining, an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure, and hence it may be required to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from the Perfect Club and SPEC benchmarks.

The second major contribution of the thesis deals with code-size-aware scheduling of stream programs. Large-scale synchronous dataflow graphs (SDFs) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution-time buffer requirements of the data channels and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have a good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-offs and performs better than the earlier approaches.
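The non-dominated sorting at the heart of such a framework (familiar from NSGA-II) ranks candidate schedules into successive Pareto fronts over the two objectives. A minimal sketch, with the schedule encoding omitted and the objective callbacks assumed:

```python
# Rank schedules into Pareto fronts over (code size, buffer requirement).
# code_size and buffer_req are assumed objective functions; front 0 holds
# the non-dominated trade-off points, front 1 the next layer, and so on.

def pareto_fronts(schedules, code_size, buffer_req):
    def dominates(a, b):          # a at least as good everywhere, better somewhere
        return (code_size(a) <= code_size(b)
                and buffer_req(a) <= buffer_req(b)
                and (code_size(a) < code_size(b)
                     or buffer_req(a) < buffer_req(b)))
    fronts, remaining = [], list(schedules)
    while remaining:
        front = [s for s in remaining
                 if not any(dominates(t, s) for t in remaining if t is not s)]
        fronts.append(front)
        remaining = [s for s in remaining if s not in front]
    return fronts
```

A genetic algorithm then selects parents preferentially from the earlier fronts, so the population converges toward the code-size/buffer trade-off curve rather than toward a single weighted objective.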
17

Efficient Resource Usage Modelling

Ramanan, V Janaki 04 1900 (has links) (PDF)
No description available.
19

Machine Learning-Based Instruction Scheduling for a DSP Architecture Compiler : Instruction Scheduling using Deep Reinforcement Learning and Graph Convolutional Networks

Alava Peña, Lucas January 2023 (has links)
Instruction scheduling is a back-end compiler optimisation technique that can provide significant performance gains. It refers to ordering instructions in a particular order to reduce latency for processors with instruction-level parallelism. At present, typical compilers use heuristics to perform instruction scheduling and to solve other related NP-complete problems. This thesis presents a machine learning-based approach that challenges heuristic methods in terms of performance.

In this thesis, a novel reinforcement learning (RL) based model for the instruction scheduling problem is developed, including the modelling of processor features such as forwarding and resource utilisation, and the treatment of the action space. An efficient optimal scheduler is presented for use with a reward function based on optimal schedule length; however, it is not used in the final results, as a heuristic-based reward function was deemed sufficient and faster to compute. Furthermore, an RL agent that interacts with the model of the problem is presented, using three different types of graph neural networks for the state processing: graph convolutional networks, graph attention networks, and graph attention based on the work of Lee et al. A simple two-layer neural network is also used for generating embeddings for the resource utilisation stages.

The proposed solution is validated against the modelled environment, and favourable but not significant improvements were found compared to the most common heuristic method. Furthermore, it was found that having embeddings relating to resource utilisation was very important for the explained variance of the RL models. Additionally, a trained model was tested in an actual compiler; however, no informative results were found, likely due to register allocation or other compiler stages that occur after instruction scheduling. Future work should include improving the scalability of the proposed solution.
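A hypothetical sketch of how such an RL formulation can be set up as an environment: the action selects one ready instruction from the dependence DAG, and the reward is derived from a heuristic score rather than an optimal schedule length, mirroring the reward choice described above. The `dag` and `heuristic_score` interfaces are assumptions for illustration; the thesis' environment additionally models forwarding and resource utilisation.

```python
# Hypothetical environment for RL-based instruction scheduling. The agent
# repeatedly picks a ready instruction; a heuristic-based score serves as
# the terminal reward.

class SchedulingEnv:
    def __init__(self, dag, heuristic_score):
        self.dag = dag                      # assumed: .nodes and .preds(n)
        self.heuristic_score = heuristic_score
        self.scheduled = []

    def ready(self):
        # instructions whose predecessors have all been scheduled
        return [n for n in self.dag.nodes
                if n not in self.scheduled
                and all(p in self.scheduled for p in self.dag.preds(n))]

    def step(self, action):                 # action: index into ready()
        self.scheduled.append(self.ready()[action])
        done = len(self.scheduled) == len(self.dag.nodes)
        reward = self.heuristic_score(self.scheduled) if done else 0.0
        return self.scheduled, reward, done
```

Restricting the action space to the ready list is one common way to handle the legality constraints the abstract refers to as "treatment of the action space".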
20

Register Caching for Energy Efficient GPGPU Tensor Core Computing

Qian, Qiran January 2023 (has links)
The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its adoption of the Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hides the latencies exposed in single-thread execution. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA's Tensor Core (TC), in order to deliver substantially higher throughput for dense matrix computations compared with traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching, borrowed from the realm of CPUs. Register caching captures temporal locality among register operands to reduce energy consumption within a 2-level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction places intensive operand-delivery traffic on the register file banks.

In this study, we delve into the RC design trade-offs in GPGPUs. We undertake a comprehensive exploration of the design space, encompassing a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved. Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for the destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a remarkable reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations.
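A minimal sketch of the look-ahead allocation idea described above, assuming a dict-based cache and a visible window of upcoming instructions; the capacity, eviction policy, and all interfaces are illustrative assumptions, not the paper's design:

```python
# Look-ahead allocation sketch: allocate a register-cache entry for a
# destination operand only if the value is read again within a small
# window of upcoming instructions; otherwise write straight through to
# the main register file and skip a useless allocate/evict pair.

def write_destination(dest_reg, value, window, cache, regfile,
                      capacity=16, lookahead=8):
    regfile[dest_reg] = value               # main register file always updated
    reused_soon = any(dest_reg in instr.sources      # assumed .sources field
                      for instr in window[:lookahead])
    if not reused_soon:
        return                              # bypass the register cache
    if dest_reg not in cache and len(cache) >= capacity:
        cache.pop(next(iter(cache)))        # evict oldest entry (FIFO-ish)
    cache[dest_reg] = value
```

The bypass path is what counters the poor temporal locality of Tensor Core operands: destinations that are never re-read no longer displace entries that would have been reused.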
