1 |
Code generation and adaptive control divergence management for light weight SIMT processors
Gupta, Meghana. 27 May 2016
The energy costs of data movement are limiting the performance scaling of future generations of high-performance computing architectures targeted at data-intensive applications. The result has been a resurgence of interest in processing-in-memory (PIM) architectures. This challenge has spawned the development of a scalable, parametric data-parallel architecture referred to as the Heterogeneous Architecture Research Prototype (HARP), a single instruction multiple thread (SIMT) architecture for integration into DRAM systems, particularly 3D memory stacks, as a distinct processing layer that exploits the enormous internal memory bandwidth. However, this potential can only be realized with an optimizing compilation environment. This thesis addresses this challenge by i) constructing an open source compiler for HARP, and ii) integrating optimizations for handling control flow divergence for HARP instances. The HARP compiler is built using the LLVM open source compiler infrastructure. Apart from traditional code generation, the HARP compiler backend handles unique challenges associated with the HARP instruction set. Chief among them is code generation for control divergence management techniques. The HARP architecture and compiler support i) a hardware reconvergence stack and ii) predication to handle divergent branches. The HARP compiler addresses several challenges associated with generating code for these two control divergence management techniques and implements multiple analyses and transformations for code generation. Each technique has its own advantages and disadvantages, depending on whether the conditional branch is likely to be unanimous or not. Two decision frameworks, guided by static analysis and dynamic profile information, are implemented to choose between the control divergence management techniques by analyzing the nature of the conditional branches and using this information during compilation.
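The trade-off between the two divergence management techniques can be made concrete with a toy cost model. The sketch below is an illustration only, not HARP's instruction set or the thesis's decision framework: the function name `issued_instructions`, the one-instruction branch overhead, and the path lengths are all assumptions chosen for the example. It counts how many instructions one group of threads issues for a single if/else region under each technique.

```python
def issued_instructions(conds, then_len, else_len, use_predication):
    """Count instructions issued for one if/else region (toy model).

    conds: per-thread branch outcomes (list of bool).
    then_len, else_len: instruction counts of the two paths (assumed values).
    """
    if use_predication:
        # Predication: both paths are always issued; threads whose
        # predicate is false simply do not commit results.
        return then_len + else_len
    # Branch with a reconvergence stack (assumed overhead: 1 branch
    # instruction). Only paths taken by at least one thread are issued.
    cost = 1
    if any(conds):
        cost += then_len
    if not all(conds):
        cost += else_len
    return cost


unanimous = [True, True, True, True]
divergent = [True, False, True, False]
for name, conds in (("unanimous", unanimous), ("divergent", divergent)):
    predicated = issued_instructions(conds, 3, 5, use_predication=True)
    branched = issued_instructions(conds, 3, 5, use_predication=False)
    print(name, predicated, branched)
```

Under these assumed costs the unanimous branch is cheaper with the reconvergence stack (4 instructions against 8), while the divergent branch is slightly cheaper with predication (8 against 9), which is the kind of asymmetry a static or profile-guided chooser would exploit.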
|
2 |
Adaptation du calcul de la Transformée de Fourier Rapide sur une architecture mixte CPU/GPU intégrée / Adaptation of the Fast Fourier Transform processing on hybrid integrated CPU/GPU architecture
Bergach, Mohamed Amine. 02 October 2015
Intel Core multicore architectures (IvyBridge, Haswell, ...) contain both general-purpose CPU cores (4) and dedicated GPU cores embedded on the same chip (16 and 40, respectively). As part of the activity of Kontron (the company partially funding this CIFRE scholarship), an important objective is to efficiently compute arrays and sequences of fast Fourier transforms (FFTs), such as those found in radar applications, on this architecture. While native (but proprietary) libraries exist for Intel CPUs, nothing comparable is currently available for the GPU part. The aim of the thesis was therefore to define an efficient placement of FFT modules by studying, at a theoretical level, the optimal way to group the computing stages of such an FFT according to data locality on a single computing core. This choice should make processing efficient by matching the available memory size to the size of the data required. The multiplicity of cores then remains exploitable to compute several FFTs in parallel without interference (except for possible bus contention between the CPU and the GPU). We have achieved significant results, both in the implementation of a 1024-point FFT on a SIMD CPU core, expressed in C, and in the implementation of an FFT of the same size on a SIMT GPU core, expressed in OpenCL. In addition, our results make it possible to define rules for automatically synthesizing such solutions, based solely on the size of the FFT (more specifically, its number of stages) and the size of the local memory of a given computing core. The performance obtained is better than that of the native Intel library for CPU, and demonstrates a significant gain in energy consumption on GPU. All these points are detailed in the thesis document. These results are expected to be exploited within Kontron.
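The idea of grouping FFT stages according to local memory size can be sketched as follows. This is a simplified illustration, not the synthesis rules derived in the thesis: it assumes a radix-2 FFT, in which a group of k consecutive stages works on independent sets of 2^k points, so a core that can hold `local_points` points locally can run log2(`local_points`) stages per pass. The function name `plan_fft_passes` and its parameters are hypothetical.

```python
def plan_fft_passes(n_points, local_points):
    """Split the stages of a radix-2 FFT into passes that fit local memory.

    n_points: FFT size (power of two).
    local_points: number of complex points that fit in one core's local
                  memory (power of two, an assumed simplification).
    Returns the number of stages executed in each pass.
    """
    for value in (n_points, local_points):
        if value < 2 or value & (value - 1):
            raise ValueError("sizes must be powers of two, at least 2")
    total_stages = n_points.bit_length() - 1      # log2(n_points)
    stages_per_pass = local_points.bit_length() - 1  # log2(local_points)
    groups = []
    remaining = total_stages
    while remaining > 0:
        group = min(stages_per_pass, remaining)
        groups.append(group)
        remaining -= group
    return groups


print(plan_fft_passes(1024, 32))    # two passes of 5 stages each
print(plan_fft_passes(1024, 64))    # one pass of 6 stages, one of 4
```

A 1024-point FFT has 10 stages, so with room for 32 points per core the sketch yields two passes of 5 stages; each pass boundary corresponds to a round trip through the larger shared memory.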
|
3 |
Parallel Instruction Decoding for DSP Controllers with Decoupled Execution Units
Pettersson, Andreas. January 2019
Applications run on embedded processors are constantly evolving. For the most part they are growing more complex, and the processors have to increase their performance to keep up. In this thesis, an embedded DSP SIMT processor with decoupled execution units is investigated. A SIMT processor exploits the parallelism gained from issuing instructions to functional units or to decoupled execution units. In its basic form, only a single instruction is issued per cycle. If the control of the decoupled execution units becomes too fine-grained, or if the control burden of the master core becomes sufficiently high, the fetching and decoding of instructions can become a bottleneck of the system. This thesis investigates how to parallelize the instruction fetch, decode and issue process. Traditional parallel fetch and decode methods in superscalar and VLIW architectures are investigated, and the benefits and drawbacks of the two are presented and discussed. One superscalar design and one VLIW design are implemented in RTL, and their cost and performance are compared using a benchmark program and synthesis. Both the superscalar and the VLIW designs outperform a baseline scalar processor, as expected, with the VLIW design performing slightly better than the superscalar design. The VLIW design is also able to achieve a higher clock frequency, with an area comparable to that of the superscalar design. This thesis also investigates how instructions can be encoded to lower the decode complexity and increase the speed of issue to decoupled execution units. A number of possible encodings are proposed and discussed. Simulations show that the encodings can considerably lower the time spent issuing to decoupled execution units.
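Why issuing more than one instruction per cycle relieves a fetch and decode bottleneck can be seen in a toy in-order issue model. The sketch below is not the thesis's RTL design or benchmark: the instruction format (a destination register and a list of source registers), the function name `issue_cycles`, and the rule that an instruction cannot issue in the same cycle as an earlier instruction whose result it reads or overwrites are all simplifying assumptions.

```python
def issue_cycles(program, width):
    """Count cycles needed to issue a program in order (toy model).

    program: list of (dest, srcs) tuples, with register names as strings.
    width: maximum number of instructions issued per cycle.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # registers written earlier in this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            # Stop the group on a read-after-write or write-after-write
            # hazard against an instruction issued in the same cycle.
            if dest in written or any(s in written for s in srcs):
                break
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


program = [
    ("r1", ["r0"]),
    ("r2", ["r0"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
]
print(issue_cycles(program, 1), issue_cycles(program, 2))
```

With these four instructions, single issue takes 4 cycles and dual issue takes 3, because only the first two instructions are independent. In a superscalar design this hazard check is done by hardware at run time, whereas a VLIW design relies on the compiler to form the groups ahead of time.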
|
4 |
Evaluating Plan Quality for Multi-Target Brain Radiosurgery: Single Iso Multi-Target vs Single Iso Single Target Planning
Byrne, Justin Joseph. 11 July 2022
No description available.
|
5 |
GPU Volume Voxelization: Exploration of the performance characteristics of different GPU-based implementations
Glukhov, Grigory; Soltan, Aleksandra. January 2019
In recent years, voxel-based modelling has been reintroduced to computer game development thanks to massive improvements in graphics hardware. Nevertheless, polygons continue to be the default building block of 3D objects, introducing a need to transform polygon meshes into voxel-based models; this process is known as voxelization. Efficient voxelization algorithms take advantage of the flexibility and control offered by modern, programmable GPU pipelines. However, the variability in possible approaches raises the question of how different GPU-based implementations affect voxelization performance. This thesis explores the impact of GPU-based improvements by comparing four different implementations of a solid voxelization algorithm. The implementations include a naive transition from the CPU to the GPU, a non-branching execution path approach, data pre-processing, and a combination of the two previous approaches. Benchmarking experiments run on four standard polygonal models and three graphics cards (NVIDIA and AMD) provide runtime and memory usage data for each implementation. A comparative analysis is performed on the basis of this data to determine the performance impact of the GPU-based adjustments to the voxelization algorithm implementation. Results indicate that the non-branching execution path approach yields clear improvements over the naive implementation, while data pre-processing has inconsistent performance and a large initial performance cost; the combination of the two improvements unsurprisingly leads to combined results. Therefore, the conclusive recommendation is to use the non-branching execution path technique for GPU-based improvements.
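What a non-branching execution path looks like can be illustrated on a small piece of solid voxelization. The abstract does not describe the algorithm's internals, so the sketch below is an assumption: it fills one voxel column by toggling an inside/outside state at each surface crossing, first with a data-dependent branch and then with arithmetic only. The function names and the `crossings` input format are hypothetical, and plain Python stands in for shader code purely to show the shape of the transformation.

```python
def fill_column_branching(crossings):
    """crossings[z] is the number of surface crossings at voxel z."""
    column = []
    inside = False
    for count in crossings:
        if count % 2 == 1:        # data-dependent branch
            inside = not inside
        column.append(1 if inside else 0)
    return column


def fill_column_branch_free(crossings):
    """Same result, with the toggle expressed as XOR on the parity bit."""
    column = []
    inside = 0
    for count in crossings:
        inside ^= count & 1       # no branch on the data
        column.append(inside)
    return column


crossings = [0, 1, 0, 0, 1, 0]
print(fill_column_branching(crossings))
print(fill_column_branch_free(crossings))
```

Both functions return `[0, 1, 1, 1, 0, 0]` for this column. On a GPU, the second form avoids having threads in the same group follow different paths when their columns cross the surface at different depths.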
|
6 |
Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices
Mahmoud Khairy A. Abdallah (12894191). 04 July 2022
Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions for achieving considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law.
In the HPC and Deep Learning (DL) domain, the death of single-chip GPU performance scaling will usher in a renaissance in multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technology will enable single-package systems, composed of multiple chiplets, that continue to scale even as per-chip transistor counts do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet, or discrete GPU card, and the scheduling of the threads that use that data are critical factors in system performance and power consumption.
Aside from the supercomputer space, general-purpose compute units are still the main driver of a data center’s total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline’s frontend. Coupled with the hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates that has led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.
In this dissertation, I discuss these new paradigm shifts, addressing the following questions: (1) how do we overcome the non-uniform memory access overhead for next-generation multi-chiplet GPUs in the era of DL-driven workloads; (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and request similarity; and (3) how can we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?
|