81 |
Techniques for Managing Irregular Control Flow on GPUs. Hbeika, Jad. 25 June 2020.
GPGPU is a highly multithreaded throughput architecture that can deliver high speedups for regular applications while remaining energy efficient. In recent years there has been much focus on tuning irregular applications and/or the GPU architecture to achieve similar benefits for irregular applications, as well as efforts to extract data parallelism from task-parallel applications. In this work we tackle both problems.

The first part of this work tackles the problem of control divergence in GPUs. The GPGPU's SIMT execution model is ineffective for workloads with irregular control flow because GPGPUs serialize the execution of divergent paths, which causes a loss of thread-level parallelism (TLP). Previous works created new warps based on the control path threads follow, created different warps for the different paths, or ran multiple narrower warps in parallel. While all of these solutions showed speedup for irregular workloads, they imposed some performance loss on regular workloads. In this work we propose a more fine-grained approach to exploiting intra-warp convergence: rather than requiring threads to execute the same code path, opcode-convergent threads execute the same instruction, but with potentially different operands. Based on this new definition, we find that divergent control blocks within a warp exhibit substantial opcode convergence. We build a compiler that analyzes divergent blocks and identifies their common streams of opcodes, and we modify the GPU architecture so that these common instructions are executed as convergent instructions. Using software simulation, we achieve a 17% speedup over the baseline GPGPU for irregular workloads while incurring no performance loss on regular workloads.

In the second part we propose techniques for extracting data parallelism from irregular, task-parallel applications in order to take advantage of the massive parallelism provided by the GPU. Our technique divides each task into multiple sub-tasks, each performing less work and touching a smaller memory footprint. Our framework performs locality-aware scheduling that minimizes the memory footprint of each warp (a set of threads executing in lock-step). We evaluate our framework on three task-parallel benchmarks and show that we can achieve significant speedups over optimized GPU code.
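As a rough illustration of the opcode-convergence idea, the toy Python sketch below compares baseline SIMT serialization of a divergent branch against issuing the two paths' shared opcodes once. The LCS-based merge and the example opcode streams are assumptions for illustration only; the thesis's actual compiler analysis and hardware mechanism are not reproduced here.

```python
# Toy model of opcode convergence, assuming an LCS-based merge of the two
# divergent paths' opcode streams (an assumption; the thesis's compiler
# analysis may differ). Under baseline SIMT, a divergent warp executes
# both paths serially; with opcode convergence, instructions common to
# both streams issue once for all threads, with per-thread operands.

def lcs_length(a, b):
    """Length of the longest common subsequence of two opcode streams."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

# Hypothetical opcode streams for the two sides of a divergent branch.
then_ops = ["LD", "MUL", "ADD", "ST"]
else_ops = ["LD", "MUL", "SUB", "ADD", "ST"]

baseline_issues = len(then_ops) + len(else_ops)   # paths fully serialized
common = lcs_length(then_ops, else_ops)           # opcodes shared by both paths
converged_issues = baseline_issues - common       # common opcodes issue once

print(f"baseline: {baseline_issues} issue slots")            # 9
print(f"opcode-convergent: {converged_issues} issue slots")  # 5
print(f"potential speedup: {baseline_issues / converged_issues:.2f}x")
```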
|
82 |
Volumetric Terrain Generation on the GPU: A Modern GPGPU Approach to Marching Cubes. Pethrus Engström, Ludwig. January 2015.
Volumetric visualization has attracted growing interest in recent years. It was long infeasible in an interactive environment due to its complexity in 3D space, but today's technology and access to the power of the graphics processing unit (GPU) have made it feasible to render volumetric data interactively. This thesis explores the possibility of creating and rendering large volumetric terrain using an implementation of Marching Cubes on the GPU. With the advent of general-purpose computing on the GPU (GPGPU), it has become far easier to implement traditional CPU tasks on the GPU, and newly available functions in DirectX make a simpler GPU implementation possible using global buffers. Three implementations are created inside the Unity game engine using compute shaders and compared on creation time, render time, and memory consumption. A deeper analysis of the time distribution suggests that Unity introduces some overhead, since copying buffers from the GPU to the CPU is time-consuming; render times did improve, however, thanks to Unity's culling and optimization techniques. The system could be used in applications such as games or medical visualization. Finally, some future improvements to culling and level-of-detail (LOD) techniques are discussed.
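To make the data-parallel step concrete, the numpy sketch below computes the per-voxel cube index that Marching Cubes uses to look up triangles; on the GPU this is roughly the work one compute-shader thread would do per voxel. The density function, grid size, and iso level are illustrative assumptions, and the triangle-emission stage is omitted.

```python
# A minimal numpy sketch of the data-parallel core of Marching Cubes:
# sample a density field on a grid and pack each voxel's 8 corner signs
# into an 8-bit cube index. Grid size, iso level, and the density
# function are assumptions for illustration.
import numpy as np

N = 32      # grid resolution (assumption)
iso = 0.0   # isosurface threshold (assumption)

# Density field: signed distance to a sphere of radius 0.4 in a unit cube.
axis = np.linspace(-0.5, 0.5, N + 1)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
density = np.sqrt(x**2 + y**2 + z**2) - 0.4

# Corner occupancy: 1 where the field is below the iso level.
inside = (density < iso).astype(np.uint8)

# Pack the 8 corner bits of every voxel into a cube index (0..255),
# using the conventional Marching Cubes corner ordering.
cube_index = np.zeros((N, N, N), dtype=np.uint8)
corners = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
           (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
for bit, (i, j, k) in enumerate(corners):
    cube_index |= inside[i:N + i, j:N + j, k:N + k] << bit

# Voxels with index 0 or 255 are entirely outside/inside: no geometry.
active = np.count_nonzero((cube_index != 0) & (cube_index != 255))
print(f"{active} of {N**3} voxels intersect the isosurface")
```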
|
83 |
Parallel Construction of Local Clearance Triangulations. Gummesson, Simon; Johnson, Mikael. January 2019.
The usage of navigation meshes for path planning in games and other domains is a common approach. One type of navigation mesh that has recently been developed is the Local Clearance Triangulation (LCT). The overall aim of the LCT is to construct a triangulation in such a way that a property called the local clearance can be used to calculate a path more efficiently and cheaply. At the time of writing, only one solution exists that creates an LCT, and it uses only the CPU. Since the process of creating an LCT involves the insertion of many points and edge flips that each affect only a local area, it is worth investigating the potential performance gain of using the GPU. Objectives: The objective of the thesis is to develop a GPU version based on the current CPU LCT solution and to investigate in which cases the proposed GPU algorithm performs better. Methods: A GPU version and a CPU version of the proposed algorithm have been developed to measure the performance gain of using the GPU; there are no algorithmic differences between these versions. To measure the performance of the algorithm, two tests were constructed: the Object Insertion test measures the time it takes to build an LCT using generated test maps, and the Internal test measures the internal performance of the algorithm. The GPU algorithm was also compared against an LCT library called Triplanner. Results: The proposed algorithm performed better on larger maps when implemented on a GPU than in a CPU implementation of the same algorithm, and the GPU implementation was faster than the Triplanner on some of the larger maps. Conclusions: An algorithm that builds an LCT from scratch is presented. The results show that running the proposed algorithm on the GPU substantially increases its performance compared to a CPU implementation.
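The locality that makes a GPU formulation plausible can be seen in the flip test itself: every candidate edge can be examined independently. The sketch below implements the classical incircle predicate behind Delaunay-style edge flips; the LCT's own flip criterion layers clearance constraints on top of this, which are not reproduced here.

```python
# The incircle test that drives edge flips: a purely local predicate,
# which is what makes per-edge GPU parallelism attractive. This is the
# textbook Delaunay form, not the LCT's full clearance-aware criterion.
import numpy as np

def incircle(a, b, c, d):
    """> 0 if point d lies inside the circumcircle of triangle (a, b, c),
    with (a, b, c) given in counter-clockwise order."""
    m = np.array([
        [a[0] - d[0], a[1] - d[1], (a[0] - d[0])**2 + (a[1] - d[1])**2],
        [b[0] - d[0], b[1] - d[1], (b[0] - d[0])**2 + (b[1] - d[1])**2],
        [c[0] - d[0], c[1] - d[1], (c[0] - d[0])**2 + (c[1] - d[1])**2],
    ])
    return np.linalg.det(m)

# Edge (b, c) is shared by triangles (a, b, c) and (b, d, c): flip it
# when d violates the empty-circumcircle property of (a, b, c).
a, b, c, d = (0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.2, 1.2)
print("flip edge" if incircle(a, b, c, d) > 0 else "keep edge")
```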
|
86 |
A Real-Time Predictive Vehicular Collision Avoidance System on an Embedded General-Purpose GPU. Hegman, Andrew. 10 August 2018.
Collision avoidance is an essential capability for autonomous and assisted-driving ground vehicles. In this work, we developed a novel model-predictive-control-based intelligent collision avoidance (CA) algorithm for a multi-trailer industrial ground vehicle, implemented on a general-purpose graphics processing unit (GPGPU). The CA problem is formulated as a multi-objective optimal control problem and solved in real time using a limited look-ahead control scheme. Through hardware-in-the-loop simulations and experimental results, we demonstrate that the proposed algorithm, using NVIDIA's CUDA framework and the NVIDIA Jetson TX2 development platform, is capable of dynamically assisting drivers and keeping the vehicle a safe distance from detected obstacles on the fly. We have demonstrated that a GPGPU, paired with an appropriate algorithm, can be the key enabler in relieving the computational burden commonly associated with model-based control problems, making them suitable for real-time applications.
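As a sketch of how a limited look-ahead scheme maps onto a GPGPU, the numpy example below rolls out many candidate control sequences over a short horizon and scores them independently, one candidate per would-be GPU thread, then applies the best first action. The unicycle model, cost terms, and sampling strategy are assumptions for illustration, not the thesis's formulation.

```python
# Sampling-based limited look-ahead evaluation: roll out K candidate
# control sequences over H steps and score each independently. On a
# GPGPU, each candidate's rollout is an independent thread of work.
# Model, costs, and parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
H, K, dt = 10, 512, 0.1           # horizon steps, candidates, time step

# State per candidate: [x, y, heading]; unicycle model at constant speed.
state = np.tile(np.array([0.0, 0.0, 0.0]), (K, 1))
speed = 5.0
goal = np.array([10.0, 0.0])
obstacle, safe_r = np.array([5.0, 0.2]), 1.5

# Candidate steering-rate sequences, one row per candidate.
controls = rng.uniform(-0.5, 0.5, size=(K, H))

cost = np.zeros(K)
for t in range(H):
    state[:, 2] += controls[:, t] * dt                    # integrate heading
    state[:, 0] += speed * np.cos(state[:, 2]) * dt       # integrate position
    state[:, 1] += speed * np.sin(state[:, 2]) * dt
    cost += np.linalg.norm(state[:, :2] - goal, axis=1)   # progress-to-goal term
    gap = np.linalg.norm(state[:, :2] - obstacle, axis=1)
    cost += 100.0 * np.maximum(0.0, safe_r - gap)         # penalize intrusions

best = int(np.argmin(cost))
print(f"apply steering rate {controls[best, 0]:+.3f} rad/s (cost {cost[best]:.1f})")
```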
|
87 |
Millipyde: A Cross-Platform Python Framework for Transparent GPU Acceleration. Asbury, James B. 01 December 2021.
The prevalence of general-purpose GPU computing continues to grow as a wider variety of problems benefit from GPU acceleration. This acceleration often suffers from a high barrier to entry, however, due to the complexity of software tools that map closely to the underlying GPU hardware, the fast-changing landscape of GPU environments, and the fragmentation of tools and languages that support only specific platforms. Because of this, new solutions will continue to be needed to make GPGPU acceleration more accessible to the developers who can benefit from it. AMD's new cross-platform development ecosystem, ROCm, shows promise for developing applications and solutions that work across systems running both AMD and non-AMD GPU computing hardware.
This thesis presents Millipyde, a framework for GPU acceleration in Python using AMD's ROCm. Millipyde includes two new types, the gpuarray and gpuimage, as well as three new constructs for building GPU-accelerated applications: the Operation, the Pipeline, and the Generator. With these tools, Millipyde aims to make it easier for engineers and researchers to write GPU-accelerated code in Python. Millipyde can also schedule work across many GPUs in complex multi-device environments. These capabilities are demonstrated in a sample application that augments images on-device for machine learning. Our results show that Millipyde can make individual image-related transformations up to around 200 times faster than their CPU-only equivalents. Constructs such as Millipyde's Pipeline were able to further improve performance in certain situations, performing best when allowed to transparently schedule work across multiple devices.
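The Operation/Pipeline composition the abstract describes can be sketched in a few lines of plain Python. The code below illustrates the design pattern only; it does not use Millipyde's actual API, whose signatures are not given in the abstract and may differ.

```python
# A pure-Python sketch of the Operation/Pipeline pattern described in the
# abstract. This is an illustration of the design, NOT Millipyde's real
# API; class names are borrowed from the abstract, signatures are assumed.
import numpy as np

class Operation:
    """Wraps a named transformation so pipelines can compose it."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self, data):
        return self.fn(data)

class Pipeline:
    """Applies a sequence of Operations in order; a GPU-backed version
    could dispatch each stage to whichever device is free."""
    def __init__(self, operations):
        self.operations = operations
    def run(self, data):
        for op in self.operations:
            data = op.run(data)
        return data

# Image-augmentation stages mimicking the thesis's machine-learning use case.
to_grayscale = Operation("grayscale", lambda img: img.mean(axis=2))
flip_lr      = Operation("fliplr",    lambda img: img[:, ::-1])
normalize    = Operation("normalize", lambda img: img / 255.0)

pipeline = Pipeline([to_grayscale, flip_lr, normalize])
image = np.random.default_rng(1).integers(0, 256, size=(64, 64, 3)).astype(float)
augmented = pipeline.run(image)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```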
|
88 |
An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs. Liu, Lifeng. 31 May 2017.
No description available.
|
89 |
Optimal Loop Unrolling for GPGPU Programs. Sreenivasa Murthy, Giridhar. 30 September 2009.
No description available.
|
90 |
Optimizing All-to-All and Allgather Communications on GPGPU Clusters. Singh, Ashish Kumar. 25 June 2012.
No description available.
|