81

Techniques for Managing Irregular Control Flow on GPUs

Jad Hbeika (5929730) 25 June 2020 (has links)
GPGPU is a highly multithreaded throughput architecture that can deliver high speed-up for regular applications while remaining energy efficient. In recent years there has been much focus on tuning irregular applications and/or the GPU architecture to achieve similar benefits for irregular applications, as well as efforts to extract data parallelism from task-parallel applications. In this work we tackle both problems.

The first part of this work tackles the problem of control divergence in GPUs. The GPGPU's SIMT execution model is ineffective for workloads with irregular control flow because GPGPUs serialize the execution of divergent paths, which leads to a loss of thread-level parallelism (TLP). Previous works focused on creating new warps based on the control path threads follow, or created different warps for the different paths, or ran multiple narrower warps in parallel. While all previous solutions showed speedup for irregular workloads, they imposed some performance loss on regular workloads. In this work we propose a more fine-grained approach to exploit intra-warp convergence: rather than threads executing the same code path, opcode-convergent threads execute the same instruction, but with potentially different operands. Based on this new definition we find that divergent control blocks within a warp exhibit substantial opcode convergence. We build a compiler that analyzes divergent blocks and identifies the common streams of opcodes. We modify the GPU architecture so that these common instructions are executed as convergent instructions. Using software simulation, we achieve a 17% speedup over a baseline GPGPU for irregular workloads and do not incur any performance loss on regular workloads.

In the second part we suggest techniques for extracting data parallelism from irregular, task-parallel applications in order to take advantage of the massive parallelism provided by the GPU. Our technique involves dividing each task into multiple sub-tasks, each performing less work and touching a smaller memory footprint. Our framework performs locality-aware scheduling that works on minimizing the memory footprint of each warp (a set of threads executing in lock-step). We evaluate our framework with three task-parallel benchmarks and show that we can achieve significant speedups over optimized GPU code.
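To make the notion of opcode convergence concrete, here is a minimal CUDA sketch (not drawn from the thesis) of the pattern the abstract targets: the two sides of a data-dependent branch diverge in control flow, yet both issue the same opcode sequence (a multiply followed by an add) with different operands. A conventional SIMT pipeline serializes the two paths; the compiler and architecture changes described above would execute such common instructions as convergent.

```cuda
#include <cstdio>

// Toy kernel: threads diverge on a data-dependent branch, but both paths
// issue the same opcode sequence (multiply, then add) with different
// operands. A conventional SIMT pipeline serializes the two paths; the
// opcode-convergence idea described above would issue them together.
__global__ void divergent_but_opcode_convergent(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = x[i];
    if (v > 0.0f) {          // taken by some lanes of the warp
        v = v * 2.0f;        // mul
        v = v + 1.0f;        // add
    } else {                 // taken by the remaining lanes
        v = v * -3.0f;       // mul (same opcode, different operand)
        v = v + 7.0f;        // add (same opcode, different operand)
    }
    y[i] = v;
}

int main()
{
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    divergent_but_opcode_convergent<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaDeviceSynchronize();

    printf("y[0]=%f y[1]=%f\n", y[0], y[1]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```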
82

Volumetric Terrain Generation on the GPU : A modern GPGPU approach to Marching Cubes / Volumetrisk terränggenerering på grafikkortet : En modern GPGPU implementation av Marching Cubes

Pethrus Engström, Ludwig January 2015 (has links)
Volumetric visualization is something that has become more interesting during recent years. It was previously not feasible in an interactive environment due to its complexity in 3D space. However, today's technology and access to the power of the graphics processing unit (GPU) have made it feasible to render volumetric data interactively. This thesis explores the possibilities of creating and rendering large volumetric terrain using an implementation of Marching Cubes on the GPU. With the advent of general-purpose computing on the GPU (GPGPU), it has become far easier to implement traditional CPU tasks on the GPU. By utilizing newly available functions in DirectX it is possible to create a simpler implementation on the GPU using global buffers. Three implementations are created inside the Unity game engine using compute shaders. The implementations are then compared based on creation time, render times and memory consumption. A deeper analysis of the time distribution is then presented, which suggests that Unity introduces some overhead since copying buffers from GPU to CPU is time consuming. It did, however, improve render times due to its culling and optimization techniques. The system could be used in applications such as games or medical visualization. Finally, some future improvements for culling and level of detail (LOD) techniques are discussed.
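The thesis implements Marching Cubes with Unity compute shaders; as a language-neutral illustration of why the algorithm maps so well to the GPU, the following CUDA sketch (an assumed stand-in, not the thesis code) shows the first, embarrassingly parallel stage: one thread per voxel cell samples an analytic density field at the cell's eight corners and packs the sign bits into the Marching Cubes case index. The grid size and the sphere density function are arbitrary choices for the example.

```cuda
#include <cstdio>

#define GRID 32  // voxel grid resolution (samples per axis), arbitrary for this sketch

// Simple analytic density field: a sphere of radius 10 centred in the grid.
__device__ float density(int x, int y, int z)
{
    float dx = x - GRID * 0.5f, dy = y - GRID * 0.5f, dz = z - GRID * 0.5f;
    return 10.0f - sqrtf(dx * dx + dy * dy + dz * dz); // >0 inside, <0 outside
}

// First stage of Marching Cubes: one thread per cell samples the density at
// the cell's eight corners and packs the sign bits into an 8-bit case index
// (0 or 255 means the cell produces no triangles and can be skipped).
__global__ void classify_cells(unsigned char* cases)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= GRID - 1 || y >= GRID - 1 || z >= GRID - 1) return;

    unsigned char c = 0;
    for (int corner = 0; corner < 8; ++corner) {
        int cx = x + (corner & 1);
        int cy = y + ((corner >> 1) & 1);
        int cz = z + ((corner >> 2) & 1);
        if (density(cx, cy, cz) > 0.0f) c |= (1 << corner);
    }
    cases[(z * GRID + y) * GRID + x] = c;
}

int main()
{
    unsigned char* cases;
    cudaMallocManaged(&cases, GRID * GRID * GRID);
    cudaMemset(cases, 0, GRID * GRID * GRID);

    dim3 block(8, 8, 8), grid(GRID / 8, GRID / 8, GRID / 8);
    classify_cells<<<grid, block>>>(cases);
    cudaDeviceSynchronize();

    int active = 0;
    for (int i = 0; i < GRID * GRID * GRID; ++i)
        if (cases[i] != 0 && cases[i] != 255) ++active;
    printf("cells crossing the surface: %d\n", active);
    cudaFree(cases);
    return 0;
}
```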
84

Parallel Construction of Local Clearance Triangulations

Gummesson, Simon, Johnson, Mikael January 2019 (has links)
The usage of navigation meshes for path planning in games and other domains is a common approach. One type of navigation mesh that has recently been developed is the Local Clearance Triangulation (LCT). The overall aim of the LCT is to construct a triangulation in such a way that a property called the Local Clearance can be used to calculate a path in a more efficient and cheaper way. At the time of writing the thesis there exists only one solution that creates an LCT, and this solution uses only the CPU. Since the process of creating an LCT involves the insertion of many points and edge flips which only affect a local area, it would be interesting to investigate the potential performance gain of using the GPU. Objectives. The objective of the thesis is to develop a GPU version based on the current CPU LCT solution and to investigate in which cases the proposed GPU algorithm performs better. Methods. A GPU version and a CPU version of the proposed algorithm have been developed to measure the performance gain of using the GPU; there are no algorithmic differences between these versions. To measure the performance of the algorithm, two tests have been constructed: the first test, called the Object Insertion test, measures the time it takes to build an LCT using generated test maps; the second test, called the Internal test, measures the internal performance of the algorithm. A comparison between the GPU algorithm and an LCT library called Triplanner was also done. Results. The proposed algorithm performed better on larger maps when implemented on a GPU compared to a CPU implementation of the algorithm. The GPU performance compared to the Triplanner was faster in some of the larger maps. Conclusions. An algorithm that builds an LCT from scratch is presented. The results show that using the proposed algorithm on the GPU substantially increases the performance of the algorithm compared to implementing it on a CPU.
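The LCT construction is built from local operations (point insertions and edge flips), which is what makes a GPU formulation attractive. As a rough illustration of how such a pass can be parallelized, the sketch below flags, one thread per internal edge, the edges that fail the classical Delaunay incircle test. The predicate is a stand-in chosen for the example; the actual LCT refinement uses clearance-based criteria and a conflict-handling scheme that are not shown here.

```cuda
#include <cstdio>

struct Edge {
    // Endpoints of the edge and the two opposite vertices of the triangles
    // sharing it, as indices into the point array: triangles (a,b,c) and (b,a,d).
    int a, b, c, d;
};

// Classical incircle predicate: > 0 when d lies strictly inside the
// circumcircle of the counter-clockwise triangle (a, b, c).
__device__ float incircle(float2 a, float2 b, float2 c, float2 d)
{
    float ax = a.x - d.x, ay = a.y - d.y;
    float bx = b.x - d.x, by = b.y - d.y;
    float cx = c.x - d.x, cy = c.y - d.y;
    float al = ax * ax + ay * ay;
    float bl = bx * bx + by * by;
    float cl = cx * cx + cy * cy;
    return ax * (by * cl - bl * cy)
         - ay * (bx * cl - bl * cx)
         + al * (bx * cy - by * cx);
}

// One thread per internal edge: flag edges whose quad fails the test and
// therefore need flipping. A real pass would then flip non-conflicting
// edges and iterate until no flags remain.
__global__ void flag_edges(const float2* pts, const Edge* edges, int* flip, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Edge e = edges[i];
    flip[i] = incircle(pts[e.a], pts[e.b], pts[e.c], pts[e.d]) > 0.0f ? 1 : 0;
}

int main()
{
    float2 h_pts[4] = {{0.0f, 0.0f}, {1.0f, 0.0f}, {0.5f, 1.0f}, {0.5f, -0.2f}};
    Edge   h_edge  = {0, 1, 2, 3};  // edge (0,1) shared by triangles (0,1,2) and (1,0,3)

    float2* pts;  Edge* edges;  int* flip;
    cudaMalloc(&pts, sizeof(h_pts));
    cudaMalloc(&edges, sizeof(h_edge));
    cudaMalloc(&flip, sizeof(int));
    cudaMemcpy(pts, h_pts, sizeof(h_pts), cudaMemcpyHostToDevice);
    cudaMemcpy(edges, &h_edge, sizeof(h_edge), cudaMemcpyHostToDevice);

    flag_edges<<<1, 32>>>(pts, edges, flip, 1);
    int h_flip;
    cudaMemcpy(&h_flip, flip, sizeof(int), cudaMemcpyDeviceToHost);
    printf("edge needs flip: %d\n", h_flip);

    cudaFree(pts); cudaFree(edges); cudaFree(flip);
    return 0;
}
```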
86

On the Enhancement of Remote GPU Virtualization in High Performance Clusters

Reaño González, Carlos 01 September 2017 (has links)
Graphics Processing Units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general-purpose applications from different domains. However, GPUs also present several side effects, such as increased acquisition costs as well as larger space requirements. They also require more powerful energy supplies. Furthermore, GPUs still consume some amount of energy while idle, and their utilization is usually low for most workloads. In a similar way to virtual machines, the use of virtual GPUs may address the aforementioned concerns. In this regard, the remote GPU virtualization mechanism allows an application executing on one node of the cluster to transparently use the GPUs installed at other nodes. Moreover, this technique allows the GPUs present in the computing facility to be shared among the applications executed in the cluster. In this way, several applications running on different (or the same) cluster nodes can share one or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the side effects mentioned before. Reducing the total number of GPUs installed in the cluster may also be possible. In this dissertation we enhance a framework offering remote GPU virtualization capabilities, referred to as rCUDA, for its use in high-performance clusters. While the initial prototype version of rCUDA demonstrated its functionality, it also revealed concerns with respect to usability, performance, and support for new GPU features, which prevented its use in production environments. These issues motivated this thesis, in which all the research is primarily conducted with the aim of turning rCUDA into a production-ready solution for eventually transferring it to industry. The new version of rCUDA resulting from this work presents a reduction of up to 35% in execution time of the applications analyzed with respect to the initial version. Compared to the use of local GPUs, the overhead of this new version of rCUDA is below 5% for the applications studied when using the latest high-performance computing networks available. / Reaño González, C. (2017). On the Enhancement of Remote GPU Virtualization in High Performance Clusters [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86219
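A key property of remote GPU virtualization is that it is transparent at the API level: the program below is ordinary CUDA code with no rCUDA-specific calls. Under rCUDA, as described in the abstract (deployment details are assumed here, not taken from the thesis), the client library stands in for the regular CUDA runtime and forwards each call over the cluster network to a server process on the node that hosts the physical GPU, so the same source and binary run unchanged whether the GPU is local or remote.

```cuda
#include <cstdio>

// An ordinary CUDA program: no rCUDA-specific calls appear anywhere.
// Under remote GPU virtualization, the rCUDA client library is loaded in
// place of the regular CUDA runtime library and forwards each of these API
// calls to the node that actually hosts the GPU; the code stays unchanged.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;

    cudaMalloc(&x, n * sizeof(float));   // allocated on the (possibly remote) GPU
    cudaMalloc(&y, n * sizeof(float));

    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    cudaMemcpy(x, hx, n * sizeof(float), cudaMemcpyHostToDevice);  // travels over the
    cudaMemcpy(y, hy, n * sizeof(float), cudaMemcpyHostToDevice);  // network under rCUDA

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);

    cudaMemcpy(hy, y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("hy[0] = %f (expected 4.0)\n", hy[0]);

    cudaFree(x); cudaFree(y);
    delete[] hx; delete[] hy;
    return 0;
}
```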
87

A Real-Time Predictive Vehicular Collision Avoidance System on an Embedded General-Purpose GPU

Hegman, Andrew 10 August 2018 (has links)
Collision avoidance is an essential capability for autonomous and assisted-driving ground vehicles. In this work, we developed a novel model predictive control based intelligent collision avoidance (CA) algorithm for a multi-trailer industrial ground vehicle implemented on a General Purpose Graphics Processing Unit (GPGPU). The CA problem is formulated as a multi-objective optimal control problem and solved using a limited look-ahead control scheme in real time. Through hardware-in-the-loop simulations and experimental results obtained in this work, we have demonstrated that the proposed algorithm, using NVIDIA's CUDA framework and the NVIDIA Jetson TX2 development platform, is capable of dynamically assisting drivers and maintaining the vehicle at a safe distance from detected obstacles on the fly. We have demonstrated that a GPGPU, paired with an appropriate algorithm, can be the key enabler in relieving the computational burden that is commonly associated with model-based control problems, thus making them suitable for real-time applications.
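The abstract does not give the controller's internals, so the following CUDA sketch is only an illustration of the general GPGPU pattern that limited look-ahead, optimization-based controllers exploit: evaluating many candidate control sequences in parallel, one thread per candidate, by rolling a simple kinematic model forward over a short horizon and scoring each rollout against obstacle clearance and a reference speed. The unicycle model, horizon length, cost weights, and constant-control candidates are all simplifying assumptions for the example, not the thesis's formulation.

```cuda
#include <cstdio>

#define HORIZON 20      // look-ahead steps
#define DT      0.1f    // step length in seconds

struct State { float x, y, heading, speed; };

// One thread per candidate (steering rate, acceleration) pair, held constant
// over the horizon for simplicity. Each thread rolls a unicycle model forward
// and accumulates a cost penalizing closeness to the obstacle and deviation
// from the desired speed.
__global__ void evaluate_candidates(State start,
                                    float2 obstacle, float safe_dist, float desired_speed,
                                    const float* steer, const float* accel,
                                    float* cost, int n_candidates)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_candidates) return;

    State s = start;
    float c = 0.0f;
    for (int t = 0; t < HORIZON; ++t) {
        s.heading += steer[i] * DT;
        s.speed   += accel[i] * DT;
        s.x       += s.speed * cosf(s.heading) * DT;
        s.y       += s.speed * sinf(s.heading) * DT;

        float dx = s.x - obstacle.x, dy = s.y - obstacle.y;
        float dist = sqrtf(dx * dx + dy * dy);
        if (dist < safe_dist) c += 100.0f * (safe_dist - dist);      // collision penalty
        c += (s.speed - desired_speed) * (s.speed - desired_speed);  // tracking term
    }
    cost[i] = c;
}

int main()
{
    const int n = 1024;                  // candidate control pairs
    float *steer, *accel, *cost;
    cudaMallocManaged(&steer, n * sizeof(float));
    cudaMallocManaged(&accel, n * sizeof(float));
    cudaMallocManaged(&cost,  n * sizeof(float));

    // Coarse grid over steering rate and acceleration.
    for (int i = 0; i < n; ++i) {
        steer[i] = -0.5f + (i % 32) * (1.0f / 31.0f);   // rad/s in [-0.5, 0.5]
        accel[i] = -2.0f + (i / 32) * (4.0f / 31.0f);   // m/s^2 in [-2, 2]
    }

    State start = {0.0f, 0.0f, 0.0f, 5.0f};
    float2 obstacle = {8.0f, 0.5f};
    evaluate_candidates<<<(n + 255) / 256, 256>>>(start, obstacle, 2.0f, 5.0f,
                                                  steer, accel, cost, n);
    cudaDeviceSynchronize();

    int best = 0;
    for (int i = 1; i < n; ++i) if (cost[i] < cost[best]) best = i;
    printf("best steer=%.3f rad/s, accel=%.3f m/s^2, cost=%.2f\n",
           steer[best], accel[best], cost[best]);

    cudaFree(steer); cudaFree(accel); cudaFree(cost);
    return 0;
}
```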
88

Millipyde: A Cross-Platform Python Framework for Transparent GPU Acceleration

Asbury, James B 01 December 2021 (has links) (PDF)
General-purpose GPU computing continues to grow in prevalence and to tackle a wider variety of problems that benefit from GPU acceleration. This acceleration often suffers from a high barrier to entry, however, due to the complexity of software tools that closely map to the underlying GPU hardware, the fast-changing landscape of GPU environments, and the fragmentation of tools and languages that only support specific platforms. Because of this, new solutions will continue to be needed to make GPGPU acceleration more accessible to the developers who can benefit from it. AMD's new cross-platform development ecosystem ROCm provides promise for developing applications and solutions that work across systems running both AMD and non-AMD GPU computing hardware. This thesis presents Millipyde, a framework for GPU acceleration in Python using AMD's ROCm. Millipyde includes two new types, the gpuarray and gpuimage, as well as three new constructs for building GPU-accelerated applications: the Operation, Pipeline, and Generator. Using these tools, Millipyde hopes to make it easier for engineers and researchers to write GPU-accelerated code in Python. Millipyde also has the potential to schedule work across many GPUs in complex multi-device environments. These capabilities are demonstrated in a sample application of augmenting images on-device for machine learning applications. Our results showed that Millipyde is capable of making individual image-related transformations up to around 200 times faster than their CPU-only equivalents. Constructs such as Millipyde's Pipeline were also able to further improve performance in certain situations, performing best when allowed to transparently schedule work across multiple devices.
89

An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs

Liu, Lifeng 31 May 2017 (has links)
No description available.
90

Optimal Loop Unrolling for GPGPU Programs

Sreenivasa Murthy, Giridhar 30 September 2009 (has links)
No description available.
