251 |
Interaktivní simulace chování tkaniny akcelerovaná pomocí GPU / Interactive Cloth Simulation Accelerated by GPU
Melichar, Vojtěch January 2016
This master's thesis deals with interactive cloth simulation accelerated by the GPU. The first part describes the technologies used to implement the program. The second part discusses various simulation methods, focusing mainly on particle systems as the most widely used method. These parts are followed by the design of the program implemented as part of this thesis. The program was implemented in four variants. The first variant is a CPU implementation, which was then optimized with OpenMP. The CUDA implementation is based on these two implementations. The last variant is an optimized CUDA implementation. All of these implementations are evaluated in terms of computational complexity and suitability for real-time graphics.
|
252 |
Implementing and Comparing Static and Machine-Learning scheduling Approaches using DPDK on an Integrated CPU/GPU
Johansson, Markus, Pap, Oscar January 2019
As 5G gets closer to being commercially available, the base stations processing this traffic must be improved to handle the increase in traffic and the demand for lower latencies. By utilizing the hardware more intelligently, the processing of data can be accelerated in, for example, the forwarding plane, where baseband and encryption are common tasks. With this in mind, systems with integrated GPUs become interesting for their additional processing power and because they do not need PCIe buses. This thesis aims to implement the DPDK framework on the Nvidia Jetson Xavier system and to investigate whether a scheduler based on the theoretical properties of each platform is better than a self-exploring machine-learning scheduler based on packet latency and throughput, and how both compare against a simple round-robin scheduler. It also examines whether a more flexible scheduler with more overhead is more beneficial than a more static scheduler with less overhead. The conclusion is that processing and scheduling on an integrated system pose a number of challenges. The main challenges were effective batch aggregation at low traffic rates and the way different processes affect each other.
|
253 |
Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs
Abdelfattah, Ahmad 15 January 2015
High performance computing (HPC) platforms are evolving to more heterogeneous configurations to support the workloads of various applications. The current hardware landscape is composed of traditional multicore CPUs equipped with hardware accelerators that can handle high levels of parallelism. Graphical Processing Units (GPUs) are popular high performance hardware accelerators in modern supercomputers. GPU programming has a different model than that for CPUs, which means that many numerical kernels have to be redesigned and optimized specifically for this architecture. GPUs usually outperform multicore CPUs in some compute intensive and massively parallel applications that have regular processing patterns. However, most scientific applications rely on crucial memory-bound kernels and may witness bottlenecks due to the overhead of the memory bus latency. They can still take advantage of the GPU compute power capabilities, provided that an efficient architecture-aware design is achieved.
This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs. Based on hierarchical register blocking, double buffering and latency hiding techniques, this strategy improves the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries. The work presented here focuses on matrix-vector multiplication (MVM) kernels as representative, and the most important, memory-bound operations in this context. Each kernel inherits the benefits of the proposed strategies. By exposing a proper set of tuning parameters, the strategy is flexible enough to suit different types of matrices, ranging from large dense matrices to sparse matrices with dense block structures, while maintaining high performance. Furthermore, the tuning parameters are used to maintain the relative performance across different GPU architectures. Multi-GPU acceleration is proposed to scale the performance across several devices. The performance experiments show improvements ranging from 10% to more than a fourfold speedup against competitive GPU MVM approaches. Performance impacts on high-level numerical libraries and a computational astronomy application are highlighted, since such memory-bound kernels are often located in the innermost levels of the software chain. The excellent performance obtained in this work has led to the adoption of code in NVIDIA's widely distributed cuBLAS library.
|
254 |
Techniques for Managing Irregular Control Flow on GPUs
Jad Hbeika (5929730) 25 June 2020
<p>GPGPU is a highly multithreaded throughput architecture that can deliver high speed-up for regular applications while remaining energy efficient. In recent years, there has been much focus on tuning irregular applications and/or the GPU architecture to achieve similar benefits for irregular applications, as well as efforts to extract data parallelism from task-parallel applications. In this work we tackle both problems.</p><p>The first part of this work tackles the problem of control divergence in GPUs. GPGPUs’ SIMT execution model is ineffective for workloads with irregular control flow because GPGPUs serialize the execution of divergent paths, which leads to a loss of thread-level parallelism (TLP). Previous works focused on creating new warps based on the control path that threads follow, created different warps for the different paths, or ran multiple narrower warps in parallel. While all previous solutions showed speedup for irregular workloads, they imposed some performance loss on regular workloads. In this work we propose a more fine-grained approach to exploit <i>intra-warp</i> convergence: rather than threads executing the same code path, <i>opcode-convergent threads</i> execute the same instruction, but with potentially different operands. Based on this new definition we find that divergent control blocks within a warp exhibit substantial opcode convergence. We build a compiler that analyzes divergent blocks and identifies the common streams of opcodes. We modify the GPU architecture so that these common instructions are executed as convergent instructions. Using software simulation, we achieve a 17% speedup over baseline GPGPU for irregular workloads and do not incur any performance loss on regular workloads.</p><p>In the second part we suggest techniques for extracting data parallelism from irregular, task-parallel applications in order to take advantage of the massive parallelism provided by the GPU. Our technique involves dividing each task into multiple sub-tasks, each performing less work and touching a smaller memory footprint. Our framework performs locality-aware scheduling that works to minimize the memory footprint of each warp (a set of threads executing in lock-step). We evaluate our framework with 3 task-parallel benchmarks and show that we can achieve significant speedups over optimized GPU code.</p>
|
255 |
Hierarchické shlukování s Mahalanobis-average metrikou akcelerované na GPU / GPU-accelerated Mahalanobis-average hierarchical clustering
Šmelko, Adam January 2020
Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on the Mahalanobis distance to produce results better suited to the domain. The applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This thesis describes a specialized, GPU-accelerated version of Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by several orders of magnitude and thus allows it to scale to much larger datasets. The thesis provides an overview of current hierarchical clustering algorithms and details the construction of the variant used on the GPU. The result is benchmarked on publicly available high-dimensional data from mass cytometry.
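To illustrate the distance on which the linkage above is based, here is a minimal Python sketch of the Mahalanobis distance from a point to a cluster. It is restricted to 2D so that the covariance matrix can be inverted by hand. The function names and sample data are illustrative assumptions, not taken from the thesis, and neither the thesis's averaging linkage nor its GPU kernels are reproduced here.

```python
import math

def mean_and_cov_2d(points):
    """Return the mean and the 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance sqrt(d^T C^-1 d) for a 2x2 covariance matrix C."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # C^-1 = 1/det * [[d, -b], [-c, a]], expanded into the quadratic form.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Hypothetical cluster that is twice as wide in x as in y.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
mean, cov = mean_and_cov_2d(cluster)
far_in_x = mahalanobis_2d((3.0, 0.5), mean, cov)   # Euclidean distance 2 from the mean
near_in_y = mahalanobis_2d((1.0, 1.5), mean, cov)  # Euclidean distance 1 from the mean
```

In this example the two query points lie at different Euclidean distances from the cluster mean but at the same Mahalanobis distance, because the distance is scaled by the cluster's spread along each axis.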
|
256 |
Parallelizing Map Projection of Raster Data on Multi-core CPU and GPU Parallel Programming Frameworks / Parallellisering av kartprojektion av rasterdata på flerkärniga CPU- och GPU-programmeringsramverk
Chavez, Daniel January 2016
Map projections lie at the core of geographic information systems, and numerous projections are in use today. Reprojection between different map projections is a recurring operation in a geographic information system, and it can be parallelized with multi-core CPUs and GPUs. This thesis implements a parallel analytic reprojection algorithm for raster data in C/C++ with the parallel programming frameworks Pthreads, C++11 STL threads, OpenMP, Intel TBB, CUDA and OpenCL. The thesis compares the execution times of the different implementations on small, medium and large raster data sets, where OpenMP had the best speedup of 6, 6.2 and 5.5, respectively. Meanwhile, the GPU implementations were 293% faster than the fastest CPU implementations, and profiling shows that the CPU implementations spend most of their time in trigonometric functions. The results show that the reprojection algorithm is well suited to the GPU, while OpenMP and Intel TBB are the fastest of the CPU frameworks.
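As a rough illustration of analytic raster reprojection, the following Python sketch resamples a raster stored with evenly spaced latitudes into a spherical Mercator raster using nearest-neighbour lookup. The choice of projections, the function names and the 85-degree latitude limit are assumptions made for this example; the thesis's own C/C++ implementations are not reproduced. Each output row is computed independently of the others, which is the property that makes this kind of work easy to spread over CPU threads or GPU threads.

```python
import math

def mercator_y(lat_rad):
    """Forward spherical Mercator: latitude in radians to projected y."""
    return math.log(math.tan(math.pi / 4 + lat_rad / 2))

def inverse_mercator_lat(y):
    """Inverse spherical Mercator: projected y to latitude in radians."""
    return 2 * math.atan(math.exp(y)) - math.pi / 2

def reproject(src, out_height, max_lat_deg=85.0):
    """Resample src (rows evenly spaced in latitude, north at the top)
    into rows evenly spaced in Mercator y, using nearest-neighbour lookup."""
    src_height = len(src)
    max_lat = math.radians(max_lat_deg)
    y_max = mercator_y(max_lat)
    out = []
    for row in range(out_height):  # rows are independent, so this loop could run in parallel
        # Mercator y at the centre of this output row.
        y = y_max - (row + 0.5) * (2 * y_max / out_height)
        lat = inverse_mercator_lat(y)
        # Source row whose latitude band contains this latitude.
        src_row = int((max_lat - lat) / (2 * max_lat) * src_height)
        src_row = min(max(src_row, 0), src_height - 1)
        out.append(list(src[src_row]))
    return out

# Hypothetical 8-row source raster in which every pixel stores its row index.
source = [[r, r] for r in range(8)]
result = reproject(source, out_height=4)
```

Longitude maps linearly in both projections, so this sketch only has to remap rows; a general reprojection would evaluate the inverse transform, including its trigonometric calls, for every output pixel.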
|
257 |
An Experimental Fast Approach of Self-collision Handling in Cloth Simulation Using GPU
Jichun Zheng (10719285) 01 June 2021
<p>This study describes a fast GPU-based approach to processing self-collision in cloth animation without significant compromise in physical accuracy. The approach is built on, and works effectively with, a modification of the mass-spring model that appears in a variety of cloth simulation studies. Instead of using a hierarchical data structure that must be updated every frame, it adopts a spatial hashing technique that virtually partitions the space in which the cloth object is located into small cubic cells and stores the particles held in each cell in an integer array. With the particle data and the per-cell particle information, self-collision detection can be processed at a very limited cost in each thread launched on the GPU, regardless of increases in the number of particles. The method can visualize self-collision detection and response in real time with limited memory-access cost on the GPU.</p>
<p>The idea behind the proposed approach is extremely straightforward; however, its weakness is the amount of memory it consumes. The method also sacrifices physical accuracy in exchange for performance.</p>
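The spatial hashing idea described above can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions: the function names, the cell size of twice the particle radius, and the use of a dictionary instead of the integer array mentioned in the abstract are all choices made for this example, and the study's GPU implementation is not reproduced.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(p, cell_size):
    """Integer coordinates of the cubic cell that contains point p."""
    return tuple(math.floor(c / cell_size) for c in p)

def build_grid(particles, cell_size):
    """Map each occupied cell to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, p in enumerate(particles):
        grid[cell_of(p, cell_size)].append(i)
    return grid

def colliding_pairs(particles, radius):
    """Return index pairs (i, j), i < j, whose centres are closer than 2 * radius."""
    # With cell_size = 2 * radius, any colliding partner lies in the same
    # cell or in one of the 26 neighbouring cells.
    cell_size = 2 * radius
    grid = build_grid(particles, cell_size)
    pairs = set()
    for i, p in enumerate(particles):  # on a GPU, one thread would handle one particle
        cx, cy, cz = cell_of(p, cell_size)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if j > i and math.dist(p, particles[j]) < 2 * radius:
                    pairs.add((i, j))
    return pairs

# Hypothetical particle positions; index 2 is far away from the rest.
points = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0), (1.0, 1.0, 1.0), (-0.1, 0.0, 0.0)]
hits = colliding_pairs(points, radius=0.1)
```

Each particle only inspects a fixed number of cells, so the work per particle depends on the local particle density rather than on the total particle count.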
|
258 |
GPU Accelerated Framework for Cryogenic Electron Tomography using Proximal Algorithms
Rey Ramirez, Julio A. 04 1900
Cryogenic electron tomography provides visualization of cellular complexes in situ, allowing a further understanding of cellular function. However, the projection images from this technique have a very low signal-to-noise ratio due to the limited electron dose, and the lack of projections at high tilt angles produces the 'missing-wedge' problem in the Fourier domain. These limitations in the projection data prevent traditional reconstruction techniques from achieving good reconstructions. Multiple strategies have been proposed to deal with the noise and the artifacts arising from the 'missing-wedge' problem. Examples include manually selecting subtomograms of identical structures and averaging them (subtomogram averaging), data-driven approaches that aim to perform subtomogram averaging automatically, and various methods for denoising tilt-series before reconstruction or denoising the volumes after reconstruction. Most of these approaches are additional pre-processing or post-processing steps independent of the reconstruction method, and the consistency of the resulting tomograms with the original projection data is lost after the modifications. We propose a GPU-accelerated, optimization-based reconstruction framework using proximal algorithms. Our framework integrates denoising into the reconstruction process by alternating between reconstruction and denoising, relieving users of the need to select additional denoising algorithms and preserving the consistency between the final tomograms and the projection data. Thanks to the flexibility provided by proximal algorithms, various available proximal operators can be interchanged for each task, e.g., various algebraic reconstruction methods and denoising techniques. We evaluate our approach qualitatively by comparison with current reconstruction and denoising approaches, showing excellent denoising capabilities and superior visual quality of the reconstructed tomograms. We quantitatively evaluate the methods with a recently proposed synthetic dataset for scanning transmission electron microscopy, achieving superior reconstruction quality for a noisy and angle-limited synthetic dataset.
|
259 |
Parallel Construction of Local Clearance Triangulations
Gummesson, Simon, Johnson, Mikael Unknown Date
The use of navigation meshes for path planning in games and other domains is a common approach. One type of navigation mesh that has recently been developed is the Local Clearance Triangulation (LCT). The overall aim of the LCT is to construct a triangulation in such a way that a property called the Local Clearance can be used to calculate a path more efficiently and cheaply. At the time of writing the thesis, only one solution exists that creates an LCT, and it uses only the CPU. Since the process of creating an LCT involves the insertion of many points and edge flips that affect only a local area, it would be interesting to investigate the potential performance gain of using the GPU.
Objectives. The objective of the thesis is to develop a GPU version based on the current CPU LCT solution and to investigate in which cases the proposed GPU algorithm performs better.
Methods. A GPU version and a CPU version of the proposed algorithm have been developed to measure the performance gain of using the GPU; there are no algorithmic differences between these versions. To measure the performance of the algorithm, two tests have been constructed. The first test is called the Object Insertion test and measures the time it takes to build an LCT using generated test maps. The second test is called the Internal test and measures the internal performance of the algorithm. The GPU algorithm was also compared with an LCT library called Triplanner.
Results. The proposed algorithm performed better on larger maps when implemented on a GPU than when implemented on a CPU. The GPU version was faster than Triplanner on some of the larger maps.
Conclusions. An algorithm that builds an LCT from scratch is presented. The results show that running the proposed algorithm on the GPU substantially increases its performance compared to implementing it on a CPU.
|
260 |
Parallel Construction of Local Clearance Triangulations
Gummesson, Simon, Johnson, Mikael January 2019
The use of navigation meshes for path planning in games and other domains is a common approach. One type of navigation mesh that has recently been developed is the Local Clearance Triangulation (LCT). The overall aim of the LCT is to construct a triangulation in such a way that a property called the Local Clearance can be used to calculate a path more efficiently and cheaply. At the time of writing the thesis, only one solution exists that creates an LCT, and it uses only the CPU. Since the process of creating an LCT involves the insertion of many points and edge flips that affect only a local area, it would be interesting to investigate the potential performance gain of using the GPU.
Objectives. The objective of the thesis is to develop a GPU version based on the current CPU LCT solution and to investigate in which cases the proposed GPU algorithm performs better.
Methods. A GPU version and a CPU version of the proposed algorithm have been developed to measure the performance gain of using the GPU; there are no algorithmic differences between these versions. To measure the performance of the algorithm, two tests have been constructed. The first test is called the Object Insertion test and measures the time it takes to build an LCT using generated test maps. The second test is called the Internal test and measures the internal performance of the algorithm. The GPU algorithm was also compared with an LCT library called Triplanner.
Results. The proposed algorithm performed better on larger maps when implemented on a GPU than when implemented on a CPU. The GPU version was faster than Triplanner on some of the larger maps.
Conclusions. An algorithm that builds an LCT from scratch is presented. The results show that running the proposed algorithm on the GPU substantially increases its performance compared to implementing it on a CPU.
|