1

Accelerating Parallel Tasks by Optimizing GPU Hardware Utilization

Tsung-Tai Yeh (8775680) 29 April 2020
Efficient GPU applications rely on programmers carefully structuring their code to fully utilize GPU resources. In general, programmers spend a significant amount of time optimizing their applications to run efficiently on domain-specific architectures. To reduce this burden, I create several hardware and software solutions that improve resource utilization on parallel processors without significant programmer intervention.

GPUs are increasingly deployed in data centers to accelerate latency-driven applications, which exhibit only a modest amount of data parallelism. Synchronous kernel execution in these applications cannot fully utilize the GPU. A GPU therefore provides multiple hardware queues to improve throughput by executing multiple kernels on a single device simultaneously when sufficient hardware resources are available. However, the GPU becomes severely underutilized once the space in these queues is exhausted, and the performance benefit vanishes as parallelism decreases. To address this, I propose Pagoda, a GPU runtime system that virtualizes the GPU hardware resources using an OS-like daemon kernel called the MasterKernel. Tasks (kernels) are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity to increase GPU throughput for latency-driven applications. This work introduces programming APIs for task spawning and synchronization, together with parallel-task and warp-scheduling policies that reduce runtime overhead.

Latency-driven applications have both high throughput demands and response-time constraints. They may launch many kernels that do not fully utilize the GPU unless grouped into large batches, but batching forces jobs to wait, which increases their latency; this wait can be unacceptable given real-world job arrival times. Moreover, the round-robin GPU kernel scheduler is oblivious to application deadlines, and this deadline-blind scheduling policy makes it harder to ensure that kernels meet their QoS deadlines. To enhance GPU responsiveness, I also propose LAX, which estimates the execution time of jobs composed of one or many kernels and dynamically adjusts kernel priorities based on their slack time, increasing the number of jobs that complete by their real-time deadlines. LAX improves both the responsiveness and the throughput of GPUs.

It is well known that grouping threads into warps can create redundancy across scalar values in GPU vector registers. I also found that the layout of thread indices in multi-dimensional threadblocks (TBs) creates redundancy in the registers storing thread IDs, and that this redundancy propagates into dependent instructions that can be traced and identified statically. To remove these redundant instructions, I propose DARSIE, which uses a per-kernel compiler finalization check based on TB dimensions to determine which instructions are redundant. Once they are identified, DARSIE hardware skips TB-redundant instructions before they are fetched. DARSIE uses a new multithreaded register renaming and instruction synchronization technique to share the values produced by redundant instructions among the warps in each TB. Altogether, DARSIE decreases the number of executed instructions, improving GPU performance and energy efficiency.
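To make the warp-granularity scheduling idea concrete, the following is a minimal sketch of a persistent daemon kernel that claims small tasks from a device-resident queue one warp at a time, in the spirit of Pagoda's MasterKernel. The task structure, queue layout, and all names are illustrative assumptions, not Pagoda's actual interface.

// Illustrative sketch only: a persistent "daemon" kernel that schedules small
// tasks at warp granularity. All names, data structures, and policies here are
// hypothetical simplifications of the approach described above.
#include <cuda_runtime.h>

#define MAX_TASKS 1024

struct Task { float *data; int n; };      // one small, independent task

__device__ Task g_tasks[MAX_TASKS];       // filled by the host (cudaMemcpyToSymbol) before launch
__device__ int  g_ntasks = 0;             // number of valid tasks
__device__ int  g_next   = 0;             // next task index to claim

// Each warp repeatedly claims one task and executes it, so many small kernels
// share the GPU without paying a per-kernel launch and scheduling cost.
__global__ void master_kernel()
{
    const unsigned lane = threadIdx.x & 31;
    for (;;) {
        int t = 0;
        if (lane == 0)                        // warp leader grabs a task id
            t = atomicAdd(&g_next, 1);
        t = __shfl_sync(0xffffffffu, t, 0);   // broadcast task id to the whole warp
        if (t >= g_ntasks) return;            // queue drained (the real runtime would
                                              // keep polling for new CPU-spawned tasks)
        Task task = g_tasks[t];
        for (int i = lane; i < task.n; i += 32)
            task.data[i] *= 2.0f;             // placeholder for the task body
    }
}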
2

Um estudo do uso eficiente de programas em placas gráficas / A case study on the efficient use of programs on GPUs

Ikeda, Patricia Akemi 20 September 2011
Initially designed for graphics processing, graphics cards (GPUs) have evolved into high-performance, general-purpose parallel coprocessors. Because of the enormous potential they offer to many research and commercial areas, NVIDIA pioneered the CUDA architecture (compatible with several of its cards), an environment that exposes this computational power while making programming easier. To exploit the full capacity of a GPU, certain practices must be followed, one of which is keeping the hardware as busy as possible. This work proposes a practical and extensible tool that helps the programmer choose the launch configuration that best achieves this goal.
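The goal of keeping the hardware as busy as possible can be illustrated with the standard CUDA occupancy API; this sketch is not the thesis's tool, only an example of the kind of launch-configuration calculation such a tool automates. The kernel and sizes are assumptions.

// Sketch of occupancy-driven launch configuration using the CUDA runtime
// occupancy API; kernel and problem size are illustrative.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for saxpy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n elements
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);
    // saxpy<<<gridSize, blockSize>>>(n, 2.0f, d_x, d_y);   // launch with the chosen configuration
    return 0;
}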
3

Um estudo do uso eficiente de programas em placas gráficas / A case study on the efficient use of programs on GPUs

Patricia Akemi Ikeda 20 September 2011
Initially designed for graphics processing, graphics cards (GPUs) have evolved into high-performance, general-purpose parallel coprocessors. Because of the enormous potential they offer to many research and commercial areas, NVIDIA pioneered the CUDA architecture (compatible with several of its cards), an environment that exposes this computational power while making programming easier. To exploit the full capacity of a GPU, certain practices must be followed, one of which is keeping the hardware as busy as possible. This work proposes a practical and extensible tool that helps the programmer choose the launch configuration that best achieves this goal.
4

GPGPU-Sim / A study on GPGPU-Sim

Andersson, Filip January 2014
This thesis studies the impact of graphics-card hardware features on GPU computing performance using the GPGPU-Sim simulation tool. GPU computing is a growing area of computing and could be an important milestone for computers. A study that identifies a program's performance bottlenecks with respect to the device's hardware parameters is therefore an important step towards tuning devices for higher efficiency. In this work we selected a convolution algorithm - a typical GPGPU application - and conducted several tests to study different performance parameters. These tests were performed on two simulated graphics cards (NVIDIA GTX480, NVIDIA Tesla C2050), both supported by GPGPU-Sim. By changing hardware parameters of the graphics card, such as cache sizes, clock frequency, and the number of cores, we can perform a fine-grained analysis of the effect of these parameters on program performance. A graphics card working on an image convolution task relies on the L1 cache and performs worst with a small shared memory. Using this simulator to run performance tests on a theoretical GPU architecture could lead to better GPU designs for embedded systems.
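A minimal tiled convolution kernel of the kind studied here, showing the shared-memory staging whose size the abstract identifies as performance-critical. The tile size, 3x3 box filter, and names are illustrative assumptions, not the benchmark used in the thesis.

// Sketch of a shared-memory tiled 3x3 convolution (box filter stands in for
// real coefficients); launched with block dim (TILE+2*R, TILE+2*R).
#include <cuda_runtime.h>

#define TILE 16            // output tile computed per thread block
#define R    1             // filter radius (3x3 filter)

__global__ void conv3x3(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE + 2*R][TILE + 2*R];

    int gx = blockIdx.x * TILE + threadIdx.x - R;   // global coords including halo
    int gy = blockIdx.y * TILE + threadIdx.y - R;

    // Stage the tile plus halo in shared memory (clamped at image borders).
    int cx = min(max(gx, 0), w - 1);
    int cy = min(max(gy, 0), h - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * w + cx];
    __syncthreads();

    // Interior threads each compute one output pixel from the staged tile.
    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R &&
        gx >= 0 && gy >= 0 && gx < w && gy < h) {
        float acc = 0.0f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                acc += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[gy * w + gx] = acc / 9.0f;
    }
}
// Whether shared memory or the L1 cache serves this reuse pattern best is
// exactly the kind of question GPGPU-Sim lets one explore per architecture.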
5

Multi-GPU Load Balancing for Simulation and Rendering

Hagan, Robert Douglas 04 August 2011
GPU computing can significantly improve performance by exploiting the massive parallelism of GPUs for data-parallel applications. Computation in visualization applications is well suited to parallelization on the GPU, which can improve their performance and interactivity. If used effectively, multiple GPUs can yield a significant speedup over a single GPU. However, using multiple GPUs requires memory management, scheduling, and load balancing to ensure that a program takes full advantage of the available processors. This work presents methods for data-driven and dynamic multi-GPU load balancing using a pipelined approach and a framework applicable to different applications. Data-driven load balancing can improve utilization by taking into account past performance for different combinations of input parameters. The dynamic load balancing method based on buffer fullness can adjust to workload changes at runtime to gain an additional performance improvement. The framework accounts for the differing characteristics of applications, and a multi-GPU data structure allows these load balancing methods to be used within it. The effectiveness of the framework is demonstrated with performance results from interactive visualization, which show a significant speedup due to load balancing. / Master of Science
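A host-side sketch of the buffer-fullness idea: each new chunk of work is dispatched to the GPU whose queue currently looks emptiest. The bookkeeping, the placeholder kernel, and all names are assumptions for illustration, not the framework described in the thesis.

// Sketch of dynamic, buffer-fullness-based load balancing across multiple GPUs.
#include <cuda_runtime.h>
#include <vector>

__global__ void process_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;   // placeholder work
}

struct GpuWorker {
    int          device;
    cudaStream_t stream;
    int          pending;   // chunks issued but not yet known to be complete
    float       *buffer;    // device-resident chunk buffer (reused for each chunk)
};

int main()
{
    const int chunk = 1 << 18, nChunks = 64;
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    std::vector<GpuWorker> workers(nDev);
    for (int d = 0; d < nDev; ++d) {
        workers[d].device = d;
        workers[d].pending = 0;
        cudaSetDevice(d);
        cudaStreamCreate(&workers[d].stream);
        cudaMalloc(&workers[d].buffer, chunk * sizeof(float));
    }

    for (int c = 0; c < nChunks; ++c) {
        // Pick the least-loaded GPU: the one whose work queue is emptiest.
        int best = 0;
        for (int d = 1; d < nDev; ++d)
            if (workers[d].pending < workers[best].pending) best = d;

        GpuWorker &w = workers[best];
        cudaSetDevice(w.device);
        // Input upload omitted; only the dispatch decision is illustrated here.
        process_chunk<<<(chunk + 255) / 256, 256, 0, w.stream>>>(w.buffer, chunk);
        ++w.pending;

        // Drain completed work so 'pending' keeps reflecting buffer fullness.
        if (cudaStreamQuery(w.stream) == cudaSuccess) w.pending = 0;
    }

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(workers[d].device);
        cudaStreamSynchronize(workers[d].stream);
    }
    return 0;
}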
6

Ray-traced radiative transfer on massively threaded architectures

Thomson, Samuel Paul January 2018
In this thesis, I apply techniques from the field of computer graphics to ray tracing in astrophysical simulations, and introduce the grace software library. This is combined with an extant radiative transfer solver to produce a new package, taranis. It allows for fully-parallel particle updates via per-particle accumulation of rates, followed by a forward Euler integration step, and is manifestly photon-conserving. To my knowledge, taranis is the first ray-traced radiative transfer code to run on graphics processing units and target cosmological-scale smoothed particle hydrodynamics (SPH) datasets. A significant optimization effort is undertaken in developing grace. Contrary to typical results in computer graphics, it is found that the bounding volume hierarchies (BVHs) used to accelerate the ray tracing procedure need not be of high quality; as a result, extremely fast BVH construction times are possible (< 0.02 microseconds per particle in an SPH dataset). I show that this exceeds the performance researchers might expect from CPU codes by at least an order of magnitude, and compares favourably to a state-of-the-art ray tracing solution. Similar results are found for the ray tracing itself, where again techniques from computer graphics are examined for effectiveness with SPH datasets, and new optimizations proposed. For high per-source ray counts (≳ 10^4), grace can reduce ray tracing run times by up to two orders of magnitude compared to extant CPU solutions developed within the astrophysics community, and by a factor of a few compared to a state-of-the-art solution. taranis is shown to produce expected results in a suite of de facto cosmological radiative transfer test cases. For some cases, it currently outperforms a serial, CPU-based alternative by a factor of a few. Unfortunately, for the most realistic test its performance is extremely poor, making the current taranis code unsuitable for cosmological radiative transfer. The primary reason for this failing is found to be a small minority of particles which always dominate the timestep criteria. Several plausible routes to mitigate this problem, while retaining parallelism, are put forward.
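The fully-parallel particle update described above (per-particle rate accumulation during ray tracing, followed by a forward Euler step) can be sketched as two small device routines. The simplified rate equation and all names are illustrative assumptions, not taken from grace or taranis.

// Sketch of per-particle rate accumulation plus a forward Euler ionization update.
#include <cuda_runtime.h>

// Called from the (omitted) ray-tracing kernel: many rays may hit the same
// particle, so contributions are accumulated atomically.
__device__ void deposit(float *gamma, int particle, float rate)
{
    atomicAdd(&gamma[particle], rate);
}

// One thread per particle: advance the neutral fraction x_HI by one step dt.
__global__ void euler_update(float *x_HI, const float *gamma,
                             const float *n_e, float alphaB, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float ionization    = gamma[i] * x_HI[i];                  // photoionization
    float recombination = alphaB * n_e[i] * (1.0f - x_HI[i]);  // simplified case-B recombination
    float x = x_HI[i] + dt * (recombination - ionization);     // forward Euler step
    x_HI[i] = fminf(fmaxf(x, 0.0f), 1.0f);                     // keep the fraction physical
}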
7

Accelerating electromagnetic transient simulation of electrical power systems using graphics processing units

Debnath, Jayanta 25 June 2015
This thesis presents the application of graphics processing unit (GPU) based parallel computing to speed up electromagnetic transients (EMT) simulation of large power systems. GPUs provide additional computing capability, originally intended for gaming and animation workloads on desktop computers, but they can also be used for general-purpose computation such as EMT simulation. Traditionally, EMT simulation tools are implemented on CPUs, where the simulation is performed sequentially; as network size grows, simulation times therefore increase drastically. This research shows that the use of GPU computing considerably reduces total simulation time. The thesis proposes a parallelized algorithm for EMT simulation on the GPU and demonstrates it by simulating large power systems. Total computation times for GPU computing, using 'compute unified device architecture' (CUDA)-based C programming, are compared with those of sequential CPU implementations written in ANSI C for systems of various sizes and types. Special parallel processing techniques are implemented to model power system components such as transmission lines and generators. An advanced technique for parallel matrix-vector multiplication on the GPU is implemented, which shows a significant performance gain in the simulation. A sparsity-based technique for the inverse admittance matrix is used to skip multiplications involving zeros. A typical power electronic subsystem is also implemented in this simulation process, which had not previously been reported for GPU platforms in the literature. GPU-based simulation of large power networks with many power electronic subsystems shows a massive performance gain compared to conventional sequential simulation, both with and without the sparsity technique. Finally, this work investigates the effect of granularity on the simulation speedup. Granularity is defined as the ratio of the number of transmission lines used to interconnect the various subsystems to the total size of the network; dividing a network into smaller subsystems requires additional transmission lines. Simulation results show that using excessive transmission lines in the test systems has a negative impact on the overall performance gain.
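The sparsity technique mentioned above amounts to storing the matrix in a sparse format so that multiplications involving zeros are skipped. A generic compressed sparse row (CSR) matrix-vector product kernel, shown below, illustrates the pattern; it is a standard sketch, not the thesis's code, and the EMT-specific matrix assembly is assumed to happen elsewhere.

// Standard CSR sparse matrix-vector product: y = A * x, one thread per row.
#include <cuda_runtime.h>

// rowPtr[r]..rowPtr[r+1] indexes the nonzeros of row r in colIdx/val.
__global__ void spmv_csr(int nRows, const int *rowPtr, const int *colIdx,
                         const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.0f;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += val[j] * x[colIdx[j]];   // only nonzero entries are touched
    y[row] = sum;
}
// In an EMT solver, a kernel like this would apply the (sparse) inverse
// admittance matrix to the vector of injected currents at every time step.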
8

Analysis and Performance Optimization of a GPGPU Implementation of Image Quality Assessment (IQA) Algorithm VSNR

January 2017
Image processing has changed the way we store, view, and share images. One important component of sharing images over networks is image compression. Lossy image compression techniques trade image quality for reduced size. To ensure that the distortion introduced by compression is not easily detectable by humans, the perceived quality of an image needs to be maintained above a certain threshold. Determining this threshold is best done with human subjects, but that is impractical in real-world scenarios. As a solution, image quality assessment (IQA) algorithms are used to automatically compute a fidelity score for an image. However, these algorithms often perform poorly because of the complex statistical computations involved. General-purpose graphics processing unit (GPGPU) programming is one of the solutions proposed to optimize their performance. This thesis presents a Compute Unified Device Architecture (CUDA) based optimized implementation of the full-reference IQA algorithm Visual Signal-to-Noise Ratio (VSNR), which uses an M-level 2D discrete wavelet transform (DWT) with 9/7 biorthogonal filters among other statistical computations. The implementation is tested on four image quality databases containing images with multiple distortions and sizes ranging from 512 x 512 to 1600 x 1280. The CUDA implementation of VSNR shows a speedup of over 32x for 1600 x 1280 images, and the speedup is observed to scale with image size. The results show that the implementation is fast enough to apply VSNR to high-definition video at a frame rate of 60 fps. This work presents the optimizations enabled by the GPU's constant memory and by reuse of allocated GPU memory, and it shows the performance improvement obtained from profiler-driven GPGPU development in CUDA. The presented implementation can be deployed in production alongside existing applications. / Dissertation/Thesis / Masters Thesis Computer Science 2017
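Two of the optimizations highlighted above, constant memory for filter coefficients and reuse of preallocated device buffers, can be sketched as follows. The filtering pass, tap count handling, and names are illustrative assumptions rather than the thesis implementation.

// Sketch: 9/7 analysis filter taps kept in constant memory, scratch buffer reused.
#include <cuda_runtime.h>

#define LP_TAPS 9   // low-pass analysis filter length (9/7 biorthogonal)

__constant__ float c_lowpass[LP_TAPS];   // broadcast-friendly, cached reads for every thread

// Horizontal low-pass pass of one DWT level (downsampling and boundary symmetry omitted).
__global__ void dwt_row_lowpass(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int k = 0; k < LP_TAPS; ++k) {
        int xk = min(max(x + k - LP_TAPS / 2, 0), w - 1);   // clamp at image borders
        acc += c_lowpass[k] * in[y * w + xk];
    }
    out[y * w + x] = acc;
}

// Host side: upload taps once and keep one scratch buffer sized for the largest
// image, instead of allocating and freeing device memory for every frame.
//   cudaMemcpyToSymbol(c_lowpass, h_taps, sizeof(float) * LP_TAPS);
//   cudaMalloc(&d_scratch, maxW * maxH * sizeof(float));   // allocated once, reused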
9

Graphic-Processing-Units Based Adaptive Parameter Estimation of a Visual Psychophysical Model

Gu, Hairong 17 December 2012
No description available.
10

Accelerating a Coupled SPH-FEM Solver through Heterogeneous Computing for use in Fluid-Structure Interaction Problems

Gilbert, John Nicholas 08 June 2015
This work presents a partitioned approach to simulating free-surface flow interaction with hyper-elastic structures in which a smoothed particle hydrodynamics (SPH) solver is coupled with a finite-element (FEM) solver. SPH is a mesh-free, Lagrangian numerical technique frequently employed to study physical phenomena involving large deformations, such as fragmentation or breaking waves. As a mesh-free Lagrangian method, SPH makes an attractive alternative to traditional grid-based methods for modeling free-surface flows and/or problems with rapid deformations where frequent re-meshing and additional free-surface tracking algorithms are non-trivial. This work continues and extends the earlier coupled 2D SPH-FEM approach of Yang et al. [1,2] by linking a double-precision GPU implementation of a 3D weakly compressible SPH formulation [3] with the open source finite element software Code_Aster [4]. Using this approach, the fluid domain is evolved on the GPU, while the CPU updates the structural domain. Finally, the partitioned solutions are coupled using a traditional staggered algorithm. / Ph. D.
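The staggered CPU-GPU coupling described above can be sketched as a loop in which the SPH step is launched asynchronously on the GPU while the CPU advances the FEM structure, after which interface data are exchanged. All function names and the placeholder kernels below are hypothetical, not the actual solver or Code_Aster interfaces.

// Sketch of one staggered coupling step: GPU fluid update overlapped with a CPU structure update.
#include <cuda_runtime.h>

// Placeholder fluid step: a real SPH kernel would first compute densities,
// pressure and viscous forces from neighbouring particles.
__global__ void sph_step(float *pos, const float *vel, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[3 * i + 1] += vel[3 * i + 1] * dt;   // drift in y only, as a stand-in
}

// Placeholder structural step run on the CPU (Code_Aster plays this role in the thesis).
static void fem_step(double *nodes, const double *interfaceLoads, double dt)
{
    (void)nodes; (void)interfaceLoads; (void)dt;        // FEM solve omitted in this sketch
}

void coupled_step(float *d_pos, float *d_vel, int nParticles,
                  double *h_nodes, double *h_loads, float *h_interface,
                  cudaStream_t stream, float dt)
{
    // 1. Launch the fluid step asynchronously; control returns to the CPU immediately.
    sph_step<<<(nParticles + 255) / 256, 256, 0, stream>>>(d_pos, d_vel, nParticles, dt);

    // 2. While the GPU works, advance the structure using the previous step's interface loads.
    fem_step(h_nodes, h_loads, dt);

    // 3. Synchronize and exchange interface data for the next staggered step.
    cudaStreamSynchronize(stream);
    cudaMemcpy(h_interface, d_pos, (size_t)nParticles * 3 * sizeof(float),
               cudaMemcpyDeviceToHost);
    // ...map fluid pressures to structural loads and push the updated boundary
    // positions back to the device before the next step.
}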
