  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

MULTIPLE SEQUENCES ALIGNMENT FOR PHYLOGENETIC TREE CONSTRUCTION USING GRAPHICS PROCESSING UNITS

He, Jintai 01 January 2008 (has links)
Sequence alignment has become a routine procedure in evolutionary biology for identifying evolutionary relationships between primary sequences of DNA, RNA, and protein. The Smith-Waterman and Needleman-Wunsch algorithms perform local and global alignment, respectively. Both are based on dynamic programming, guarantee optimal results, and have been widely used for decades. However, their time and space requirements grow exponentially as the number of sequences increases. Here I present a novel approach that improves the performance of sequence alignment by using a graphics processing unit, which is capable of handling large amounts of data in parallel.
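As an illustration of why dynamic-programming alignment maps well onto a GPU, the sketch below scores one anti-diagonal of a Needleman-Wunsch matrix in parallel: cells on the same anti-diagonal depend only on cells from earlier diagonals, so each cell can be handled by its own thread. This is a generic CUDA sketch with assumed names and scoring parameters, not code from the thesis.

```cuda
// One Needleman-Wunsch anti-diagonal, scored in parallel (generic sketch).
// score is an (m+1) x (n+1) row-major matrix; row 0 and column 0 must be
// initialized with gap penalties on the host before the first launch.
__global__ void nw_antidiagonal(const char *a, const char *b,
                                int m, int n, int diag, int *score,
                                int match, int mismatch, int gap)
{
    // Cells on anti-diagonal `diag` satisfy i + j == diag, with 1 <= i <= m, 1 <= j <= n.
    int iMin = max(1, diag - n);
    int iMax = min(m, diag - 1);
    int i = iMin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > iMax) return;
    int j = diag - i;

    int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
    int best = score[(i - 1) * (n + 1) + (j - 1)] + s;      // diagonal move
    best = max(best, score[(i - 1) * (n + 1) + j] + gap);   // gap in b
    best = max(best, score[i * (n + 1) + (j - 1)] + gap);   // gap in a
    score[i * (n + 1) + j] = best;
}

// Host side (sketch): launch once per anti-diagonal, diag = 2 .. m + n;
// the implicit ordering of launches on one stream serializes the diagonals.
```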
2

Accelerating Parallel Tasks by Optimizing GPU Hardware Utilization

Tsung-Tai Yeh (8775680) 29 April 2020 (has links)
Efficient GPU applications rely on programmers carefully structuring their code to fully utilize GPU resources. In general, programmers spend a significant amount of time optimizing their applications to run efficiently on domain-specific architectures. To reduce this burden, I create several hardware and software solutions that improve resource utilization on parallel processors without significant programmer intervention.

Recently, GPUs are increasingly being deployed in data centers to accelerate latency-driven applications, which exhibit only a modest amount of data parallelism. Synchronous kernel execution in these applications cannot fully utilize the entire GPU. A GPU therefore contains multiple hardware queues to improve throughput by executing multiple kernels on a single device simultaneously when sufficient hardware resources are available. However, the GPU faces severe underutilization once the space in these queues is exhausted, and the performance benefit vanishes with the decreased parallelism. As a result, I proposed a GPU runtime system, Pagoda, which virtualizes GPU hardware resources using an OS-like daemon kernel called MasterKernel. Tasks (kernels) are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity to increase GPU throughput for latency-driven applications. This work introduces several programming APIs to handle task spawning and synchronization, and includes parallel task and warp scheduling policies to reduce runtime overhead.

Latency-driven applications have both high throughput demands and response-time constraints. These applications may launch many kernels that do not fully utilize the GPU unless grouped into large batches. However, batching forces jobs to wait, which increases their latency, and this wait time can be unacceptable given real-world job arrival times. Moreover, the round-robin GPU kernel scheduler is oblivious to application deadlines; this deadline-blind scheduling policy makes it harder to ensure that kernels meet their QoS deadlines. To enhance the responsiveness of the GPU, I also proposed LAX, which includes an execution-time estimate for jobs with one or many kernels. LAX adjusts kernel priorities dynamically based on their slack time to increase the number of jobs that complete by their real-time deadlines, improving both the responsiveness and the throughput of GPUs.

It is well known that grouping threads into warps can create redundancy across scalar values in GPU vector registers. I also found that the layout of thread indices in multi-dimensional thread blocks (TBs) creates redundancy in the registers storing thread IDs. This redundancy propagates into dependent instructions that can be traced and identified statically. To remove these redundant GPU instructions, I proposed DARSIE, which uses a per-kernel compiler finalization check based on TB dimensions to determine which instructions are redundant. Once they are identified, DARSIE hardware skips TB-redundant instructions before they are fetched. DARSIE uses a new multithreaded register renaming and instruction synchronization technique to share the values from redundant instructions among the warps in each TB. Altogether, DARSIE decreases the number of executed instructions, improving GPU performance and energy efficiency.
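To make the daemon-kernel idea behind Pagoda (described above) concrete, here is a minimal persistent-kernel sketch: a long-running kernel claims tasks from a queue that the host keeps filling, so small tasks execute without per-launch overhead. The task struct, variable names, and queue protocol are illustrative assumptions, not Pagoda's actual API; a production version would also need memory fences and host-visible (mapped or managed) memory for the queue, head, tail, and quit flags.

```cuda
#include <cuda_runtime.h>

struct Task { int n; float *data; };   // hypothetical task descriptor

// Persistent "daemon" kernel: each block repeatedly claims the next ticket,
// waits until the host has published that many tasks, and runs the task body.
__global__ void master_kernel(const Task *queue, int *head,
                              const volatile int *tail, const volatile int *quit)
{
    __shared__ int myTask;
    __shared__ int stop;
    for (;;) {
        if (threadIdx.x == 0) {
            myTask = atomicAdd(head, 1);          // claim the next ticket
            stop = 0;
            while (myTask >= *tail)               // wait for the host to publish it
                if (*quit) { stop = 1; break; }
        }
        __syncthreads();
        if (stop) return;                         // whole block exits together
        Task t = queue[myTask];
        for (int i = threadIdx.x; i < t.n; i += blockDim.x)
            t.data[i] *= 2.0f;                    // stand-in for the real task body
        __syncthreads();
    }
}
```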
3

Um estudo do uso eficiente de programas em placas gráficas / A case study on the efficient use of programs on GPUs

Ikeda, Patricia Akemi 20 September 2011 (has links)
Initially designed for graphics processing, graphics cards (GPUs) have evolved into high-performance, general-purpose parallel coprocessors. Because of the huge potential they offer to many research and commercial areas, NVIDIA pioneered the field by launching the CUDA architecture (compatible with several of its cards), an environment that exploits this computational power while being easier to program. To take full advantage of the GPU, some practices must be followed; one of them is to keep the hardware as busy as possible. This work proposes a practical and extensible tool that helps the programmer choose the configuration that best achieves this goal.
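The configuration question the tool addresses, namely how many threads per block keep the hardware busy, can also be asked of the CUDA runtime directly through the occupancy API (added in CUDA 6.5, after this thesis). The snippet below is a generic example of that API, shown for context rather than as part of the thesis's tool.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    // Ask the runtime for the block size that maximizes occupancy for this
    // kernel on the current device (no dynamic shared memory, no size limit).
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
    printf("suggested block size: %d (minimum grid size: %d)\n", blockSize, minGridSize);
    return 0;
}
```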
5

GPGPU-Sim / A study on GPGPU-Sim

Andersson, Filip January 2014 (has links)
This thesis studies the impact of the hardware features of graphics cards on GPU computing performance using the GPGPU-Sim simulation tool. GPU computing is a growing topic in the world of computing and could be an important milestone for computers. A study that identifies a program's performance bottlenecks with respect to the device's hardware parameters is therefore an important step towards tuning devices for higher efficiency. In this work we selected a convolution algorithm, a typical GPGPU application, and conducted several tests to study different performance parameters. These tests were performed on two simulated graphics cards (NVIDIA GTX480, NVIDIA Tesla C2050), both supported by GPGPU-Sim. By changing hardware parameters of the graphics card, such as cache sizes, frequency, and the number of cores, we can perform a fine-grained analysis of the effect of these parameters on the performance of the program. A graphics card working on an image convolution task relies on the L1 cache and performs worst with a small shared memory. Using this simulator to run performance tests on a theoretical GPU architecture could lead to better GPU designs for embedded systems.
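For context, a tiled 2D convolution kernel of the kind typically used in such benchmarks is sketched below: each block stages an input tile plus its halo in shared memory, which is exactly where the shared-memory/L1 balance examined in the simulations matters. The tile size, filter radius, and names are illustrative and are not the configuration used in the thesis.

```cuda
#define TILE   16
#define RADIUS 2   // 5x5 filter; taps uploaded from the host with cudaMemcpyToSymbol

__constant__ float d_filter[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

// Tiled 2D convolution, launched with TILE x TILE thread blocks.
__global__ void conv2d(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int x = blockIdx.x * TILE + threadIdx.x;   // output pixel for this thread
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the tile plus halo, clamping at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int gx = (int)(blockIdx.x * TILE) + dx - RADIUS;
            int gy = (int)(blockIdx.y * TILE) + dy - RADIUS;
            gx = min(max(gx, 0), width - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x < width && y < height) {
        float sum = 0.0f;
        for (int fy = 0; fy < 2 * RADIUS + 1; ++fy)
            for (int fx = 0; fx < 2 * RADIUS + 1; ++fx)
                sum += d_filter[fy * (2 * RADIUS + 1) + fx]
                     * tile[threadIdx.y + fy][threadIdx.x + fx];
        out[y * width + x] = sum;
    }
}
```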
6

Multi-GPU Load Balancing for Simulation and Rendering

Hagan, Robert Douglas 04 August 2011 (has links)
GPU computing can significantly improve performance by taking advantage of the massive parallelism of GPUs for data-parallel applications. Computation in visualization applications is suitable for parallelization on the GPU, which can improve performance and interactivity in these applications. If used effectively, multiple GPUs can lead to a significant speedup over a single GPU. However, the use of multiple GPUs requires memory management, scheduling, and load balancing to ensure that a program takes full advantage of the available processors. This work presents methods for data-driven and dynamic multi-GPU load balancing using a pipelined approach and a framework for use with different applications. Data-driven load balancing can improve utilization by taking into account past performance for different combinations of input parameters. The dynamic load balancing method, based on buffer fullness, can adjust to workload changes at runtime to gain an additional performance improvement. This work provides a framework for load balancing that accounts for the differing characteristics of applications, and the implementation of a multi-GPU data structure allows these load balancing methods to be used within the framework. The effectiveness of the framework is demonstrated with performance results from interactive visualization that show a significant speedup due to load balancing. / Master of Science
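A host-side sketch of one simple form of dynamic load balancing follows (not the thesis framework): time each GPU's share of the previous frame with CUDA events, then hand faster devices a proportionally larger slice of the next frame's work. The names `share` and `lastMs` are illustrative.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Recompute per-device work fractions from last frame's measured kernel times.
void rebalance(std::vector<double> &share, const std::vector<float> &lastMs)
{
    double total = 0.0;
    std::vector<double> speed(lastMs.size());
    for (size_t i = 0; i < lastMs.size(); ++i) {
        speed[i] = 1.0 / (double)lastMs[i];   // faster device => larger speed
        total += speed[i];
    }
    for (size_t i = 0; i < share.size(); ++i)
        share[i] = speed[i] / total;          // fractions sum to 1
}

// Per frame, for each device d on its own stream:
//   cudaSetDevice(d);
//   cudaEventRecord(start[d], stream[d]);
//   ... launch this device's slice of the work ...
//   cudaEventRecord(stop[d], stream[d]);
// After synchronizing: cudaEventElapsedTime(&lastMs[d], start[d], stop[d]);
```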
7

Ray-traced radiative transfer on massively threaded architectures

Thomson, Samuel Paul January 2018 (has links)
In this thesis, I apply techniques from the field of computer graphics to ray tracing in astrophysical simulations, and introduce the grace software library. This is combined with an extant radiative transfer solver to produce a new package, taranis. It allows for fully-parallel particle updates via per-particle accumulation of rates, followed by a forward Euler integration step, and is manifestly photon-conserving. To my knowledge, taranis is the first ray-traced radiative transfer code to run on graphics processing units and target cosmological-scale smoothed particle hydrodynamics (SPH) datasets. A significant optimization effort is undertaken in developing grace. Contrary to typical results in computer graphics, it is found that the bounding volume hierarchies (BVHs) used to accelerate the ray tracing procedure need not be of high quality; as a result, extremely fast BVH construction times are possible (< 0.02 microseconds per particle in an SPH dataset). I show that this exceeds the performance researchers might expect from CPU codes by at least an order of magnitude, and compares favourably to a state-of-the-art ray tracing solution. Similar results are found for the ray tracing itself, where again techniques from computer graphics are examined for effectiveness with SPH datasets, and new optimizations proposed. For high per-source ray counts (≳ 104), grace can reduce ray tracing run times by up to two orders of magnitude compared to extant CPU solutions developed within the astrophysics community, and by a factor of a few compared to a state-of-the-art solution. taranis is shown to produce the expected results in a suite of de facto standard cosmological radiative transfer test cases. For some cases, it currently outperforms a serial, CPU-based alternative by a factor of a few. Unfortunately, for the most realistic test its performance is extremely poor, making the current taranis code unsuitable for cosmological radiative transfer. The primary reason for this failing is found to be a small minority of particles which always dominate the timestep criteria. Several plausible routes to mitigate this problem, while retaining parallelism, are put forward.
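Fast, lower-quality BVH construction of the kind described here is usually based on sorting particles by a spatial Morton code (the LBVH family of builders). The sketch below shows that key-generation step in generic form; it is a standard technique included for illustration, not code from the grace library, and the kernel and parameter names are assumptions.

```cuda
// Spread the lower 10 bits of v so that there are two zero bits between each
// original bit (standard bit-interleaving trick used for 30-bit 3D Morton codes).
__device__ unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// One Morton key per particle: quantize each coordinate to 10 bits inside the
// bounding box, then interleave. Sorting by key groups nearby particles, which
// is what makes very fast, "good enough" BVH builds possible.
__global__ void mortonKeys(const float3 *pos, unsigned int *keys, int n,
                           float3 lo, float3 invExtent)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int x = (unsigned int)fminf(fmaxf((pos[i].x - lo.x) * invExtent.x * 1023.0f, 0.0f), 1023.0f);
    unsigned int y = (unsigned int)fminf(fmaxf((pos[i].y - lo.y) * invExtent.y * 1023.0f, 0.0f), 1023.0f);
    unsigned int z = (unsigned int)fminf(fmaxf((pos[i].z - lo.z) * invExtent.z * 1023.0f, 0.0f), 1023.0f);
    keys[i] = (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}
```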
8

Accelerating electromagnetic transient simulation of electrical power systems using graphics processing units

Debnath, Jayanta 25 June 2015 (has links)
This thesis presents the application of graphics processing unit (GPU) based parallel computing to speed up electromagnetic transient (EMT) simulation of large power systems. GPUs provide extra computing capability to handle gaming and animation applications on desktop computers, but they can also be used for general-purpose computations such as EMT simulation. Traditionally, EMT simulation tools are implemented on CPUs, where the simulation is performed sequentially; hence, as the network size grows, simulation times increase drastically. This research shows that the use of GPU computing considerably reduces the total simulation time. The thesis proposes a parallelized algorithm for EMT simulation on the GPU and demonstrates it by simulating large power systems. Total computation times for GPU computing, using 'compute unified device architecture' (CUDA) based C programming, are compared with those of sequential CPU implementations in ANSI C for systems of various sizes and types. Special parallel processing techniques are implemented to model various power system components such as transmission lines and generators. An advanced technique for parallel matrix-vector multiplication on the GPU is implemented, which shows a significant performance gain in the simulation. A sparsity-based technique for the inverse admittance matrix is used to skip multiplications involving zeros. A typical power electronic subsystem is also implemented, which had not previously been done on GPU platforms in the literature. GPU-based simulation of large power networks with many power electronic subsystems shows a massive performance gain compared to conventional sequential simulation, with and without the sparsity technique. Finally, this research investigates the effect of granularity on the speedup of the simulation. Granularity is defined as the ratio of the number of transmission lines used to interconnect the various subsystems to the total size of the network; note that dividing a network into smaller subsystems requires additional transmission lines. Simulation results show that the use of excessive transmission lines in the test systems has a negative impact on the overall performance gain.
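A sparsity-aware matrix-vector multiply of the kind described for the inverse admittance matrix is commonly written over a compressed sparse row (CSR) representation, so that only stored non-zeros are multiplied. The kernel below is a generic one-thread-per-row CSR sketch, not the thesis code.

```cuda
// y = A * x with A in CSR format: rowPtr has nRows+1 entries, colIdx/val hold
// the column index and value of each stored non-zero. Zero entries are never
// stored, so they cost no multiplications.
__global__ void spmv_csr(int nRows, const int *rowPtr, const int *colIdx,
                         const double *val, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];
    y[row] = sum;
}
```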
9

Analysis and Performance Optimization of a GPGPU Implementation of Image Quality Assessment (IQA) Algorithm VSNR

January 2017 (has links)
abstract: Image processing has changed the way we store, view and share images. One important component of sharing images over networks is image compression. Lossy image compression techniques compromise the quality of images to reduce their size. To ensure that the distortion introduced by compression is not readily detectable by humans, the perceived quality of an image needs to be maintained above a certain threshold. Determining this threshold is best done using human subjects, but that is impractical in real-world scenarios. As a solution, image quality assessment (IQA) algorithms are used to automatically compute a fidelity score for an image. However, the performance of IQA algorithms suffers because of the complex statistical computations involved. General Purpose Graphics Processing Unit (GPGPU) programming is one of the solutions proposed to optimize the performance of these algorithms. This thesis presents a Compute Unified Device Architecture (CUDA) based optimized implementation of the full-reference IQA algorithm Visual Signal to Noise Ratio (VSNR), which uses an M-level 2D Discrete Wavelet Transform (DWT) with 9/7 biorthogonal filters among other statistical computations. The implementation is tested on four image quality databases containing images with multiple distortions and sizes ranging from 512 x 512 to 1600 x 1280. The CUDA implementation of VSNR shows a speedup of over 32x for 1600 x 1280 images, and the speedup scales with image size. The results show that the implementation is fast enough to apply VSNR to high-definition video at a frame rate of 60 fps. This work presents the optimizations obtained from the use of the GPU's constant memory and from the reuse of allocated GPU memory, and it shows the performance improvement gained from profiler-driven GPGPU development in CUDA. The presented implementation can be deployed in production and combined with existing applications. / Dissertation/Thesis / Masters Thesis Computer Science 2017
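The constant-memory optimization mentioned above typically looks like the sketch below: filter taps that every thread reads (for example, the 9/7 DWT analysis filters) live in __constant__ memory, which is cached and broadcast efficiently when all threads in a warp read the same element. The names are illustrative and the downsampling step of the DWT is omitted; this is not the thesis code, and the actual filter coefficients would be uploaded from the host.

```cuda
__constant__ float c_lowpass[9];    // 9-tap analysis low-pass filter
__constant__ float c_highpass[7];   // 7-tap analysis high-pass filter

// Row-wise filtering step of a separable DWT pass (interior pixels only;
// border handling and the subsequent downsampling are omitted for brevity).
__global__ void dwt_row_lowpass(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 4 || x >= width - 4 || y >= height) return;
    float sum = 0.0f;
    for (int k = -4; k <= 4; ++k)
        sum += c_lowpass[k + 4] * in[y * width + (x + k)];
    out[y * width + x] = sum;
}

// Host side: upload the taps once and reuse the same device buffers per frame, e.g.
//   cudaMemcpyToSymbol(c_lowpass, h_lowpass, 9 * sizeof(float));
```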
10

Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs

Abdelfattah, Ahmad 15 January 2015 (has links)
High performance computing (HPC) platforms are evolving towards more heterogeneous configurations to support the workloads of various applications. The current hardware landscape is composed of traditional multicore CPUs equipped with hardware accelerators that can handle high levels of parallelism. Graphics Processing Units (GPUs) are popular high performance hardware accelerators in modern supercomputers. GPU programming follows a different model than CPU programming, which means that many numerical kernels have to be redesigned and optimized specifically for this architecture. GPUs usually outperform multicore CPUs on compute-intensive, massively parallel applications with regular processing patterns. However, most scientific applications rely on crucial memory-bound kernels and may suffer bottlenecks due to memory bus latency. They can still take advantage of the GPU's compute power, provided that an efficient architecture-aware design is achieved. This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs. Based on hierarchical register blocking, double buffering, and latency hiding techniques, this strategy leverages the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries. The work focuses on matrix-vector multiplication (MVM) kernels as representative and highly important memory-bound operations in this context. Each kernel inherits the benefits of the proposed strategies. By exposing a proper set of tuning parameters, the strategy is flexible enough to suit different types of matrices, ranging from large dense matrices to sparse matrices with dense block structures, while maintaining high performance. Furthermore, the tuning parameters are used to maintain the relative performance across different GPU architectures. Multi-GPU acceleration is proposed to scale the performance across several devices. The performance experiments show improvements ranging from 10% up to more than fourfold speedup against competitive GPU MVM approaches. Performance impacts on high-level numerical libraries and a computational astronomy application are highlighted, since such memory-bound kernels are often located at the innermost levels of the software chain. The excellent performance obtained in this work has led to the adoption of the code in NVIDIA's widely distributed cuBLAS library.
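To illustrate the class of memory-bound kernel the dissertation targets, the sketch below is a basic dense matrix-vector multiply (y = A*x, column-major A as in BLAS) in which each thread accumulates one row while the vector x is staged through shared memory tile by tile. This is a generic illustration of the idea, with an assumed tile size; it is not the dissertation's optimized code nor the cuBLAS implementation.

```cuda
#define TILE 128   // illustrative tile length for staging x

__global__ void gemv(int m, int n, const double *A, int lda,
                     const double *x, double *y)
{
    __shared__ double xs[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    for (int j0 = 0; j0 < n; j0 += TILE) {
        int tile = min(TILE, n - j0);
        // Cooperatively stage a tile of x so it is read from global memory once per block.
        for (int t = threadIdx.x; t < tile; t += blockDim.x)
            xs[t] = x[j0 + t];
        __syncthreads();
        if (row < m)
            for (int j = 0; j < tile; ++j)
                sum += A[(j0 + j) * lda + row] * xs[j];   // A(row, j0+j), column-major
        __syncthreads();
    }
    if (row < m) y[row] = sum;
}
```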
