61

On-Chip Diagnosis of Generalized Delay Failures using Compact Fault Dictionaries

Beckler, Matthew Layne 01 April 2017
Integrated circuits (ICs) are an essential part of nearly every electronic device. From toys to appliances, spacecraft to power plants, modern society truly depends on the reliable operation of billions of ICs around the world. The steady shrinking of IC transistors over past decades has enabled drastic improvements in IC performance while reducing area and power consumption. However, with continued scaling of semiconductor fabrication processes, failure sources of many types are becoming more pronounced and are increasingly affecting system operation. Additionally, increasing variation during fabrication increases the difficulty of yielding chips in a cost-effective manner. Finally, phenomena such as early-life and wear-out failures pose new challenges to ensuring robustness. One approach to ensuring robustness centers on performing tests at run time, identifying the location of any defects, and repairing, replacing, or avoiding the affected portion of the system. Leveraging the existing design-for-testability (DFT) structures, thorough tests that target delay defects are applied using the scan logic. Testing is performed periodically to minimize user-perceived performance loss, and if testing detects any failures, on-chip diagnosis is performed to localize the defect to the level of repair, replacement, or avoidance. In this dissertation, an on-chip diagnosis solution using a fault dictionary is described and validated through a large variety of experiments. Conventional fault dictionary approaches can be used to locate failures but are limited to simplistic fail behaviors due to the significant computational resources required for dictionary generation and memory storage. To capture the misbehaviors expected from scaled technologies, including early-life and wear-out failures, the Transition-X (TRAX) fault model is introduced.
Similar to a transition fault, a TRAX fault is activated by a signal level transition or glitch, and produces the unknown value X when activated. Recognizing that the limited options for runtime recovery of defective hardware relax the conventional requirements for defect localization, a new fault dictionary is developed to provide diagnosis localization only to the required level of the design hierarchy. On-chip diagnosis using such a hierarchical dictionary is performed using a new scalable hardware architecture. To reduce the computation time required to generate the TRAX hierarchical dictionary for large designs, the incredible parallelism of graphics processing units (GPUs) is harnessed to provide an efficient fault simulation engine for dictionary construction. Finally, the on-chip diagnosis process is evaluated for suitability in providing accurate diagnosis results even when multiple concurrent defects are affecting a circuit.
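The dictionary-based diagnosis flow this record builds on can be illustrated with a toy sketch: simulate each modeled fault to record which test patterns it fails, then match the observed on-chip pass/fail signature against those stored records. The fault names and signatures below are invented for illustration; the TRAX model and the hierarchical dictionary compaction from the dissertation are not reproduced.

```python
# Toy dictionary-based diagnosis: each modeled fault maps to a pass/fail
# signature over the test set (bit i set = test pattern i fails). The
# signatures here are hand-written illustrations, not simulation output.
fault_dictionary = {
    "and2/input_a": 0b1010,
    "xor1/output":  0b0110,
    "dff3/clock":   0b0001,
}

def diagnose(observed):
    """Return candidate fault sites whose stored signature matches the
    observed tester response exactly."""
    return [f for f, sig in fault_dictionary.items() if sig == observed]

print(diagnose(0b0110))   # ['xor1/output']
```

A real dictionary is generated by fault simulation over the whole design, which is why the dissertation turns to GPUs to build it at scale.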
62

GPU computing of Heat Equations

Zhang, Junchi 29 April 2015
There is an increasing amount of evidence in scientific research and industrial engineering that the graphics processing unit (GPU) can perform certain computations far more efficiently than the CPU. The heat equation is one of the best-known partial differential equations, with well-developed theory and wide application in engineering. In this report we therefore use the heat equation to numerically solve for heat distributions at different points in time, using both GPU and CPU programs. The heat equation with three different boundary conditions (Dirichlet, Neumann, and periodic) was discretized by finite difference approximations and solved on the given domain. Programs solving the resulting linear systems were implemented on both the GPU and the CPU. A convergence and stability analysis of the finite difference method was performed to guarantee the correctness of the programs. Iterative and direct methods for solving the linear system on the GPU are also discussed. The results show that the GPU has a large advantage over the CPU in running time for large problem sizes.
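The finite-difference discretization described above can be sketched in a few lines for the 1-D explicit case with Dirichlet boundaries. This is a minimal CPU/NumPy sketch of the scheme, not the report's GPU implementation; the vectorized interior update is exactly the per-grid-point work a GPU thread would perform.

```python
import numpy as np

def heat_explicit(u0, alpha, dx, dt, steps):
    """Explicit finite-difference scheme for the 1-D heat equation
    u_t = alpha * u_xx with homogeneous Dirichlet boundaries (endpoints
    held at zero)."""
    r = alpha * dt / dx**2
    assert r <= 0.5, "stability condition for the explicit scheme"
    u = u0.copy()
    for _ in range(steps):
        # each interior point depends only on its two neighbors: one GPU
        # thread per point in the parallel version
        u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

# diffuse an initial heat spike; the peak spreads out and decays
u0 = np.zeros(11)
u0[5] = 1.0
u = heat_explicit(u0, alpha=1.0, dx=0.1, dt=0.004, steps=100)
```

The stability assertion reflects the report's point that a convergence and stability analysis must precede the implementation: for the explicit scheme, `alpha*dt/dx**2` must not exceed 1/2.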
63

Accelerating Cryptosystems on Hardware Platforms

Wang, Wei 13 April 2014
One of the major breakthroughs in computer science theory in the past decade is the first construction of a fully homomorphic encryption (FHE) scheme, introduced by Gentry. Using an FHE scheme, one may perform an arbitrary number of computations directly on encrypted data without revealing the secret key. A practical FHE scheme would therefore provide invaluable security for emerging technologies such as cloud computing and cloud-based storage. However, FHE remains far from real-life deployment due to serious efficiency impediments. The main part of this dissertation focuses on accelerating existing FHE schemes using GPU and hardware designs to make them more efficient and practical for real-life applications. Another part of this dissertation concerns the hardware design of large-key-size RSA cryptosystems. As Moore's law continues to drive computer technology, the key size of the Rivest-Shamir-Adleman (RSA) cryptosystem must be upgraded to 2048, 4096, or even 8192 bits to provide a higher level of security. In this dissertation, FFT multiplication is employed for the large-size RSA hardware design, instead of the traditional interleaved Montgomery multiplication, to show the feasibility of FFT multiplication for large-size RSA designs.
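The idea behind FFT multiplication can be sketched briefly: treat each operand as a digit vector, multiply the vectors by FFT-based convolution, and weigh the resulting digit products by their positions. A hardware RSA design would instead use a number-theoretic transform over a finite field (avoiding floating-point rounding) with carry-save arithmetic; this floating-point sketch only illustrates the principle.

```python
import numpy as np

def fft_multiply(a, b):
    """Multiply non-negative integers via FFT convolution of their base-10
    digit vectors. Rounding with rint is safe here because the convolution
    entries stay far below float64's exact-integer range for these sizes."""
    da = [int(c) for c in str(a)][::-1]   # least-significant digit first
    db = [int(c) for c in str(b)][::-1]
    n = 1
    while n < len(da) + len(db):          # pad to a power of two
        n *= 2
    conv = np.rint(np.fft.irfft(np.fft.rfft(da, n) * np.fft.rfft(db, n), n))
    # weigh each convolved "digit" (which may exceed 9) by its position
    return sum(int(d) * 10**i for i, d in enumerate(conv.astype(np.int64)))

print(fft_multiply(12345, 6789) == 12345 * 6789)  # True
```

The payoff is asymptotic: schoolbook and interleaved Montgomery multiplication are quadratic in the digit count, while the FFT approach is quasi-linear, which is what makes it attractive at 4096- and 8192-bit key sizes.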
64

Accelerating SRD Simulation on GPU

Chen, Zhilu 17 April 2013
Stochastic Rotation Dynamics (SRD) is a particle-based simulation method that can be used to model complex fluids in either two or three dimensions, which is very useful in biology and physics. Although SRD is computationally efficient compared to other simulation methods, it still takes a long time to run when the size of the model is large, e.g., when using a large array of particles to simulate dense polymers. In some cases, the simulation can take months to produce results. This research therefore focuses on accelerating the SRD simulation using a GPU. GPU acceleration can reduce the simulation time by orders of magnitude. It is also cost-effective, because a GPU costs significantly less than a computer cluster. Compute Unified Device Architecture (CUDA) programming makes it possible to parallelize the program to run on hundreds or thousands of thread processors on the GPU. The program is divided into many concurrent threads, and several kernel functions are used for data synchronization. The speedup from GPU acceleration varies with the parameters of the simulation, such as the size of the model, the density of the particles, the formation of polymers, and above all the complexity of the algorithm itself. Compared to the CPU version, the GPU version achieves about a 10x speedup for the particle simulation and up to a 50x speedup for polymers. Further performance improvement can be achieved by using multiple GPUs and code optimization.
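The core SRD collision step is easy to sketch on the CPU, and the sketch also shows why the method parallelizes well: each cell's update is independent of every other cell. This is a hedged 2-D toy with illustrative parameter names; the grid-shift procedure, the streaming step, and the thesis's CUDA kernel structure are omitted.

```python
import numpy as np

def srd_collision(pos, vel, cell_size, alpha, rng):
    """One SRD collision step in 2-D: bin particles into square cells, then
    rotate each particle's velocity by +/-alpha about its cell's mean
    velocity. Cell updates are independent -- on a GPU, one thread (or
    block) per cell."""
    cells = np.floor(pos / cell_size).astype(int)
    ids = cells[:, 0] * 100000 + cells[:, 1]   # hashable per-cell id
    new_vel = vel.copy()
    for cid in np.unique(ids):
        mask = ids == cid
        vmean = vel[mask].mean(axis=0)
        theta = alpha if rng.random() < 0.5 else -alpha
        c, s = np.cos(theta), np.sin(theta)
        rel = vel[mask] - vmean
        new_vel[mask] = vmean + rel @ np.array([[c, -s], [s, c]]).T
    return new_vel

rng = np.random.default_rng(0)
pos = rng.random((200, 2))
vel = rng.normal(size=(200, 2))
vel2 = srd_collision(pos, vel, cell_size=0.2, alpha=np.pi / 2, rng=rng)
```

Because the rotation is applied about each cell's mean velocity, the step conserves momentum and kinetic energy per cell, which is the physical property that makes SRD a valid coarse-grained fluid model.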
65

Krylov iterative methods with communication and energy efficiency optimization on heterogeneous clusters

Chen, Langshi 04 November 2015
Iterative Krylov methods are frequently used in extremely large-scale linear problems, such as solving linear systems or finding eigenvalues and eigenvectors of matrices. As these iterative methods require a substantial computational workload, they are normally deployed on large clusters with distributed memory, communicating via MPI. When the problem size scales up, communication becomes a major bottleneck to reaching higher scalability, for two reasons: 1) many iterative methods rely on BLAS-2 low-level matrix-vector kernels that are communication-intensive; 2) data movement (memory access, MPI communication) is much slower than the processor's speed. In the case of sparse matrix operations such as sparse matrix-vector multiplication (SpMV), communication even replaces computation as the dominant time cost. Furthermore, the advent of accelerators and coprocessors such as NVIDIA's GPUs makes computation cheaper, while the communication cost remains high in such heterogeneous CPU-coprocessor systems. Thus, the first part of this work focuses on optimizing the communication cost of iterative methods on heterogeneous clusters. Beyond communication cost, the power wall has become another bottleneck for future exascale computing. Research indicates that power-aware algorithmic implementation strategies can efficiently reduce the power dissipation of large clusters. We also explore potentially energy-saving implementations of iterative methods in our experiments. Finally, both the communication optimization and the energy-efficient implementation are integrated into a GMRES method, which demands an auto-tuning framework to maximize its performance.
66

Multiresolution image-space rendering for interactive global illumination

Nichols, Gregory Boyd 01 July 2010
Global illumination adds tremendous visual richness to rendered images. Unfortunately, such illumination proves quite costly to compute, and is therefore often coarsely approximated by interactive applications, or simply omitted altogether. Global illumination is often quite low-frequency, aside from sharp changes at discontinuities. This thesis describes three novel multiresolution image-space methods that exploit this characteristic to accelerate rendering speeds. These techniques run completely on the GPU at interactive rates and require no precomputation, allowing fully dynamic lighting, geometry, and camera. The first approach, multiresolution splatting, is a novel multiresolution method for rendering indirect illumination. This work extends reflective shadow maps, an image space method that splats contributions from secondary light sources into eye-space. Splats are refined into multiresolution patches, rendering indirect contributions at low resolution where lighting changes slowly and at high resolution near discontinuities; this greatly reduces GPU fill rate and enhances performance. The second method, image space radiosity, significantly improves the performance of multiresolution splatting, introducing an efficient stencil-based parallel refinement technique. This method also adapts ideas from object-space hierarchical radiosity methods to image space, introducing two adaptive sampling methods that allow much finer sampling of the reflective shadow map where needed. These modifications significantly improve temporal coherence while maintaining performance. The third approach adapts these techniques to accelerate the rendering of direct illumination from large area light sources. Visibility is computed using a coarse screen-space voxelization technique, allowing binary visibility queries using ray marching. This work also proposes a new incremental refinement method that considers both illumination and visibility variations. 
Both diffuse and non-diffuse surfaces are supported, and illumination can vary over the surface of the light, enabling dynamic content such as video screens.
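The binary visibility query used by the third approach can be sketched as a ray march through a boolean occupancy grid. This toy marches in a normalized [0,1)^3 world-space grid with a fixed step count; the thesis's screen-space voxelization, adaptive sampling, and refinement details are not reproduced, and all names are illustrative.

```python
import numpy as np

def visible(voxels, p0, p1, steps=64):
    """Binary visibility between points p0 and p1 by marching through a
    boolean occupancy grid: blocked as soon as any sample lands in an
    occupied voxel."""
    n = voxels.shape[0]
    for t in np.linspace(0.0, 1.0, steps):
        p = p0 + t * (p1 - p0)
        i, j, k = np.clip((p * n).astype(int), 0, n - 1)
        if voxels[i, j, k]:
            return False
    return True

n = 32
voxels = np.zeros((n, n, n), dtype=bool)
voxels[14:18, 14:18, 14:18] = True   # cubic occluder in the middle
a = np.array([0.1, 0.5, 0.5])
b = np.array([0.9, 0.5, 0.5])
```

Because the grid is coarse and the march is cheap, many such queries per pixel (one per area-light sample) remain affordable, which is what makes area-light visibility tractable at interactive rates.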
67

Radar Signal Processing with Graphics Processors (GPUs)

Pettersson, Jimmy, Wainwright, Ian January 2010
No description available.
68

GPU programming for real-time watercolor simulation

Scott, Jessica Stacy 17 February 2005
This thesis presents a method for combining GPU programming with traditional programming to create a fluid-simulation-based watercolor tool for artists. The application provides a graphical interface and a canvas upon which artists can create simulated watercolors in real time. The GPU, or Graphics Processing Unit, is an efficient and highly parallel processor located on the graphics card of a computer; GPU programming is touted as a way to improve performance in graphics and non-graphics applications. The effectiveness of this method in speeding up large, general-purpose programs, however, is found here to be disappointing. In a small application with minimal CPU/GPU interaction, theoretical speedups of 10 times may be achieved, but with the limitations of communication speed between the GPU and the CPU, gains are slight when this method is used in conjunction with traditional programming.
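The gap between the theoretical 10x kernel speedup and the slight overall gains reported here follows from an Amdahl's-law-style accounting: only the compute portion enjoys the GPU's acceleration, while CPU-GPU transfer time is paid in full on every call. A back-of-the-envelope model, with all numbers illustrative:

```python
def effective_speedup(compute_time, kernel_speedup, transfer_time):
    """Overall speedup when only the compute portion is accelerated and
    the per-call CPU<->GPU transfer time remains unaccelerated overhead."""
    return compute_time / (compute_time / kernel_speedup + transfer_time)

# A 10x kernel barely helps once transfers dominate, matching the thesis's
# observation about heavy CPU/GPU interaction.
print(effective_speedup(10.0, 10.0, 0.0))   # 10.0: no transfer cost
print(effective_speedup(10.0, 10.0, 8.0))   # ~1.11: transfer-bound
```

This is why the era's advice was to keep data resident on the GPU across frames rather than round-tripping it through the CPU each simulation step.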
69

An Implementation of the Discontinuous Galerkin Method on Graphics Processing Units

Fuhry, Martin 10 April 2013
Computing highly accurate approximate solutions to partial differential equations (PDEs) requires both a robust numerical method and a powerful machine. We present a parallel implementation of the discontinuous Galerkin (DG) method on graphics processing units (GPUs). In addition to being flexible and highly accurate, DG methods accommodate parallel architectures well, as their discontinuous nature produces entirely element-local approximations. While GPUs were originally intended to compute and display computer graphics, they have recently become popular general-purpose computing devices. These cheap and extremely powerful devices have a massively parallel structure. With the recent addition of double-precision floating-point support, GPUs have matured into serious platforms for parallel scientific computing. In this thesis, we present an implementation of the DG method applied to systems of hyperbolic conservation laws in two dimensions on a GPU using NVIDIA's Compute Unified Device Architecture (CUDA). Numerous computed examples, from linear advection to the Euler equations, demonstrate the modularity and usefulness of our implementation. Benchmarking our method against a single-core, serial implementation of the DG method reveals a speedup of over fifty times using a USD $500 NVIDIA GTX 580.
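The element-local structure that makes DG map well to GPUs shows up even in the degenerate piecewise-constant (p = 0) case, where DG for 1-D linear advection with an upwind numerical flux reduces to the first-order finite-volume scheme: each element's update reads only its own average and one upwind neighbor. A serial CPU sketch of that special case (the thesis's higher-order 2-D CUDA implementation is far richer):

```python
import numpy as np

def dg_p0_advect(u, a, dx, dt, steps):
    """Piecewise-constant (p = 0) DG scheme for u_t + a*u_x = 0 with
    periodic boundaries and an upwind flux; each cell's update is local
    to itself and its upwind neighbor -- one GPU thread per element."""
    assert a > 0 and a * dt / dx <= 1.0   # CFL condition
    u = u.copy()
    for _ in range(steps):
        flux = a * u                       # upwind flux at each right face
        u = u - (dt / dx) * (flux - np.roll(flux, 1))
    return u

# advect a Gaussian pulse a distance of 0.5 on a periodic unit interval
n = 100
x = np.linspace(0.0, 1.0, n, endpoint=False)
u0 = np.exp(-100.0 * (x - 0.5) ** 2)
u1 = dg_p0_advect(u0, a=1.0, dx=1.0 / n, dt=0.005, steps=100)
```

Higher polynomial orders add per-element volume and surface integrals but keep the same neighbor-only coupling, which is why the method scales so well on massively parallel hardware.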
70

Multilevel multidimensional scaling on the GPU

Ingram, Stephen F. 05 1900
We present Glimmer, a new multilevel visualization algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm helps avoid local minima while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We show that Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm. We also propose a novel texture paging strategy called distance paging for working with precomputed distance matrices too large to fit in texture memory.
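The quantity behind Glimmer's termination test, normalized stress, is cheap to state. The textbook form below (squared layout error over squared high-dimensional distances) is an assumption for illustration; the paper's filtered approximation of this function is not reproduced.

```python
import numpy as np

def pairwise(X):
    """All pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def normalized_stress(D_high, D_low):
    """Normalized stress of a layout: squared mismatch between the
    original distances D_high and the layout distances D_low, scaled by
    the total squared original distance (0 = perfect embedding)."""
    return ((D_high - D_low) ** 2).sum() / (D_high ** 2).sum()

rng = np.random.default_rng(1)
X = rng.random((50, 3))
D = pairwise(X)
stress_exact = normalized_stress(D, pairwise(X))                # perfect: 0
stress_rand = normalized_stress(D, pairwise(rng.random((50, 2))))
```

An MDS layout improves as this value drops toward zero, so monitoring a smoothed estimate of it is a natural stopping rule for an iterative force-based solver.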
