31 |
Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures
Panwar, Lokendra Singh 21 October 2014
Today, heterogeneous computing has reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new challenges for the scientific community, such as selecting the best processor for an application, finding effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approach to address some of these significant issues.
First, we present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs.
We then present our approach to accelerate a seismology modeling application that is based on the finite difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the generic computational complexities involved in porting such applications to GPUs and present our strategy for efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive the hardware-specific optimizations on the GPU. The performance evaluation of our approach delivers a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. We also study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, in a seismology modeling application and discuss the lessons learned. / Master of Science
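To make the finite difference method mentioned in this abstract concrete, the following is a minimal sketch of one time step of a second-order stencil for the 1D wave equation. It is an illustration only: the abstract does not state which equations, dimensionality, or stencil order the thesis uses, and the function name `wave_step` and its parameters are hypothetical. The loop body is independent for each grid point, which is the property that lets such a stencil map onto one GPU thread per point.

```python
def wave_step(u_prev, u_curr, c, dt, dx):
    """Advance the 1D wave equation u_tt = c^2 * u_xx by one time step.

    u_prev, u_curr: field values at the previous and current time levels.
    Illustrative sketch only; not the scheme used in the thesis.
    """
    r2 = (c * dt / dx) ** 2  # squared Courant number; the scheme is stable for r2 <= 1
    n = len(u_curr)
    u_next = [0.0] * n  # both end points are held at zero (fixed boundaries)
    for i in range(1, n - 1):
        # Three-point approximation of the second spatial derivative (times dx^2).
        lap = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        # Central difference in time, solved for the next time level.
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + r2 * lap
    return u_next


# Usage: a unit spike in the middle of the grid splits into two travelling pulses.
prev = [0.0] * 11
curr = [0.0] * 11
curr[5] = 1.0
nxt = wave_step(prev, curr, c=1.0, dt=1.0, dx=1.0)
```

With a Courant number of exactly 1, the spike moves one grid cell per step in each direction, which is a convenient sanity check before porting a stencil to a GPU.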
|
32 |
Multi-level Parallelism with MPI and OpenACC for CFD Applications
McCall, Andrew James 14 June 2017
High-level parallel programming approaches, such as OpenACC, have recently become popular in complex fluid dynamics research since they are cross-platform and easy to implement. OpenACC is a directive-based programming model that, unlike low-level programming models, abstracts the details of implementation on the GPU. Although OpenACC generally limits the performance of the GPU, this model significantly reduces the work required to port an existing code to any accelerator platform, including GPUs. The purpose of this research is twofold: to investigate the effectiveness of OpenACC in developing a portable and maintainable GPU-accelerated code, and to determine the capability of OpenACC to accelerate large, complex programs on the GPU. In both of these studies, the OpenACC implementation is optimized and extended to a multi-GPU implementation while maintaining a unified code base. OpenACC is shown to be a viable option for GPU computing with CFD problems.
In the first study, a CFD code that solves incompressible cavity flows is accelerated using OpenACC. Overlapping communication with computation improves performance for the multi-GPU implementation by up to 21%, achieving up to 400 times faster performance than a single CPU and 99% weak scalability efficiency with 32 GPUs.
The second study ports the execution of a more complex CFD research code to the GPU using OpenACC. Challenges using OpenACC with modern Fortran are discussed. Three test cases are used to evaluate performance and scalability. The multi-GPU performance using 27 GPUs is up to 100 times faster than a single CPU and maintains a weak scalability efficiency of 95%. / Master of Science / The research and analysis performed in scientific computing today produce an ever-increasing demand for faster and more energy-efficient performance. Parallel computing with supercomputers that use many central processing units (CPUs) is the current standard for satisfying these demands. The use of graphics processing units (GPUs) for scientific computing applications is an emerging technology that has gained a lot of popularity in the past decade. A single GPU can distribute the computations required by a program over thousands of processing units.
This research investigates the effectiveness of a relatively new standard, called OpenACC, for offloading execution of a program to the GPU. The most widely used standards today are highly complex and require low-level, detailed knowledge of the GPU’s architecture. These issues significantly reduce the maintainability and portability of a program. OpenACC does not require rewriting a program for the GPU. Instead, the developer annotates regions of code to run on the GPU and only has to denote high-level information about how to parallelize the code.
This research found that even for a complex program that models air flows, using OpenACC to run the program on 27 GPUs increases performance by a factor of 100 over a single CPU and by a factor of 4 over 27 CPUs. Although higher performance is expected with other GPU programming standards, these results were accomplished with minimal change to the original program. Therefore, these results demonstrate the ability of OpenACC to improve performance while keeping the program maintainable and portable.
|
33 |
Otimização de multidões em jogos digitais utilizando CUDA (Crowd Optimization in Digital Games Using CUDA)
Bardella, Tiago Ungaro 19 October 2015
From the very beginning, the history of digital games has included games that use many kinds of enemy models to confront and many kinds of character models to control, such as Real-Time Strategy games. The large numbers of models that make up an important scene are called crowds. Crowds demand high computational power and specific algorithms for their handling, to avoid the loss of immersion that can occur in a game if they are not treated properly. With the emergence of graphics-card languages such as NVIDIA CUDA, new algorithms were created to improve the performance of crowds in digital games, and their superiority over methods based on sequential programming has been demonstrated in several studies. The goal of this work is to build on these GPU techniques to implement a new API using CUDA that aims to improve on existing crowd-handling algorithms for digital games in terms of performance and simplicity of implementation. With the project concluded, the resulting API made crowd handling easier for digital game developers using the Unity3D game engine integrated with the TBX crowd simulation API: they now only need to include a DLL in their project instead of writing their own crowd-handling algorithm from scratch, which takes considerable development time.
|
34 |
Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
Pandit, Prasanna Vasant January 2013
Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPUs) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well.
In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL. We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case.
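The idea of dynamic work distribution across two devices can be illustrated with a small simulation. The sketch below assumes one simple policy, in which two devices drain the work-groups of a single kernel from opposite ends of the index range until they meet; this policy, the function name `cooperative_schedule`, and the per-work-group costs are assumptions made for illustration, and the sketch ignores the data-transfer and merging overheads that the abstract says the runtime accounts for. It is not FluidiCL's implementation.

```python
def cooperative_schedule(num_groups, gpu_cost, cpu_cost):
    """Simulate two devices sharing one kernel's work-groups.

    gpu_cost, cpu_cost: hypothetical time each device needs per work-group.
    The GPU takes work-groups from the low end of the index range and the
    CPU from the high end, so no static split has to be chosen in advance.
    Returns (work-groups run by the GPU, work-groups run by the CPU).
    """
    lo, hi = 0, num_groups - 1
    gpu_time = cpu_time = 0.0
    gpu_done = cpu_done = 0
    while lo <= hi:
        # Whichever device would finish its next work-group first takes it.
        if gpu_time + gpu_cost <= cpu_time + cpu_cost:
            gpu_time += gpu_cost
            lo += 1
            gpu_done += 1
        else:
            cpu_time += cpu_cost
            hi -= 1
            cpu_done += 1
    return gpu_done, cpu_done


# Usage: a GPU that is three times faster ends up with three quarters of the work.
split = cooperative_schedule(100, gpu_cost=1.0, cpu_cost=3.0)
```

Because the split emerges from how fast each device actually progresses, the same loop adapts if the relative speeds change with the input or with system load, which is the motivation the abstract gives for avoiding a static distribution.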
|
35 |
Investigation of hierarchical deep neural network structure for facial expression recognition
Motembe, Dodi 01 1900
Facial expression recognition (FER) remains challenging, and machines struggle to interpret the dynamic shifts in facial expressions of human emotions effectively. Existing systems that have proven effective rely on deep network structures that need powerful and expensive hardware. The deeper the network, the longer training and testing take, and many systems use expensive GPUs to speed up the process. To address these challenges while still improving recognition accuracy, we create a generic hierarchical structure with variable settings. This generic structure has a hierarchy of three convolutional blocks, two dropout blocks and one fully connected block. From it we derived four network structures (cases) to be compared by performance, and from each case we derived six network structures by varying the parameters under analysis: the size of the filters of the convolutional maps and of the max-pooling, and the number of convolutional maps. In total, we investigate 24 network structures, six per case. After many repeated experiments, case 1a emerged as the top performer of group 1, and case 2a, case 3c and case 4c outperformed the others in their respective groups. Comparing the winners of the four groups indicates that case 2a is the optimal structure with optimal parameters, outperforming the other group winners. The best network structure was chosen by considering the minimum, average and maximum accuracy over 15 repeated training runs and an analysis of the results. All 24 proposed network structures were tested using two of the most widely used FER datasets, CK+ and JAFFE. After repeated simulations, the results demonstrate that our inexpensive optimal network architecture achieved 98.11 % accuracy on the CK+ dataset. We also tested our optimal network architecture on the JAFFE dataset, where the experimental results show 84.38 % accuracy using just a standard CPU and simpler procedures. We also compared the four group winners with the performance of other existing FER models reported recently in two studies. These FER models used the same two datasets, CK+ and JAFFE. Three of our four group winners (case 1a, case 2a and case 4c) recorded only 1.22 % less than the accuracy of the top-performing model on the CK+ dataset, and two of our network structures, case 2a and case 3c, came in third, beating other models on the JAFFE dataset. / Electrical and Mining Engineering
|
36 |
Ray Tracing Bézier Surfaces on GPU
Löw, Joakim January 2006
In this report, we show how to implement direct ray tracing of Bézier surfaces on graphics processing units (GPUs), in particular bicubic rectangular Bézier surfaces and nonparametric cubic Bézier triangles. We use Newton’s method for the rectangular case and show how to use this method to find the ray-surface intersection. For Newton’s method to work we must build a spatial partitioning hierarchy around each surface patch, and in general, hierarchies are essential to speed up the process of ray tracing. We have chosen to use bounding box hierarchies and show how to implement stackless traversal of such a structure on a GPU. For the nonparametric triangular case, we show how to find the wanted intersection by simply solving a cubic polynomial. Because of the limited precision of current GPUs, we also propose a numerical approach to solve the problem, using a one-dimensional Newton search.
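The one-dimensional Newton search mentioned for the triangular case can be sketched as follows. The example finds a root of a cubic polynomial in a single variable t, standing in for the ray parameter; the function name `newton_cubic`, the coefficient layout, the tolerance, and the starting guess are assumptions for illustration, and the report's own formulation of the cubic is not given in the abstract.

```python
def newton_cubic(coeffs, t0, tol=1e-12, max_iter=50):
    """Find a root of a*t^3 + b*t^2 + c*t + d with Newton's method.

    coeffs: (a, b, c, d); t0: starting guess for the root.
    Illustrative sketch; convergence depends on a reasonable t0.
    """
    a, b, c, d = coeffs
    t = t0
    for _ in range(max_iter):
        f = ((a * t + b) * t + c) * t + d      # polynomial value (Horner form)
        df = (3.0 * a * t + 2.0 * b) * t + c   # derivative value
        if df == 0.0:
            break  # flat tangent: Newton's update is undefined here
        step = f / df
        t -= step
        if abs(step) < tol:
            break
    return t


# Usage: the real root of t^3 - 2t - 5, starting from t = 2.
root = newton_cubic((1.0, 0.0, -2.0, -5.0), 2.0)
```

Newton's method only converges from a starting guess that is close enough to the root, which is one reason a bounding hierarchy around each patch is useful: it narrows the interval from which the search starts.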
|
38 |
Real-time Arbitrary View Rendering From Stereo Video And Time-of-flight Camera
Ates, Tugrul Kagan 01 January 2011
Generating in-between images from multiple views of a scene is a crucial task for both the computer vision and computer graphics fields. Photorealistic rendering, 3DTV and robot navigation are some of the many applications that benefit from arbitrary view synthesis, if it is achieved in real time. Most modern commodity computer architectures include programmable processing chips, called Graphics Processing Units (GPUs), which are specialized in rendering computer-generated images. These devices excel at achieving high computation power by processing arrays of data in parallel, which makes them ideal for real-time computer vision applications. This thesis focuses on an arbitrary view rendering algorithm that uses two high-resolution color cameras along with a single low-resolution time-of-flight depth camera and matches the programming paradigms of GPUs to achieve real-time processing rates. The proposed method is divided into two stages. Depth estimation through fusion of stereo vision and time-of-flight measurements forms the data acquisition stage, and the second stage is intermediate view rendering from 3D representations of scenes. The ideas presented are examined in a common experimental framework and the practical results attained are put forward. Based on the experimental results, it can be concluded that it is possible to realize the content production and display stages of a free-viewpoint system in real time by using only low-cost commodity computing devices.
|
39 |
Pricing of American Options by Adaptive Tree Methods on GPUs
Lundgren, Jacob January 2015
An assembled algorithm for pricing American options with absolute, discrete dividends using adaptive lattice methods is described. Considerations for hardware-conscious programming on both CPU and GPU platforms are discussed, to provide a foundation for the investigation of several approaches for deploying the program onto GPU architectures. The performance results of the approaches are compared with those of a CPU reference implementation, and with each other. In particular, an approach is described that designates subtrees to be calculated in parallel by allowing overlapping elements to be calculated more than once. Among the examined methods, this attains the best performance results in a "realistic" region of calculation parameters. A fifteen- to thirty-fold improvement in performance over the CPU reference implementation is observed as the problem size grows sufficiently large.
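As background for the lattice methods this abstract refers to, the following is a minimal sketch of the plain Cox-Ross-Rubinstein binomial tree for an American put. It is deliberately simpler than the algorithm described above: it has no adaptive mesh and no discrete dividends, and the function name `american_put_binomial` and the parameter values in the usage line are assumptions chosen for illustration.

```python
import math


def american_put_binomial(s0, strike, rate, sigma, maturity, steps):
    """Price an American put on a non-dividend-paying asset with a CRR tree.

    Plain, non-adaptive lattice; a baseline sketch, not the thesis algorithm.
    """
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    disc = math.exp(-rate * dt)
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability

    # Payoffs at maturity; node j has seen j up-moves and (steps - j) down-moves.
    values = [max(strike - s0 * up ** j * down ** (steps - j), 0.0)
              for j in range(steps + 1)]

    # Backward induction: at each node take the better of holding and exercising.
    for step in range(steps - 1, -1, -1):
        for j in range(step + 1):
            hold = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = strike - s0 * up ** j * down ** (step - j)
            values[j] = max(hold, exercise)
    return values[0]


# Usage: an at-the-money put with one year to maturity.
price = american_put_binomial(100.0, 100.0, 0.05, 0.2, 1.0, 200)
```

Within one time level, every node depends only on two nodes of the next level, so the inner loop is parallel across nodes. The dependency between levels is what makes a GPU mapping non-trivial and motivates splitting the tree into subtrees with some recomputation at the overlaps.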
|
40 |
A parallel model for the heterogeneous computation of radio astronomy signal correlation
Harris, Christopher John January 2009
The computational requirements of scientific research are constantly growing. In the field of radio astronomy, observations have evolved from using single telescopes to interferometer arrays of many telescopes, and arrays of massive scale are currently under development. These interferometers use signal and image processing to produce data that is useful to radio astronomy, and the amount of processing required scales quadratically with the scale of the array. Traditional computational approaches will be unable to meet this demand in the near future. This thesis explores the use of heterogeneous parallel processing to meet the computational demands of radio astronomy. In heterogeneous computing, multiple hardware architectures are used for processing. In this work, the Graphics Processing Unit (GPU) is used as a co-processor along with the Central Processing Unit (CPU) for the computation of signal processing algorithms. Specifically, the suitability of the GPU to accelerate the correlator algorithms used in radio astronomy is investigated. This work first implemented an FX correlator on the GPU, with a performance increase of one to two orders of magnitude over a serial CPU approach. The FX correlator algorithm combines pairs of telescope signals in the Fourier domain. Given N telescope signals from the interferometer array, N^2 conjugate multiplications must be calculated in the algorithm. For extremely large arrays (N >> 30), this is a huge computational requirement. Testing shows that the GPU correlator produces results equivalent to those of a software correlator implemented on the CPU. However, the algorithm itself is adapted in order to take advantage of the processing power of the GPU. The research examined how correlator parameters, in particular the number of telescope signals and the Fast Fourier Transform (FFT) length, affected the results.
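The two stages of an FX correlator described in this abstract, a Fourier transform of each signal (F) followed by conjugate multiplication of signal pairs (X), can be sketched in a few lines. The sketch uses a naive discrete Fourier transform instead of an FFT to stay dependency-free, computes only the unique pairs including autocorrelations, and omits time averaging; the function names `dft` and `fx_correlate` and the test signals are assumptions for illustration, not the thesis implementation.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fx_correlate(signals):
    """F step: transform every signal. X step: conjugate-multiply each pair.

    Returns a dict mapping (i, j) with i <= j to the cross spectrum
    S_i[k] * conj(S_j[k]) for every frequency channel k.
    """
    spectra = [dft(s) for s in signals]
    visibilities = {}
    for i in range(len(spectra)):
        for j in range(i, len(spectra)):
            visibilities[(i, j)] = [a * b.conjugate()
                                    for a, b in zip(spectra[i], spectra[j])]
    return visibilities


# Usage: two antennas see the same tone, the second delayed by one sample.
n = 8
s0 = [math.cos(2 * math.pi * t / n) for t in range(n)]
s1 = [math.cos(2 * math.pi * (t - 1) / n) for t in range(n)]
vis = fx_correlate([s0, s1])
delay_phase = cmath.phase(vis[(0, 1)][1])  # phase in the channel holding the tone
```

The phase of each cross spectrum encodes the relative delay between the two signals, and the pairwise loop is where the quadratic growth in work with the number of telescopes comes from. Each (pair, channel) product is independent of the others, which is what makes the X step a natural fit for a GPU.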
|