1 |
Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms
Somers, Gregory W. January 2016
Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit their sparsity pattern from the circuit connectivity, but they are also characterized by dense spots or blocks. Direct factorization of such matrices has emerged as an attractive approach when the host memory is large enough to store the block-structured matrix. The approach proposed in this thesis aims to accelerate the direct factorization of general block-structured matrices by leveraging the power of multiple OpenCL accelerators such as Graphics Processing Units (GPUs).
The proposed approach uses a Directed Acyclic Graph representation of the matrix to schedule its factorization on multiple accelerators. This thesis also describes memory management techniques that enable handling large matrices while minimizing the amount of data transferred over the PCIe bus between the host CPU and the attached devices. The results demonstrate that, with two GPUs, the proposed approach can achieve a nearly optimal speedup over a single-GPU platform.
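The idea of scheduling a factorization from a dependency graph can be illustrated with a small sketch. This is not the thesis's scheduler: the task names, the 2x2 block layout, and the round-robin device assignment are all assumptions made for illustration. It only shows how a DAG keeps each block task from starting before the tasks it depends on have been issued.

```python
from collections import deque

def schedule_dag(deps, num_devices=2):
    """Order tasks so dependencies come first, assigning devices round-robin.

    deps maps a task name to the list of tasks it must wait for.
    Returns a list of (task, device) pairs in a valid execution order.
    """
    # Count unmet dependencies and record which tasks each task unblocks.
    pending = {task: len(parents) for task, parents in deps.items()}
    children = {task: [] for task in deps}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    order = []
    next_device = 0
    while ready:
        task = ready.popleft()
        order.append((task, next_device))
        next_device = (next_device + 1) % num_devices
        for child in sorted(children[task]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return order

# Hypothetical 2x2 block LU: factor the leading block, solve the two
# off-diagonal panels (independent of each other), then update and
# factor the trailing block.
block_lu = {
    "factor_A11": [],
    "solve_A12": ["factor_A11"],
    "solve_A21": ["factor_A11"],
    "update_A22": ["solve_A12", "solve_A21"],
    "factor_A22": ["update_A22"],
}
plan = schedule_dag(block_lu)
```

In this toy plan the two panel solves land on different devices because they become ready at the same time. A practical scheduler would also weigh data locality and PCIe transfer cost, which this sketch ignores.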
|
2 |
A Multi-GPU Compute Solution for Optimized Genomic Selection Analysis
Devore, Trevor 01 June 2014
Many modern-day bioinformatics algorithms rely heavily on statistical models to analyze their biological data. Some of these statistical models lend themselves nicely to standard high performance computing optimizations such as parallelism, while others do not. One such algorithm is Markov Chain Monte Carlo (MCMC). In this thesis, we present a heterogeneous compute solution for optimizing GenSel, a genetic selection analysis tool. GenSel utilizes an MCMC algorithm to perform Bayesian inference using Gibbs sampling.
Optimizing an MCMC algorithm is a difficult problem because it is inherently sequential, containing a loop-carried dependence between successive Markov chain iterations. The optimization presented in this thesis utilizes GPU computing to exploit the data-level parallelism within each of these iterations. In addition, it allows for the efficient management of memory, the pipelining of CUDA kernels, and the use of multiple GPUs. The optimizations presented show performance improvements of up to 1.84 times that of the original algorithm.
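The contrast between the sequential chain and the parallel work inside each iteration can be sketched as follows. This is a generic Gibbs sampler for regression effects with a normal prior, written in plain Python. It is not GenSel's code, and the function names, the prior parameter `lam`, and the toy data are assumptions. The outer loop cannot be parallelized because each draw uses the previous state, whereas the dot products and the residual update are independent element-wise operations of the kind a GPU kernel could handle.

```python
import random

def dot(u, v):
    # Data-parallel reduction: every product is independent of the others.
    return sum(a * b for a, b in zip(u, v))

def gibbs_effects(columns, y, iterations, lam=1.0, sigma2=1.0, seed=0):
    """Sample regression effects one at a time from their full conditionals."""
    rng = random.Random(seed)
    effects = [0.0] * len(columns)
    residual = list(y)  # equals y - X @ effects while all effects are zero
    chain = []
    for _ in range(iterations):  # sequential: depends on the previous iteration
        for j, col in enumerate(columns):
            old = effects[j]
            lhs = dot(col, col) + lam
            # x_j' (y - X b + x_j b_j): add effect j back into the residual.
            rhs = dot(col, residual) + dot(col, col) * old
            new = rng.gauss(rhs / lhs, (sigma2 / lhs) ** 0.5)
            # Element-wise residual update: data-parallel across records.
            residual = [r - c * (new - old) for r, c in zip(residual, col)]
            effects[j] = new
        chain.append(list(effects))
    return chain

# Hypothetical toy data: two covariates, eight records.
columns = [[1, 0, 1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1, 0, 1]]
y = [2.0, -1.0, 2.0, -1.0, 2.0, -1.0, 2.0, -1.0]
chain = gibbs_effects(columns, y, iterations=50)
```

With thousands of records per covariate, the work inside `dot` and the residual update dominates the run time, which is why parallelizing within an iteration can pay off even though the chain itself stays serial.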
|
3 |
Multi-GPU Load Balancing for Simulation and Rendering
Hagan, Robert Douglas 04 August 2011
GPU computing can significantly improve performance by exploiting the massive parallelism of GPUs for data-parallel applications. Computation in visualization applications is well suited to parallelization on the GPU, which can improve both performance and interactivity. If used effectively, multiple GPUs can deliver a significant speedup over a single GPU. However, using multiple GPUs requires memory management, scheduling, and load balancing to ensure that a program takes full advantage of the available processors. This work presents methods for data-driven and dynamic multi-GPU load balancing using a pipelined approach, together with a framework for applying them to different applications. Data-driven load balancing can improve utilization by taking into account past performance for different combinations of input parameters. The dynamic load balancing method, based on buffer fullness, adjusts to workload changes at runtime to gain an additional performance improvement. The framework accounts for the differing characteristics of applications, and a multi-GPU data structure allows these load balancing methods to be used within it. The effectiveness of the framework is demonstrated with performance results from interactive visualization, which show a significant speedup due to load balancing. / Master of Science
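Load balancing driven by buffer fullness can be sketched in a few lines. This is an illustration only, not the thesis's framework: the function names, the fixed `capacity`, and the tie-breaking rule (lowest device index wins) are assumptions. Each new work item goes to the device whose input buffer is currently least full.

```python
def pick_device(buffers, capacity):
    """Return the index of the least-full buffer, or None if all are full."""
    fullness = [len(b) / capacity for b in buffers]
    best = min(range(len(buffers)), key=lambda i: fullness[i])
    return None if fullness[best] >= 1.0 else best

def dispatch(work_items, buffers, capacity):
    """Append each work item to the least-full buffer and record the choice."""
    assigned = []
    for item in work_items:
        device = pick_device(buffers, capacity)
        if device is None:
            break  # every buffer is full; a real pipeline would wait here
        buffers[device].append(item)
        assigned.append((item, device))
    return assigned

# Hypothetical starting state: device 0 already holds more queued work.
buffers = [["a", "b", "c"], ["d"]]
assigned = dispatch(["w0", "w1", "w2", "w3"], buffers, capacity=4)
```

Here the lightly loaded device receives work until the buffers even out. In a running system the buffers drain as the GPUs consume items, so a slower device is naturally handed less new work.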
|
4 |
Enhancements to Reconstruction Techniques in Computed Tomography Using High Performance Computing
Eliuk, Steven N Unknown Date
No description available.
|
5 |
Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms
AlOnazi, Amani 02 1900
The progress of high performance computing platforms is dramatic, and most of the simulations carried out on these platforms result in improvements on one level, yet expose shortcomings of current CFD packages. Therefore, hardware-aware design and optimization are crucial to exploiting modern computing resources. This thesis proposes optimizations aimed at accelerating numerical simulations, which are illustrated in OpenFOAM solvers. A hybrid MPI and GPGPU parallel conjugate gradient linear solver has been designed and implemented to solve the sparse linear algebraic kernel that derives from two CFD solvers: icoFoam, which is an incompressible flow solver, and laplacianFoam, which solves the Poisson equation, e.g., for thermal diffusion. A load-balancing step is applied using heterogeneous decomposition, which decomposes the computations taking into account the performance of each computing device and seeking to minimize communication. In addition, we implemented the recently developed pipelined conjugate gradient as an algorithmic improvement, and parallelized it using MPI, GPGPU, and a hybrid technique. While many questions of ultimately attainable per-node performance and multi-node scaling remain, the experimental results show that the hybrid implementation of both solvers significantly outperforms state-of-the-art implementations of a widely used open source package.
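For readers unfamiliar with the solver being accelerated, the following is a minimal serial sketch of the classic conjugate gradient iteration for a symmetric positive definite system. It is neither the thesis's hybrid MPI/GPGPU implementation nor the pipelined variant. The dense list-of-lists matrix and the 2x2 test system are assumptions chosen to keep the example short.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        # Matrix-vector product: the dominant cost for large sparse systems.
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Small SPD example: 4x + y = 1 and x + 3y = 2, so x = 1/11 and y = 7/11.
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration needs one matrix-vector product and two dot products. In a distributed setting the dot products become global reductions, and those synchronization points are what pipelined formulations aim to overlap with other work.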
|
6 |
Adéquation Algorithme Architecture et modèle de programmation pour l'implémentation d'algorithmes de traitement du signal et de l'image sur cluster multi-GPU / Programming model for the implementation of 2D-3D image processing applications on a hybrid CPU-GPU cluster
Boulos, Vincent 18 December 2012
Originally designed to relieve the CPU of graphics rendering tasks, the GPU has become a massively parallel architecture suited to processing large volumes of data. Although it has won a significant market share in High Performance Computing, an Algorithm-Architecture Matching approach is still required to implement an algorithm efficiently on a GPU. The contribution of this thesis is twofold.
Firstly, we present the significant gain provided by an optimized implementation of a granulometry algorithm (computation time decreases from several hours to less than a minute for a volume of 1024³ voxels). An analytical model establishing the performance variations of the granulometry application on GPU is also presented. We believe it can be extended to other regular algorithms.
Secondly, deploying Signal and Image Processing applications on a multi-GPU cluster can be a tedious task for the programmer. To help, we developed a library that reduces the scope of the programmer's contribution to the development. The remaining tasks are decomposing the application into a Data Flow Graph and giving mapping annotations so that the tool can automatically dispatch tasks to the processing elements (GPP or GPU). The throughput of a streaming application that computes visual saliency maps is then improved thanks to the efficient implementation that our tool provides on a multi-GPU cluster. To permit dynamic load balancing, a task migration method has also been incorporated into the tool.
|
7 |
Simulações computacionais de arritmias cardíacas em ambientes de computação de alto desempenho do tipo Multi-GPU
Barros, Bruno Gouvêa de 25 February 2013
FAPEMIG - Fundação de Amparo à Pesquisa do Estado de Minas Gerais
Computer models have become valuable tools for the study and comprehension of the complex phenomena of cardiac electrophysiology. However, the high complexity of the biophysical processes and the microscopic level of detail demand complex mathematical and computational models. Key aspects of cardiac electrophysiology, such as slow conduction, conduction block, and saltatory effects, have been the research topic of many studies, since they are strongly related to cardiac arrhythmia. However, to reproduce these phenomena, the numerical models need to use sub-cellular discretization for the solution of the PDEs and a nonuniform, heterogeneous tissue electric conductivity. Due to the high computational costs of simulations that reproduce the fine microstructure of cardiac tissue, previous studies have considered tissue experiments of small or moderate sizes and used simple cardiac cell models. In this work we develop a cardiac electrophysiology model (the microscopic model) that captures the microstructure of cardiac tissue by using a very fine spatial discretization (8 µm) and uses a modern and complex cell model based on Markov chains to characterize the structure and dynamics of ion channels. To cope with the computational challenges, the model was parallelized using a hybrid approach: cluster computing and GPGPUs (general-purpose computing on graphics processing units). Our parallel implementation of this model on a multi-GPU platform was able to reduce the execution times of the simulations from more than 6 days (on a single processor) to 21 minutes (on a small 8-node cluster equipped with 16 GPUs). Furthermore, to further decrease the computational cost, we developed a discrete model equivalent to the microscopic one. This new model was parallelized using the same approach as the microscopic model and was able to run simulations that previously took 21 minutes in just 65 seconds. We believe that this new parallel implementation paves the way for the investigation of many open questions associated with the complex and discrete nature of action potential propagation in cardiac tissue.
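The reported run times translate into speedup factors with simple arithmetic. The short calculation below uses only the figures quoted in the abstract. Because the baseline is stated as "more than 6 days", the first and third ratios are lower bounds, and the variable names are chosen here purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60

baseline_min = 6 * MINUTES_PER_DAY  # "more than 6 days" on a single processor
cluster_min = 21                    # microscopic model on 16 GPUs
discrete_s = 65                     # equivalent discrete model on the same setup

# Lower bound on the speedup from parallelizing the microscopic model.
speedup_parallel = baseline_min / cluster_min
# Additional gain from switching to the equivalent discrete model.
speedup_discrete = (cluster_min * 60) / discrete_s
# Combined lower bound relative to the single-processor baseline.
speedup_total = (baseline_min * 60) / discrete_s

print(f"parallel microscopic model: at least {speedup_parallel:.0f}x")
print(f"discrete model vs. microscopic: about {speedup_discrete:.1f}x")
print(f"combined: at least {speedup_total:.0f}x")
```

So the multi-GPU parallelization accounts for a factor of roughly 411 or more, and the discrete model contributes a further factor of about 19.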
|
8 |
Real-time Visualization of Massive 3D Models on GPU Parallel Architectures
Peng, Chao 24 April 2013
Real-time rendering of massive 3D models has been recognized as a challenging task due to the limited computational power and memory available in a workstation. Most existing acceleration techniques, such as mesh simplification algorithms with hierarchical data structures, are limited by their inherently sequential execution. As data complexity increases with advances in modeling and simulation technologies, 3D models become more complex and require gigabytes of storage. Consequently, visualizing such large datasets becomes a computationally intensive process for which sequential solutions cannot satisfy the demands of real-time rendering.
Recently, the Graphics Processing Unit (GPU) has been praised as a massively parallel architecture, not only for its significant improvements in performance but also for its programmability for general-purpose computation. Today's GPUs allow researchers to solve problems with fine-grained parallel implementations. In this dissertation, I concentrate on the design of parallel algorithms for real-time rendering of massive 3D polygonal models on modern GPU architectures. As a result, the delivered rendering system supports high-performance visualization of 3D models composed of hundreds of millions of polygons on a single commodity workstation. / Ph. D.
|
9 |
Multi-level Parallelism with MPI and OpenACC for CFD Applications
McCall, Andrew James 14 June 2017
High-level parallel programming approaches, such as OpenACC, have recently become popular in complex fluid dynamics research since they are cross-platform and easy to implement. OpenACC is a directive-based programming model that, unlike low-level programming models, abstracts the details of implementation on the GPU. Although OpenACC generally limits the performance of the GPU, this model significantly reduces the work required to port an existing code to any accelerator platform, including GPUs. The purpose of this research is twofold: to investigate the effectiveness of OpenACC in developing a portable and maintainable GPU-accelerated code, and to determine the capability of OpenACC to accelerate large, complex programs on the GPU. In both of these studies, the OpenACC implementation is optimized and extended to a multi-GPU implementation while maintaining a unified code base. OpenACC is shown to be a viable option for GPU computing with CFD problems.
In the first study, a CFD code that solves incompressible cavity flows is accelerated using OpenACC. Overlapping communication with computation improves performance for the multi-GPU implementation by up to 21%, achieving up to 400 times faster performance than a single CPU and 99% weak scalability efficiency with 32 GPUs.
The second study ports the execution of a more complex CFD research code to the GPU using OpenACC. Challenges using OpenACC with modern Fortran are discussed. Three test cases are used to evaluate performance and scalability. The multi-GPU performance using 27 GPUs is up to 100 times faster than a single CPU and maintains a weak scalability efficiency of 95%. / Master of Science / The research and analysis performed in scientific computing today produce an ever-increasing demand for faster and more energy-efficient performance. Parallel computing with supercomputers that use many central processing units (CPUs) is the current standard for satisfying these demands. The use of graphics processing units (GPUs) for scientific computing applications is an emerging technology that has gained a lot of popularity in the past decade. A single GPU can distribute the computations required by a program over thousands of processing units.
This research investigates the effectiveness of a relatively new standard, called OpenACC, for offloading execution of a program to the GPU. The most widely used standards today are highly complex and require low-level, detailed knowledge of the GPU's architecture. These issues significantly reduce the maintainability and portability of a program. OpenACC does not require rewriting a program for the GPU. Instead, the developer annotates regions of code to run on the GPU and only has to denote high-level information about how to parallelize the code.
This research found that, even for a complex program that models air flows, using OpenACC to run the program on 27 GPUs increases performance by a factor of 100 over a single CPU and by a factor of 4 over 27 CPUs. Although higher performance is expected with other GPU programming standards, these results were accomplished with minimal change to the original program. Therefore, these results demonstrate the ability of OpenACC to improve performance while keeping the program maintainable and portable.
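The weak scalability efficiency quoted in this abstract is commonly defined as the ratio of single-device run time to N-device run time when the problem size grows in proportion to N. The sketch below uses that common definition, which is an assumption about how the thesis measures it, and the timings are hypothetical values rather than figures from the thesis.

```python
def weak_scaling_efficiency(time_one_device, time_n_devices):
    """Efficiency when the work per device is held constant as devices are added."""
    return time_one_device / time_n_devices

# Hypothetical timings in seconds for a fixed per-GPU problem size.
time_1_gpu = 10.0
time_27_gpus = 10.5  # slightly slower because of communication overhead

efficiency = weak_scaling_efficiency(time_1_gpu, time_27_gpus)
print(f"weak scaling efficiency: {efficiency:.1%}")
```

Under this definition, an efficiency of 95% means the run on N devices takes about 5% longer than the single-device run while solving a problem N times as large.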
|
10 |
Repeatable high-resolution statistical downscaling through deep learning
Quesada-Chacón, Dánnell, Barfus, Klemens, Bernhofer, Christian 04 June 2024
One of the major obstacles to designing solutions to the imminent climate crisis is the scarcity of model projections with high spatio-temporal resolution for variables such as precipitation. This kind of information is crucial for impact studies in fields like hydrology, agronomy, ecology, and risk management. The datasets with the highest spatial resolution currently available on a daily scale for projected conditions fail to represent complex local variability. We used deep-learning-based statistical downscaling methods to obtain daily 1 km resolution gridded data for precipitation in the Eastern Ore Mountains in Saxony, Germany. We built upon the well-established climate4R framework, adding modifications to its code base and introducing deep learning architectures based on skip connections, such as U-Net and U-Net++. We also aimed to address the known general reproducibility issues by creating a containerized environment with support for multiple GPUs (graphics processing units) and TensorFlow's deterministic operations. The perfect prognosis approach was applied using the ERA5 reanalysis and the ReKIS (Regional Climate Information System for Saxony, Saxony-Anhalt, and Thuringia) dataset. The results were validated with the robust VALUE framework. The introduced architectures show a clear performance improvement when compared to previous statistical downscaling benchmarks. The best performing architecture had only a small increase in the total number of parameters relative to the benchmark, and a training time of less than 6 min with one NVIDIA A100 GPU. Characteristics of the deep learning model configurations that promote their suitability for this specific task were identified, tested, and discussed. Full model repeatability was achieved when employing the same physical GPU, which is key to building trust in deep learning applications. The EURO-CORDEX dataset is meant to be coupled with the trained models to generate a high-resolution ensemble, which can serve as input to multi-purpose impact models.
|