Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Daga, Mayank 06 June 2011
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high-performance computing. The popularity of such systems is evident from the fact that three of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications tuned for erstwhile homogeneous computing may not be efficient on today's heterogeneous systems and that novel optimization strategies are therefore required. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily because of the architectural differences among the computational units in such systems. This thesis is intended as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as accelerators. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied, but when applied to AMD GPUs they fail to measurably improve performance because of differences in the underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs, there exists a lesser-known but extremely severe performance pitfall called partition camping, which can degrade application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications.
We have used a large-scale, molecular modeling application to validate and verify all the optimization strategies. Our results illustrate that if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvement, respectively, over a hand-tuned, SSE-optimized serial implementation. / Master of Science

On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL

Sathre, Paul Daniel 12 June 2013
The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor). The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, there exist sufficient differences that make the translation non-trivial, providing practical limitations to both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of the sample body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support for the translator, followed by an evaluation that demonstrated the performance of our automatically-translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device. / Master of Science
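To make the "conceptual and syntactic overlap" between CUDA and OpenCL concrete, the sketch below rewrites a handful of CUDA kernel constructs into their usual OpenCL counterparts by plain text substitution. It is an illustration only, not the translator evaluated in the thesis: the mapping table `CUDA_TO_OPENCL`, the function `translate_kernel_source`, and the sample kernel are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny mapping table (an assumption for this sketch).
# Each CUDA construct on the left has the OpenCL C counterpart on the right.
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
}


def translate_kernel_source(cuda_src: str) -> str:
    """Replace every known CUDA token with its OpenCL counterpart."""
    # Try longer tokens first so that no token is shadowed by a shorter one.
    tokens = sorted(CUDA_TO_OPENCL, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(token) for token in tokens))
    return pattern.sub(lambda match: CUDA_TO_OPENCL[match.group(0)], cuda_src)


cuda_kernel = """
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
"""

opencl_kernel = translate_kernel_source(cuda_kernel)
print(opencl_kernel)
# Note: the result is still not valid OpenCL C, because the pointer argument
# lacks the __global address-space qualifier. Textual substitution alone
# cannot infer that, which hints at why robust translation is non-trivial.
```

The missing `__global` qualifier in the output is a small instance of the kind of difference that a real source-to-source translator has to resolve with actual program analysis rather than token replacement.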

On the Enhancement of Remote GPU Virtualization in High Performance Clusters

Reaño González, Carlos 01 September 2017
Graphics Processing Units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general-purpose applications from different domains. However, GPUs also present several side effects, such as increased acquisition costs and larger space requirements. They also require more powerful energy supplies. Furthermore, GPUs still consume some energy while idle, and their utilization is usually low for most workloads. As with virtual machines, the use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism allows an application running in one node of the cluster to transparently use the GPUs installed in other nodes. Moreover, this technique allows the GPUs present in the computing facility to be shared among the applications running in the cluster. In this way, several applications running in different (or the same) cluster nodes can share one or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the side effects mentioned above. Reducing the total number of GPUs installed in the cluster may also be possible. In this dissertation we enhance one framework offering remote GPU virtualization capabilities, referred to as rCUDA, for use in high-performance clusters. While the initial prototype version of rCUDA demonstrated its functionality, it also revealed concerns with respect to usability, performance, and support for new GPU features, which prevented its use in production environments. These issues motivated this thesis, in which all the research is conducted primarily with the aim of turning rCUDA into a production-ready solution that can eventually be transferred to industry.
The new version of rCUDA resulting from this work presents a reduction of up to 35% in the execution time of the applications analyzed with respect to the initial version. Compared to the use of local GPUs, the overhead of this new version of rCUDA is below 5% for the applications studied when using the latest high-performance computing networks available. / Reaño González, C. (2017). On the Enhancement of Remote GPU Virtualization in High Performance Clusters [Doctoral thesis]. Universitat Politècnica de València.
https://doi.org/10.4995/Thesis/10251/86219 / Premios Extraordinarios de tesis doctorales

General Purpose Computing in GPU - a Watermarking Case Study

Hanson, Anthony 08 1900
The purpose of this project is to explore the GPU for general purpose computing. The GPU is a massively parallel computing device that has high throughput, exhibits high arithmetic intensity, and has a large market presence; with more computational power being added to it each year through innovations, the GPU is a strong candidate to complement the CPU in performing computations. The GPU follows the single instruction multiple data (SIMD) model for applying operations on its data. This model makes the GPU very useful for assisting the CPU in performing computations on data that is highly parallel in nature. The compute unified device architecture (CUDA) is a parallel computing and programming platform for NVIDIA GPUs. The main focus of this project is to show the power, speed, and performance of a CUDA-enabled GPU for digital video watermark insertion in the H.264 video compression domain. Digital video watermarking in general is a highly computationally intensive process that is strongly dependent on the video compression format in place. The H.264/MPEG-4 AVC video compression format has high compression efficiency at the expense of high computational complexity, and it leaves little room for an imperceptible watermark to be inserted. Employing a human visual model to limit the distortion and degradation of visual quality introduced by the watermark is a good choice when designing a video watermarking algorithm, though it does add computational complexity to the algorithm. Research is being conducted into how CPU-GPU execution of the digital watermark application, optimized using the NVIDIA Visual Profiler, can make the application several times faster than running it on a standalone CPU.

Performance prediction of application executed on GPUs using a simple analytical model and machine learning techniques / Predição de desempenho de aplicações executadas em GPUs usando um modelo analítico simples e técnicas de aprendizado de máquina

González, Marcos Tulio Amarís 25 June 2018
The parallel and distributed platforms of High Performance Computing available today have become more and more heterogeneous (CPUs, GPUs, FPGAs, etc.). Graphics Processing Units (GPUs) are specialized co-processors that accelerate and improve the performance of parallel vector operations. GPUs have a high degree of parallelism and can execute thousands or millions of threads concurrently and hide the latency of the scheduler. GPUs have a deep memory hierarchy with different types of memory as well as different configurations of these memories. Performance prediction of applications executed on these devices is a great challenge and is essential for the efficient use of resources in machines with these co-processors. There are different approaches for these predictions, such as analytical modeling and machine learning techniques. In this thesis, we present an analysis and characterization of the performance of applications executed on GPUs. We propose a simple and intuitive BSP-based model for predicting the execution times of CUDA applications on different GPUs. The model is based on the number of computations and memory accesses of the GPU, with additional information on cache usage obtained from profiling. We also compare three different Machine Learning (ML) approaches, Linear Regression, Support Vector Machines and Random Forests, with the BSP-based analytical model. This comparison is made in two contexts: first, with the same input data or features for the ML techniques as for the analytical model, and second, with features selected by a feature extraction process based on correlation analysis and hierarchical clustering. We show that GPU applications that scale regularly can be predicted with simple analytical models and an adjustment parameter. This parameter can be used to predict these applications on other GPUs.
We also demonstrate that ML approaches provide reasonable predictions for different cases and that ML techniques require no detailed knowledge of application code, hardware characteristics or explicit modeling. Consequently, whenever a large data set with information about similar applications is available or can be created, ML techniques can be useful for deploying automated on-line performance prediction for scheduling applications on heterogeneous architectures with GPUs.

Utilização de técnicas de GPGPU em sistema de vídeo-avatar. / Use of GPGPU techniques in a video-avatar system.

Tsuda, Fernando 01 December 2011
This work presents the results of research on and application of GPGPU (General-Purpose computation on Graphics Processing Units) techniques to the augmented-reality video-avatar system called AVMix. With the increasing demand for interactive three-dimensional graphics rendered in real time and ever closer to reality, GPUs (Graphics Processing Units) have evolved to their present state as high-powered computing hardware able to run algorithms in parallel over large volumes of data. It is therefore possible to use this capability to increase the performance of algorithms used in several areas, such as image processing and computer vision. Based on a survey of similar work, the CUDA (Compute Unified Device Architecture) architecture from Nvidia was chosen; it facilitates the implementation of programs that run on the GPU and at the same time makes their use more flexible by exposing to the programmer some hardware details, such as the number of processors allocated and the different types of memory. Following the reimplementation of the performance-critical routines of the AVMix system (depth map, segmentation and interaction), the results show the viability of using the GPU to process parallel algorithms in this application and the importance of evaluating the algorithm to be implemented with respect to the complexity of the calculation and the volume of data transferred between the GPU and the computer's main memory.

Lygiagretieji skaičiavimai naudojant vaizdo plokštes / Parallel computing using graphics cards

Juodaitis, Robertas 01 August 2013
This work compares the parallel computing capabilities of a graphics card and of MPI on a central processing unit, using two classic parallel algorithms: computing an approximate value of π and matrix multiplication. The MPI library is used for the central processing unit and CUDA for the graphics card. Much attention is paid to choosing a parallelization strategy that makes efficient use of both the MPI cluster and the graphics card. A suitable criterion for comparing these devices is identified: relative speedup, which objectively describes the computing capacity the graphics card achieves relative to the central processing unit. Analysis of the experimental results shows that the programmer should aim to reduce data exchange between processes, because communication lowers the efficiency of parallel algorithms. It also shows that programming with CUDA requires strict adaptation to the parameters of the graphics card and is more complex. As a result, a fully utilized graphics card with CUDA is faster not only than computers with a 4-core processor but also than a small cluster.
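The two ingredients named in this abstract, an approximate π computation and the relative speedup criterion, can be sketched as follows. This is a serial illustration under stated assumptions: the abstract does not say which π formulation was used, so the midpoint-rule integration of 4/(1 + x²) over [0, 1] (a common choice in parallel computing exercises) is assumed here, and the names `approximate_pi` and `relative_speedup` are hypothetical.

```python
def approximate_pi(num_intervals: int) -> float:
    """Approximate pi as the midpoint-rule integral of 4 / (1 + x^2) on [0, 1].

    Each term of the sum is independent of the others, which is what makes
    this loop easy to split across MPI processes or GPU threads.
    """
    width = 1.0 / num_intervals
    total = 0.0
    for i in range(num_intervals):
        x = (i + 0.5) * width  # midpoint of the i-th interval
        total += 4.0 / (1.0 + x * x)
    return total * width


def relative_speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of the baseline run time to the accelerated run time."""
    return baseline_seconds / accelerated_seconds


pi_estimate = approximate_pi(100_000)
print(f"pi is approximately {pi_estimate:.10f}")

# Placeholder timings for illustration only; these are not measurements
# from the thesis.
print(relative_speedup(baseline_seconds=8.0, accelerated_seconds=2.0))
```

A speedup above 1 means the accelerated device finished sooner than the baseline. In a parallel version each process or thread would sum its own subset of the intervals, and the partial sums would be combined in one final reduction, which keeps communication low in line with the abstract's advice.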

Vers une simulation par éléments finis en temps réel pour le génie électrique / Towards a real-time simulation by finite elements for electrical engineering

Dinh, Van Quang 15 December 2016
The physical phenomena in the field of electrical engineering are governed by Maxwell's equations, partial differential equations whose solutions are functions that depend on the material properties and satisfy certain boundary conditions on the study domain. The finite element method (FEM) is the most commonly used method to calculate the solutions of these equations and to deduce the magnetic and electric fields. Nowadays, parallel computing on graphics processors offers very high computing performance compared with traditional CPU computation. GPU-accelerated computing uses a graphics processing unit (GPU) together with a CPU to accelerate applications in science and engineering. It enables massively parallel tasks and thus improves performance by offloading the compute-intensive portions of the application to the GPU while the remainder of the application still runs on the CPU. This thesis deals with modeling in the field of electrical engineering using the finite element method. The aim of the thesis is to improve the performance of the FEM, and even to change the way it is used, by taking advantage of the high performance of parallel computing on the GPU. Indeed, if the calculation could be performed in near real time thanks to the GPU, simulation tools would become intuitive design tools that would make it possible, for example, to "feel" the sensitivity of a design to changes in geometric or physical parameters. A new field of use for simulation codes would then open up. This is the guiding thread of this work, which tries to accelerate the different phases of an FEM simulation as much as possible in order to make the whole almost instantaneous. Thus, in this thesis, meshing, numerical integration, assembly, resolution and post-processing are discussed in turn. For each phase, the methods in the literature are examined and new approaches are proposed. The performance achieved is analyzed and compared with that of the traditional CPU implementation. The implementation details are described quite closely, since the overall performance of the GPU approaches is closely linked to these choices.
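The phases listed in this abstract can be made concrete with a very small serial example. The sketch below is an assumption-laden illustration, not the thesis's GPU implementation: it solves the one-dimensional model problem -u'' = 1 on (0, 1) with u(0) = u(1) = 0 using linear elements, and the function name `solve_poisson_1d` and all variable names are hypothetical.

```python
def solve_poisson_1d(num_elements: int, source: float = 1.0):
    """Linear-element FEM for -u'' = source on (0, 1), u(0) = u(1) = 0.

    Requires num_elements >= 2 so that at least one interior node exists.
    """
    # Phase 1, meshing: a uniform mesh with num_elements + 1 nodes.
    h = 1.0 / num_elements
    nodes = [i * h for i in range(num_elements + 1)]
    n = len(nodes)

    # Phases 2 and 3, integration and assembly: each element contributes
    # the local stiffness (1/h) * [[1, -1], [-1, 1]] and the local load
    # (source * h / 2) * [1, 1] to the global tridiagonal system.
    diag = [0.0] * n
    off = [0.0] * (n - 1)  # off[i] couples node i and node i + 1
    rhs = [0.0] * n
    for e in range(num_elements):
        k_local = 1.0 / h
        f_local = source * h / 2.0
        diag[e] += k_local
        diag[e + 1] += k_local
        off[e] -= k_local
        rhs[e] += f_local
        rhs[e + 1] += f_local

    # Phase 4, resolution: keep only interior nodes (boundary values are 0)
    # and solve the symmetric tridiagonal system with the Thomas algorithm.
    a = diag[1:n - 1]
    b = rhs[1:n - 1]
    c = off[1:n - 2]
    m = len(a)
    for i in range(1, m):
        w = c[i - 1] / a[i - 1]
        a[i] -= w * c[i - 1]
        b[i] -= w * b[i - 1]
    interior = [0.0] * m
    interior[m - 1] = b[m - 1] / a[m - 1]
    for i in range(m - 2, -1, -1):
        interior[i] = (b[i] - c[i] * interior[i + 1]) / a[i]

    return nodes, [0.0] + interior + [0.0]


# Phase 5, post-processing: compare with the exact solution x * (1 - x) / 2.
nodes, solution = solve_poisson_1d(8)
max_error = max(abs(u - x * (1.0 - x) / 2.0) for x, u in zip(nodes, solution))
print(f"maximum nodal error: {max_error:.2e}")
```

In this sketch the element loop and the linear solve are the natural candidates for parallel execution, since every element's contribution is computed independently before being accumulated.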

Méthodes numériques pour la résolution accélérée des systèmes linéaires de grandes tailles sur architectures hybrides massivement parallèles / Numerical methods for the accelerated resolution of large scale linear systems on massively parallel hybrid architecture

Cheik Ahamed, Abal-Kassim 07 July 2015
Les progrès en termes de puissance de calcul ont entraîné de nombreuses évolutions dans le domaine de la science et de ses applications. La résolution de systèmes linéaires survient fréquemment dans le calcul scientifique, comme par exemple lors de la résolution d'équations aux dérivées partielles par la méthode des éléments finis. Le temps de résolution découle alors directement des performances des opérations algébriques mises en jeu. Cette thèse a pour but de développer des algorithmes parallèles innovants pour la résolution de systèmes linéaires creux de grandes tailles. Nous étudions et proposons comment calculer efficacement les opérations d'algèbre linéaire sur plateformes de calcul multi-coeur hétérogènes-GPU afin d'optimiser et de rendre robuste la résolution de ces systèmes. Nous proposons de nouvelles techniques d'accélération basées sur la distribution automatique (auto-tuning) des threads sur la grille GPU suivant les caractéristiques du problème et le niveau d'équipement de la carte graphique, ainsi que les ressources disponibles. Les expérimentations numériques effectuées sur un large spectre de matrices issues de divers problèmes scientifiques, ont clairement montré l'intérêt de l'utilisation de la technologie GPU, et sa robustesse comparée aux bibliothèques existantes comme Cusp. L'objectif principal de l'utilisation du GPU est d'accélérer la résolution d'un problème dans un environnement parallèle multi-coeur, c'est-à-dire "Combien de temps faut-il pour résoudre le problème?". Dans cette thèse, nous nous sommes également intéressés à une autre question concernant la consommation énergétique, c'est-à-dire "Quelle quantité d'énergie est consommée par l'application?". Pour répondre à cette seconde question, un protocole expérimental est établi pour mesurer la consommation d'énergie d'un GPU avec précision pour les opérations fondamentales d'algèbre linéaire. 
Cette méthodologie favorise une "nouvelle vision du calcul haute performance" et apporte des réponses à certaines questions rencontrées dans l'informatique verte ("green computing") lorsque l'on s'intéresse à l'utilisation de processeurs graphiques. Le reste de cette thèse est consacré aux algorithmes itératifs synchrones et asynchrones pour résoudre ces problèmes dans un contexte de calcul hétérogène multi-coeur-GPU. Nous avons mis en application et analysé ces algorithmes à l'aide des méthodes itératives basées sur les techniques de sous-structurations. Dans notre étude, nous présentons les modèles mathématiques et les résultats de convergence des algorithmes synchrones et asynchrones. La démonstration de la convergence asynchrone des méthodes de sous-structurations est présentée. Ensuite, nous analysons ces méthodes dans un contexte hybride multi-coeur-GPU, qui devrait ouvrir la voie vers les méthodes hybrides exaflopiques. Enfin, nous modifions la méthode de Schwarz sans recouvrement pour l'accélérer à l'aide des processeurs graphiques. La mise en oeuvre repose sur l'accélération par les GPUs de la résolution locale des sous-systèmes linéaires associés à chaque sous-domaine. Pour améliorer les performances de la méthode de Schwarz, nous avons utilisé des conditions d'interfaces optimisées obtenues par une technique stochastique basée sur la stratégie CMA-ES (Covariance Matrix Adaptation Evolution Strategy). Les résultats numériques attestent des bonnes performances, de la robustesse et de la précision des algorithmes synchrones et asynchrones pour résoudre de grands systèmes linéaires creux dans un environnement de calcul hétérogène multi-coeur-GPU. / Advances in computational power have led to many developments in science and its applications. Solving linear systems occurs frequently in scientific computing, as in the finite element discretization of partial differential equations. 
The running time of the overall resolution is a direct result of the performance of the algebraic operations involved. In this dissertation, different ways of efficiently solving large sparse linear systems are put forward. We study and propose ways to compute linear algebra operations efficiently in a heterogeneous multi-core-GPU environment, in order to make solvers such as iterative methods more robust and to reduce the time needed to solve these systems. We propose new acceleration techniques based on the automatic distribution (auto-tuning) of threads on the GPU grid, according to the characteristics of the problem, the capabilities of the graphics card and the available resources. Numerical experiments performed on a set of large sparse matrices arising from diverse engineering and scientific problems have clearly shown the benefit of using GPU technology to solve large sparse systems of linear equations, as well as its robustness and accuracy compared to existing libraries such as Cusp. The main goal of using a GPU is to reduce the time needed to obtain the solution in a parallel environment, i.e., "How much time is needed to solve the problem?". In this thesis, we also address another question regarding energy, i.e., "How much energy is consumed by the application?". To answer this second question, an experimental protocol is established to accurately measure the energy consumption of a GPU for fundamental linear algebra operations. This methodology fosters a "new vision of high-performance computing" and answers some of the questions raised in green computing when using GPUs. The remainder of this thesis is devoted to synchronous and asynchronous iterative algorithms for solving linear systems in the context of a multi-core-GPU system. We have implemented and analyzed these algorithms using iterative methods based on sub-structuring techniques. 
Mathematical models and convergence results for the synchronous and asynchronous algorithms are presented, as is a proof of the asynchronous convergence of the sub-structuring methods. We then analyze these methods in a hybrid multi-core-GPU context, which should pave the way for exascale hybrid methods. Lastly, we modify the non-overlapping Schwarz method to accelerate it using GPUs. The implementation relies on using GPUs to accelerate the local solution of the linear sub-systems associated with each sub-domain. To improve the performance of the Schwarz method, we use optimized interface conditions obtained by a stochastic technique based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Numerical results illustrate the good performance, robustness and accuracy of the synchronous and asynchronous algorithms for solving large sparse linear systems in a heterogeneous multi-core-GPU environment.
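
To make concrete why the solution time follows directly from the cost of the underlying algebraic operations, here is a minimal sequential Python sketch of a sparse matrix-vector product in compressed sparse row (CSR) storage, used inside a synchronous (Jacobi) iteration. It is not the thesis's code and does not use Cusp or a GPU; the function names `csr_matvec` and `jacobi`, the CSR layout and the choice of Jacobi as the synchronous method are assumptions made for illustration.

```python
def csr_matvec(data, indices, indptr, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form.

    data    : nonzero values, row by row
    indices : column index of each nonzero
    indptr  : indptr[r]..indptr[r+1] delimits the nonzeros of row r
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        # On a GPU, rows (or groups of rows) would typically be mapped to threads.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def jacobi(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Synchronous Jacobi iteration: every component is updated from the
    previous iterate. Returns (solution, number of updates performed)."""
    n = len(b)
    # Extract the diagonal of A (assumed nonzero).
    diag = [0.0] * n
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            if indices[k] == row:
                diag[row] = data[k]

    x = [0.0] * n
    for iteration in range(max_iter):
        ax = csr_matvec(data, indices, indptr, x)   # dominant cost per iteration
        residual = [b[i] - ax[i] for i in range(n)]
        if max(abs(r) for r in residual) < tol:
            return x, iteration
        x = [x[i] + residual[i] / diag[i] for i in range(n)]
    return x, max_iter


if __name__ == "__main__":
    # A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]], chosen so that x = [1, 2, 3].
    data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
    indices = [0, 1, 0, 1, 2, 1, 2]
    indptr = [0, 2, 5, 7]
    b = [2.0, 4.0, 10.0]
    solution, updates = jacobi(data, indices, indptr, b)
    print(solution, updates)
```

Each iteration is dominated by one `csr_matvec` call, so any speed-up of that kernel carries over almost directly to the whole solve. An asynchronous variant would let components be updated with whatever values are currently available instead of waiting for the full previous iterate.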

