Global ETD Search

121	Ordonnancement dynamique, adapté aux architectures hétérogènes, de la méthode multipôle pour les équations de Maxwell, en électromagnétisme Bordage, Cyril 20 December 2013 (has links) La méthode multipôle permet d'accélérer les produits matrices-vecteurs utilisés par les solveurs itératifs pour déterminer le comportement électromagnétique d'un objet soumis à une onde incidente. Nos travaux ont pour but d'adapter cette méthode pour la rendre efficace sur les architectures hétérogènes contenant des GPU. Pour cela, nous utilisons un ordonnanceur dynamique, StarPU, qui effectuera la distribution des tâches de calcul au sein d'un nœud. Pour la parallélisation en mémoire distribuée, nous effectuerons un ordonnancement statique des boîtes, couplé à un ordonnancement dynamique des interactions proches. / The Fast Multipole Method speeds up the matrix-vector products used by iterative solvers to compute the electromagnetic response of an object subject to an incident wave. This work adapts the method to make it efficient on heterogeneous architectures with GPUs. For this purpose, we use a dynamic scheduler, StarPU, which distributes the computational tasks within a node. For parallelization in distributed memory, we schedule the boxes statically and the near-field interactions dynamically. Méthode multipôle Fmm Ordonnancement dynamique Électromagnétique Helmholtz Mpi StarPU Cuda Fast multipole method Fmm Dynamic scheduling Electromagnetics Helmholtz Mpi StarPU Cuda
122	Trace-based Performance Analysis for Hardware Accelerators / Leistungsanalyse hardwarebeschleunigter Anwendungen mittels Programmspuren Juckeland, Guido 14 February 2013 (has links) (PDF) This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. 
This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods. / Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. 
In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat. Leistungsanalyse Hardwarebeschleuniger GPUs CUDA OpenCL Tracing Particle-in-Cell Performance Analysis Hardware accelerators GPUs CUDA OpenCL Tracing Particle-in-Cell ddc:004 rvk:ST 150
123	Estudo para otimização do algoritmo Non-local means visando aplicações em tempo real Silva, Hamilton Soares da 25 July 2014 (has links) Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / The aim of this work is to study the non-local means (NLM) algorithm and to propose techniques for optimizing and implementing it for real-time applications. Two implementation alternatives are suggested. The first is an accelerator card for computers with a PCI bus, containing specialized hardware that implements the NLM filter. The second uses the densely multiprocessed GPU environment found in video cards. Both proposals significantly accelerate the NLM algorithm while maintaining the same visual quality as traditional software implementations, enabling real-time use. Image denoising is an important area of digital image processing. Its use is becoming more widespread because improvements in acquisition equipment, and the resulting increase in image resolution, favor the occurrence of such perturbations. It is widely studied in the fields of image processing, computer vision and predictive maintenance of electrical substations, motors, tires, building facilities, pipes and fittings, with a focus on reducing noise without removing details of the original image. Several approaches have been proposed for filtering noise. One of them is the non-local method called Non-Local Means (NLM), which uses the entire image rather than only local information and stands out as the state of the art. 
However, this method has a high computational complexity, which makes it almost impossible to use in real-time applications, even for small images. / O propósito deste trabalho é estudar o algoritmo non-local means (NLM) e propor técnicas para otimizar e implementar o referido algoritmo visando sua aplicação em tempo real. Ao todo são sugeridas duas alternativas de implementação. A primeira trata do desenvolvimento de uma placa aceleradora para computadores que possuam Barramento PCI, contendo um hardware especializado que implementa o Filtro NLM. A segunda implementação utiliza o ambiente densamente multiprocessado GPU, existente nas controladoras de vídeo. As duas propostas aceleraram significativamente o algoritmo NLM, mantendo a mesma qualidade visual das implementações tradicionais em software, tornando possível sua utilização em tempo real. A filtragem de ruídos é uma área importante para o processamento digital de imagens, sendo cada vez mais utilizada devido as melhorias dos novos equipamentos de captação, e o consequente aumento da resolução da imagem, que favorece o aparecimento dessas perturbações. Ela é amplamente estudada nos campos de tratamento de imagens, visão computacional e manutenção preditiva de subestações elétricas, motores, pneus, instalações prediais, tubos e conexões, focando em reduzir os ruídos sem que se remova os detalhes da imagem original. 
Várias abordagens foram propostas para filtragem de ruídos, uma delas é o método não-local, chamado de Non-Local Means (NLM), que não só utiliza as informações locais, mas a imagem inteira, destaca-se como o estado da arte, porém, há um problema neste método, que é a sua alta complexidade computacional, que o torna praticamente inviável de ser utilizado em aplicações em tempo real, até mesmo para imagens pequenas Processamento Digital de Imagens Redução de Ruído Computação Reconfigurável Computação Paralela Programação GPU CUDA Digital Image Processing Image Denoising Reconfigurable Computing Parallel Computing Programming CUDA GPU CNPQ::ENGENHARIAS::ENGENHARIA MECANICA
124	Implementação do algoritmo (RTM) para processamento sísmico em arquiteturas não convencionais Lima, Igo Pedro de 16 June 2014 (has links) Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / With the growth of energy consumption worldwide, conventional reservoirs, the so-called "easy exploration and production" reservoirs, are no longer meeting global energy demand. This has led many researchers to develop projects that address these needs, and companies in the oil sector have invested in techniques that help in locating and drilling wells. One of the techniques employed in oil exploration is Reverse Time Migration (RTM), a seismic imaging method that produces excellent images of the subsurface. It is an algorithm based on the computation of the wave equation. RTM is considered one of the most advanced seismic imaging techniques. The economic value of the oil reserves that require RTM to be located is very high, which means that the development of these algorithms becomes a competitive differentiator for seismic processing companies. However, it requires great computational power, which still somewhat hampers its practical success. The objective of this work is to explore the implementation of this algorithm on unconventional architectures, specifically GPUs using the CUDA platform, analyzing the difficulties of its development as well as the performance of the algorithm in its sequential and parallel versions. / Com o crescimento do consumo energético em todo o mundo, os reservatórios convencionais, chamados de reservatórios de fácil exploração e produção não estão atendendo a demanda energética mundial. Isso tem levado muitos pesquisadores a desenvolver trabalhos que venham sanar essas carências. 
Empresas do setor petrolífero tem investido em técnicas que ajudem na localização e perfuração de poços. Uma das técnicas empregadas no processo de exploração de petróleo é a Migração Reversa no Tempo (RTM), do inglês, Reverse Time Migration, que é um método de imageamento sísmico que produz excelente imagem de subsuperfície. É um algoritmo baseado no cálculo da equação de onda. A RTM é considerada uma das técnicas mais avançadas de imageamento sísmico. O valor econômico das reservas de petróleo que requerem RTM para ser localizada é muito alto, isso significa que o desenvolvimento desses algoritmos torna-se um diferencial competitivo para as empresas de processamento sísmico. No entanto, o mesmo requer grande poder computacional que, de alguma forma, ainda prejudica o seu sucesso prático. Assim, o objetivo deste trabalho é explorar a implementação desse algoritmo em arquiteturas não convencionais, especificamente as GPUs, utilizando a plataforma CUDA, fazendo uma análise das dificuldades no desenvolvimento do mesmo, bem como a performance do algoritmo na versão sequencial e paralela
125	Reconstrução de imagens por tomografia por impedância elétrica utilizando recozimento simulado massivamente paralelizado. / Image reconstruction through electrical impedance tomography using massively parallelized simulated annealing. Renato Seiji Tavares 06 May 2016 (has links) A tomografia por impedância elétrica é uma modalidade de imageamento médico recente, com diversas vantagens sobre as demais modalidades já consolidadas. O recozimento simulado é um algoritmo que apresentada qualidade de solução, mesmo com a utilização de uma regularização simples e sem informação a priori. Entretanto, existe a necessidade de reduzir o tempo de processamento. Este trabalho avança nessa direção, com a apresentação de um método de reconstrução que utiliza o recozimento simulado e paralelização massiva em GPU. A paralelização das operações matriciais em GPU é explicada, com uma estratégia de agendamento de threads que permite a paralelização efetiva de algoritmos, até então, considerados não paralelizáveis. Técnicas para sua aceleração são discutidas, como a heurística de fora para dentro. É proposta uma nova representação de matrizes esparsas voltada para as características da arquitetura CUDA, visando um melhor acesso à memória global do dispositivo e melhor utilização das threads. Esta nova representação de matriz mostrou-se vantajosa em relação aos formatos mais utilizados. Em seguida, a paralelização massiva do problema inverso da TIE, utilizando recozimento simulado, é estudada, com uma proposta de abordagem híbrida com paralelização tanto em CPU quanto GPU. Os resultados obtidos para a paralelização do problema inverso são superiores aos do problema direto. A GPU satura em aproximadamente 7.000 nós, a partir do qual o ganho em desempenho é de aproximadamente 5 vezes. A utilização de GPUs é viável para a reconstrução de imagens de tomografia por impedância elétrica. 
/ Electrical impedance tomography (EIT) is a new medical imaging modality with remarkable advantages over other established modalities. Simulated annealing is an algorithm that yields quality solutions despite the use of simple regularization methods and the absence of a priori information. However, its processing time still needs to be reduced. This work takes a step in this direction, presenting a method for the reconstruction of EIT images using simulated annealing and massive GPU parallelization. The parallelization of matrix operations on the GPU is explained, with a thread scheduling strategy that allows the effective parallelization of algorithms hitherto considered non-parallelizable. Techniques for accelerating it are discussed, such as the outside-in heuristic. A new sparse matrix representation tailored to the characteristics of the CUDA architecture is proposed, aiming at better access to the device's global memory and better thread utilization. This new matrix representation proved advantageous over the most commonly used formats. The massive parallelization of the EIT inverse problem using simulated annealing is then studied, with a proposed hybrid approach that parallelizes on both CPU and GPU. The results obtained for the parallelization of the inverse problem are better than those for the forward problem. The GPU saturates at meshes of approximately 7,000 nodes, beyond which the performance gain is about 5 times over serial implementations. GPU parallelization is viable for the reconstruction of electrical impedance tomography images. Algoritmos paralelos CUDA GPU Otimização estocástica Problemas inversos Processamento de imagens Recozimento simulado Tomografia CUDA Electrical impedance tomography GPU Parallelized algorithms Simulated annealing
126	Resolução de um problema térmico inverso utilizando processamento paralelo em arquiteturas de memória compartilhada / Resolution of an inverse thermal problem using parallel processing on shared memory architectures Jonas Laerte Ansoni 03 September 2010 (has links) A programação paralela tem sido freqüentemente adotada para o desenvolvimento de aplicações que demandam alto desempenho computacional. Com o advento das arquiteturas multi-cores e a existência de diversos níveis de paralelismo é importante definir estratégias de programação paralela que tirem proveito desse poder de processamento nessas arquiteturas. Neste contexto, este trabalho busca avaliar o desempenho da utilização das arquiteturas multi-cores, principalmente o oferecido pelas unidades de processamento gráfico (GPUs) e CPUs multi-cores na resolução de um problema térmico inverso. Algoritmos paralelos para a GPU e CPU foram desenvolvidos utilizando respectivamente as ferramentas de programação em arquiteturas de memória compartilhada NVIDIA CUDA (Compute Unified Device Architecture) e a API POSIX Threads. O algoritmo do método do gradiente conjugado pré-condicionado para resolução de sistemas lineares esparsos foi implementado totalmente no espaço da memória global da GPU em CUDA. O algoritmo desenvolvido foi avaliado em dois modelos de GPU, os quais se mostraram mais eficientes, apresentando um speedup de quatro vezes que a versão serial do algoritmo. A aplicação paralela em POSIX Threads foi avaliada em diferentes CPUs multi-cores com distintas microarquiteturas. Buscando um maior desempenho do código paralelizado foram utilizados flags de otimização as quais se mostraram muito eficientes na aplicação desenvolvida. Desta forma o código paralelizado com o auxílio das flags de otimização chegou a apresentar tempos de processamento cerca de doze vezes mais rápido que a versão serial no mesmo processador sem nenhum tipo de otimização. 
Assim tanto a abordagem utilizando a GPU como um co-processador genérico a CPU como a aplicação paralela empregando as CPUs multi-cores mostraram-se ferramentas eficientes para a resolução do problema térmico inverso. / Parallel programming has frequently been adopted for the development of applications that demand high computational performance. With the advent of multi-core architectures and the existence of several levels of parallelism, it is important to define parallel programming strategies that take advantage of the processing power of these architectures. In this context, this study evaluates the performance of multi-core architectures, mainly that offered by graphics processing units (GPUs) and multi-core CPUs, in the resolution of an inverse thermal problem. Parallel algorithms for the GPU and the CPU were developed using, respectively, the shared-memory programming tools NVIDIA CUDA (Compute Unified Device Architecture) and the POSIX Threads API. The preconditioned conjugate gradient algorithm for solving sparse linear systems was implemented in CUDA entirely within the global memory space of the GPU. The algorithm was evaluated on two GPU models, which proved more efficient, with a speedup of four times over the serial version of the algorithm. The parallel POSIX Threads application was evaluated on different multi-core CPUs with distinct microarchitectures. To obtain higher performance from the parallelized code, optimization flags were used, which proved very effective in the developed application. With the help of these flags, the parallelized code reached processing times about twelve times faster than the serial version on the same processor without any optimization. Thus both the approach using the GPU as a generic coprocessor to the CPU and the parallel application using multi-core CPUs proved to be efficient tools for solving the inverse thermal problem. 
GPGPU CUDA Gradiente conjugado pré-condicionado Matriz esparsa POSIX threads Processamento paralelo GPGPU CUDA Parallel processing POSIX threads Sparse numerical solver
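As an illustration of the solver named in entry 126, the following is a minimal sketch of a conjugate gradient iteration with a Jacobi (diagonal) preconditioner. It is not the thesis code: the function name `pcg_jacobi`, the dense list-of-rows matrix and the choice of a Jacobi preconditioner are assumptions made for brevity, whereas the thesis works with sparse matrices in CUDA and the abstract does not say which preconditioner it uses.

```python
def pcg_jacobi(a, b, tol=1e-10, max_iter=1000):
    """Solve a x = b for a symmetric positive-definite matrix a (list of rows).

    Hypothetical helper for illustration; dense storage is an assumption.
    """
    n = len(b)

    def matvec(v):
        # Dense matrix-vector product; a sparse format would replace this.
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / a[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                   # residual b - a x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # stop on small residual norm
            break
        ap = matvec(p)
        alpha = rz / dot(p, ap)                   # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x
```

Each iteration is dominated by one matrix-vector product and a few dot products, which are the operations a GPU implementation would typically parallelize.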
127	Precificação de opções exóticas utilizando CUDA / Exotic options pricing using CUDA Felipe Boteon Calderaro 17 October 2017 (has links) No mercado financeiro, a precificação de contratos complexos muitas vezes apoia-se em técnicas de simulação numérica. Estes métodos de precificação geralmente apresentam baixo desempenho devido ao grande custo computacional envolvido, o que dificulta a análise e a tomada de decisão por parte do trader. O objetivo deste trabalho é apresentar uma ferramenta de alto desempenho para a precificação de instrumentos financeiros baseados em simulações numéricas. A proposta é construir uma calculadora eficiente para a precificação de opções multivariadas baseada no método de Monte Carlo, utilizando a plataforma CUDA de programação paralela. Serão apresentados os conceitos matemáticos que embasam a precificação risco-neutra, tanto no contexto univariado quanto no multivariado. Após isso entraremos em detalhes sobre a implementação da simulação Monte Carlo e a arquitetura envolvida na plataforma CUDA. No final, apresentaremos os resultados obtidos comparando o tempo de execução dos algoritmos. / In the financial market, the pricing of complex contracts often relies on numerical simulation techniques. These pricing methods generally present poor performance due to the large computational cost involved, which makes it difficult for the trader to analyze and make decisions. The objective of this work is to present a high performance tool for the pricing of financial instruments based on numerical simulations. The proposal is to present an efficient calculator for the pricing of multivariate options based on the Monte Carlo method, using the parallel programming CUDA platform. The mathematical concepts underlying risk-neutral pricing, both in the univariate and in the multivariate context, will be presented. After this we will detail the implementation of the Monte Carlo simulation and the architecture involved in the CUDA platform. 
At the end, we will present the results obtained comparing the execution time of the algorithms. CUDA Geração de números aleatórios Opções exóticas Precificação de derivativos Simulação Monte Carlo CUDA Derivatives pricing Exotic options Monte Carlo Simulation Random number generation
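To make the Monte Carlo approach of entry 127 concrete, here is a small CPU-only sketch that prices an arithmetic-average Asian call, one common exotic option, under risk-neutral geometric Brownian motion. The function name `mc_asian_call`, the single-asset model and all parameter names are assumptions for illustration; the thesis targets multivariate options on CUDA, where each simulated path would typically map to a GPU thread.

```python
import math
import random


def mc_asian_call(s0, strike, r, sigma, t, steps, paths, seed=0):
    """Monte Carlo price of an arithmetic-average Asian call (illustrative)."""
    rng = random.Random(seed)
    dt = t / steps
    drift = (r - 0.5 * sigma ** 2) * dt   # risk-neutral log-drift per step
    vol = sigma * math.sqrt(dt)           # log-volatility per step
    total = 0.0
    for _ in range(paths):
        s = s0
        running = 0.0
        for _ in range(steps):
            # Exact one-step update of geometric Brownian motion.
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            running += s
        # Payoff on the arithmetic average of the monitored prices.
        total += max(running / steps - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-r * t) * total / paths
```

A call such as `mc_asian_call(100.0, 100.0, 0.05, 0.2, 1.0, 12, 10000)` returns an estimate whose statistical error shrinks roughly with the square root of the number of paths, which is why the method benefits from massive parallelism.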
128	Simulações numéricas 3D em ambiente paralelo de hipertermia com nanopartículas magnéticas Reis, Ruy Freitas 05 November 2014 (has links) CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Este estudo tem como objetivo a modelagem numérica do tratamento de tumores sólidos com hipertermia utilizando nanopartículas magnéticas, considerando o modelo tridimensional de biotransferência de calor proposto por Pennes (1948). Foram comparadas duas diferentes possibilidades de perfusão sanguínea, a primeira constante e, a segunda, dependente da temperatura. O tecido é modelado com as camadas de pele, gordura e músculo, além do tumor. Para encontrar a solução aproximada do modelo foi aplicado o método das diferenças finitas (MDF) em um meio heterogêneo. Devido aos diferentes parâmetros de perfusão, foram obtidos sistemas de equações lineares (perfusão constante) e não lineares (perfusão dependente da temperatura). No domínio do tempo foram utilizados dois esquemas numéricos explícitos, o primeiro utilizando o método clássico de Euler e o segundo um algoritmo do tipo preditor-corretor adaptado dos métodos de integração generalizada da família-alpha trapezoidal. 
Uma vez que a execução de um modelo tridimensional demanda um alto custo computacional, foram empregados dois esquemas de paralelização do método numérico, o primeiro baseado na API de programação paralela OpenMP e o segundo com a plataforma CUDA. Os resultados experimentais mostraram que a paralelização em OpenMP obteve aceleração de até 39 vezes comparada com a versão serial, e, além disto, a versão em CUDA também foi eficiente, obtendo um ganho de 242 vezes, também comparando-se com o tempo de execução sequencial. Assim, o resultado da execução é obtido cerca de duas vezes mais rápido do que o fenômeno biológico. / This work deals with the numerical modeling of solid tumor treatment by hyperthermia with magnetic nanoparticles, using the 3D bioheat transfer model proposed by Pennes (1948). Two different blood perfusion assumptions were compared: the first uses a constant value and the second a temperature-dependent function. The living tissue was modeled with skin, fat and muscle layers, in addition to the tumor. The model solution was approximated with the finite difference method (FDM) in a heterogeneous medium. The different perfusion parameters lead to a system of linear equations (constant perfusion) and a system of nonlinear equations (temperature-dependent perfusion). To discretize the time domain, two explicit numerical schemes were used: the classical Euler method and a predictor-corrector algorithm adapted from the generalized trapezoidal alpha-family of time integration methods. Since solving a three-dimensional model is computationally expensive, two parallelization strategies were applied to the numerical method. The first uses the OpenMP parallel programming API and the second the CUDA platform. 
The experimental results showed that the OpenMP parallelization achieved a speedup of up to 39 times over the sequential version, and the CUDA version was also efficient, with a gain of up to 242 times over the sequential execution time. As a result, the simulation completes about twice as fast as the biological phenomenon it models. CNPQ::CIENCIAS EXATAS E DA TERRA Nanopartículas Hipertermia Biotransferência de calor Computação de alto desempenho CUDA OpenMP Nanoparticles Hyperthermia Bioheating High performance computation CUDA OpenMP
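Entry 128 combines the Pennes bioheat equation with finite differences and an explicit Euler time step. In its usual form the equation reads rho c dT/dt = div(k grad T) + w_b rho_b c_b (T_a - T) + Q, where Q collects the heat sources. The sketch below reduces this to one dimension with constant tissue properties and fixed boundary temperatures. Those simplifications, the function name `pennes_euler_step` and the parameter names are all assumptions, since the thesis solves a 3D heterogeneous problem.

```python
def pennes_euler_step(temp, dt, dx, k, rho, c, w_b, rho_b, c_b, t_a, q):
    """Advance a 1D temperature profile by one explicit Euler step.

    temp and q are lists of nodal temperatures and heat sources; the two
    end nodes are held fixed (Dirichlet condition, an assumption here).
    """
    new = list(temp)
    for i in range(1, len(temp) - 1):
        # Second-order central difference for the diffusion term.
        lap = (temp[i - 1] - 2.0 * temp[i] + temp[i + 1]) / dx ** 2
        # Blood perfusion pulls the tissue toward arterial temperature t_a.
        perfusion = w_b * rho_b * c_b * (t_a - temp[i])
        new[i] = temp[i] + dt * (k * lap + perfusion + q[i]) / (rho * c)
    return new
```

Every interior node is updated independently from the previous time level, which is what makes the scheme straightforward to parallelize with OpenMP or CUDA. As with any explicit diffusion scheme, the time step has to stay small relative to dx squared for the iteration to remain stable.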
129	PROGRAMAÇÃO PARALELA HÍBRIDA PARA CPU E GPU: UMA AVALIAÇÃO DO OPENACC FRENTE A OPENMP E CUDA / HYBRID PARALLEL PROGRAMMING FOR CPU AND GPU: AN EVALUATION OF OPENACC AS RELATED TO OPENMP AND CUDA Sulzbach, Maurício 22 August 2014 (has links) As a consequence of advances in CPU and GPU architectures, the number of parallel programming APIs for both devices has grown in recent years. While OpenMP is used for parallel processing on the CPU, CUDA and OpenACC are employed for parallel processing on the GPU. For GPU programming, CUDA presents a function-based model that makes the source code lengthy and error-prone, in addition to leading to low development productivity. OpenACC emerged to solve these problems and to be an alternative to CUDA. Similar to OpenMP, this API provides directives that ease the development of parallel applications, but for execution on the GPU. To further increase performance and take advantage of the parallel capabilities of both CPU and GPU, it is possible to develop hybrid algorithms that split the processing between the two devices. In that sense, the main objective of this work is to verify whether the conveniences that OpenACC introduces also carry over to hybrid programming with OpenMP, compared to the OpenMP + CUDA model. A second objective is to identify limitations of the two hybrid programming models that could affect performance or application development. To accomplish these goals, this work presents the development of three hybrid parallel algorithms based on algorithms from the Rodinia benchmark, namely RNG, Hotspot and SRAD, using the hybrid models OpenMP + CUDA and OpenMP + OpenACC. In these algorithms, OpenMP is assigned the parallel execution on the CPU, while CUDA and OpenACC are responsible for the parallel processing on the GPU. 
After the hybrid algorithms were run, the performance, the efficiency and the division of the processing between the devices were analyzed. The runs showed that, in both proposed programming models, it was possible to exceed the performance of a parallel application that uses a single API and runs on only one of the devices. In addition, in the hybrid algorithms RNG and Hotspot, CUDA's performance was superior to that of OpenACC, while in the SRAD algorithm OpenACC was faster than CUDA. / Como consequência do avanço das arquiteturas de CPU e GPU, nos últimos anos houve um aumento no número de APIs de programação paralela para os dois dispositivos. Enquanto que OpenMP é utilizada no processamento paralelo em CPU, CUDA e OpenACC são empregadas no processamento paralelo em GPU. Na programação para GPU, CUDA apresenta um modelo baseado em funções que deixam o código fonte extenso e propenso a erros, além de acarretar uma baixa produtividade no desenvolvimento. Objetivando solucionar esses problemas e sendo uma alternativa à utilização de CUDA surgiu o OpenACC. Semelhante ao OpenMP, essa API disponibiliza diretivas que facilitam o desenvolvimento de aplicações paralelas, porém para execução em GPU. Para aumentar ainda mais o desempenho e tirar proveito da capacidade de paralelismo de CPU e GPU, é possível desenvolver algoritmos híbridos que dividam o processamento nos dois dispositivos. Nesse sentido, este trabalho objetiva verificar se as facilidades que o OpenACC introduz também refletem positivamente na programação híbrida com OpenMP, se comparado ao modelo OpenMP + CUDA. Além disso, o trabalho visa relatar as limitações nos dois modelos de programação híbrida que possam influenciar no desempenho ou no desenvolvimento de aplicações. 
Como forma de cumprir essas metas, este trabalho apresenta o desenvolvimento de três algoritmos paralelos híbridos baseados nos algoritmos do benchmark Rodinia, a saber, RNG, Hotspot e SRAD, utilizando os modelos híbridos OpenMP + CUDA e OpenMP + OpenACC. Nesses algoritmos é atribuída ao OpenMP a execução paralela em CPU, enquanto que CUDA e OpenACC são responsáveis pelo processamento paralelo em GPU. Após as execuções dos algoritmos híbridos foram analisados o desempenho, a eficiência e a divisão da execução em cada um dos dispositivos. Verificou-se através das execuções dos algoritmos híbridos que nos dois modelos de programação propostos foi possível superar o desempenho de uma aplicação paralela em uma única API, com execução em apenas um dos dispositivos. Além disso, nos algoritmos híbridos RNG e Hotspot o desempenho de CUDA foi superior ao desempenho de OpenACC, enquanto que no algoritmo SRAD a API OpenACC apresentou uma execução mais rápida, se comparada à API CUDA. CPU GPU OpenMP CUDA OpenACC Programação paralela híbrida Desempenho CPU GPU OpenMP CUDA OpenACC Hybrid parallel programming Performance
130	Résolution de systèmes linéaires et non linéaires creux sur grappes de GPUs / Solving sparse linear and nonlinear systems on GPU clusters Ziane Khodja, Lilia 07 June 2013 (has links) Depuis quelques années, les grappes équipées de processeurs graphiques GPUs sont devenues des outils très attrayants pour le calcul parallèle haute performance. Dans cette thèse, nous avons conçu des algorithmes itératifs parallèles pour la résolution de systèmes linéaires et non linéaires creux de très grandes tailles sur grappes de GPUs. Dans un premier temps, nous nous sommes focalisés sur la résolution de systèmes linéaires creux à l'aide des méthodes itératives CG et GMRES. Les expérimentations ont montré qu'une grappe de GPUs est plus performante que son homologue grappe de CPUs pour la résolution de systèmes linéaires de très grandes tailles. Ensuite, nous avons mis en œuvre des algorithmes parallèles synchrones et asynchrones des méthodes itératives Richardson et de relaxation par blocs pour la résolution de systèmes non linéaires creux. Nous avons constaté que les meilleures solutions développées pour les CPUs ne sont pas nécessairement bien adaptées aux GPUs. En effet, les simulations effectuées sur une grappe de GPUs ont montré que les algorithmes Richardson sont largement plus efficaces que ceux de relaxation par blocs. De plus, elles ont aussi montré que la puissance de calcul des GPUs permet de réduire le rapport entre le temps d'exécution et celui de communication, ce qui favorise l'utilisation des algorithmes asynchrones sur des grappes de GPUs. Enfin, nous nous sommes intéressés aux grappes géographiquement distantes pour la résolution de systèmes linéaires creux. Dans ce contexte, nous avons utilisé la méthode de multi-décomposition à deux niveaux avec GMRES parallèle adaptée aux grappes de GPUs. 
Celle-ci utilise des itérations synchrones pour résoudre localement les sous-systèmes linéaires et des itérations asynchrones pour résoudre la globalité du système linéaire. / Over the past few years, clusters equipped with GPUs have become attractive tools for high performance computing. In this thesis, we have designed parallel iterative algorithms for solving large sparse linear and nonlinear systems on GPU clusters. First, we focused on solving sparse linear systems using the CG and GMRES iterative methods. The experiments have shown that a GPU cluster is more efficient than its pure CPU counterpart for solving large sparse systems of linear equations. Then, we implemented synchronous and asynchronous algorithms of the Richardson and block relaxation iterative methods for solving sparse nonlinear systems. We have noticed that the best solutions developed for CPUs are not necessarily well suited to GPUs. Indeed, the experiments performed on a GPU cluster have shown that the parallel algorithms of the Richardson method are far more efficient than those of the block relaxation method. In addition, they have shown that the computing power of GPUs reduces the ratio of computation time to communication time, which favors the use of asynchronous iterations on GPU clusters. Finally, we turned to geographically distant clusters for solving large sparse linear systems. In this context, we used a two-stage multisplitting method with a parallel GMRES method adapted to GPU clusters. It uses synchronous iterations to solve the linear subsystems locally and asynchronous iterations to solve the global sparse linear system. Méthodes itératives Parallélisme MPI/CUDA Grappes de GPUs Sparse linear and nonlinear systems Iterative methods MPI/CUDA parallelism GPU clusters 005.1
