351

Estudo para otimização do algoritmo Non-local means visando aplicações em tempo real / A study on optimizing the non-local means algorithm for real-time applications

Silva, Hamilton Soares da 25 July 2014 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / The aim of this work is to study the non-local means (NLM) algorithm and to propose techniques for optimizing and implementing it for real-time use. Two implementation alternatives are suggested. The first is an accelerator card for computers with a PCI bus, containing specialized hardware that implements the NLM filter. The second uses the densely multiprocessor GPU environment found on video controllers. Both proposals significantly accelerate the NLM algorithm while maintaining the same visual quality as traditional software implementations, making real-time use possible. Noise filtering is an important area of digital image processing, and it is increasingly used as new acquisition equipment improves and image resolution grows, which favors the appearance of such perturbations. It is widely studied in image processing, computer vision, and the predictive maintenance of electrical substations, motors, tires, building facilities, pipes and fittings, with the goal of reducing noise without removing details of the original image. Several approaches have been proposed for noise filtering; one of them is the non-local method called Non-Local Means (NLM), which uses the entire image rather than only local information and stands out as the state of the art. Its main problem, however, is its high computational complexity, which makes it practically unusable in real-time applications, even for small images.
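As a rough illustration of why this complexity is prohibitive, the following is a minimal, deliberately unoptimized sketch of the NLM filter in Python/NumPy. It is not the thesis' FPGA or GPU implementation; the patch size, search window, and filtering parameter h are assumptions chosen for readability.

```python
# Minimal non-local means sketch: each pixel is replaced by a weighted average of
# pixels whose surrounding patches look similar to its own patch.
import numpy as np

def nlm_denoise(img, patch=3, search=10, h=0.1):
    img = img.astype(float)
    pad = patch // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img)
    rows, cols = img.shape
    for i in range(rows):
        for j in range(cols):
            ref = padded[i:i + patch, j:j + patch]           # patch around (i, j)
            i0, i1 = max(0, i - search), min(rows, i + search + 1)
            j0, j1 = max(0, j - search), min(cols, j + search + 1)
            weights, acc = 0.0, 0.0
            for m in range(i0, i1):
                for n in range(j0, j1):
                    cand = padded[m:m + patch, n:n + patch]  # candidate patch
                    d2 = np.mean((ref - cand) ** 2)          # patch distance
                    w = np.exp(-d2 / (h * h))                # similarity weight
                    weights += w
                    acc += w * img[m, n]
            out[i, j] = acc / weights
    return out
```

The four nested loops make the per-pixel cost explicit (search-window area times patch area), which is precisely the bottleneck the accelerator card and the GPU implementation attack.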
352

Restauração de imagens de microscopia de força atômica com uso da regularização de Tikhonov via processamento em GPU / Image restoration from atomic force microscopy using the Tikhonov regularization via GPU processing

Augusto Garcia Almeida 04 March 2013 (has links)
Image restoration is a technique with applications in several areas, e.g., medicine, biology, and electronics, where one of the goals is to improve the final appearance of images of samples that, for some reason, show imperfections or blurring. Images obtained with an atomic force microscope are blurred by the interaction forces between the microscope tip and the sample under study; moreover, they exhibit additive noise caused by the environment. This thesis proposes a way to parallelize, on a GPU, a serial algorithm for restoring atomic force microscopy images based on Tikhonov regularization.
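For context, a minimal sketch of Tikhonov-regularized deconvolution in the frequency domain is shown below. It assumes a known blur kernel (PSF) and an identity regularization operator, and it only illustrates the regularization idea; it is not the serial algorithm that the thesis parallelizes.

```python
# Tikhonov-regularized inverse filter: damp frequencies where the blur transfer
# function is weak instead of dividing by (near-)zero, which would amplify noise.
import numpy as np

def tikhonov_restore(blurred, psf, lam=1e-2):
    H = np.fft.fft2(psf, s=blurred.shape)        # transfer function of the blur
    Y = np.fft.fft2(blurred)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + lam)  # regularized inverse filter
    return np.real(np.fft.ifft2(X))
```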
353

Reconstrução de imagens por tomografia por impedância elétrica utilizando recozimento simulado massivamente paralelizado. / Image reconstruction through electrical impedance tomography using massively parallelized simulated annealing.

Renato Seiji Tavares 06 May 2016 (has links)
Electrical impedance tomography (EIT) is a recent medical imaging modality with remarkable advantages over other established modalities. Simulated annealing is an algorithm that delivers quality solutions even with a simple regularization method and no a priori information; however, its processing time still needs to be reduced. This work takes a step in that direction by presenting a reconstruction method that uses simulated annealing and massive GPU parallelization. The parallelization of matrix operations on the GPU is explained, with a thread scheduling strategy that allows the effective parallelization of algorithms previously considered non-parallelizable. Techniques for accelerating it, such as the proposed outside-in heuristic, are discussed. A new sparse matrix representation is proposed, tailored to the characteristics of the CUDA architecture, with improved global-memory access patterns and thread utilization; this representation proved advantageous compared with the most commonly used formats. The massive parallelization of the EIT inverse problem using simulated annealing is then studied, with a proposed hybrid approach that parallelizes work on both the CPU and the GPU. The performance gain for the inverse problem is higher than that obtained for the forward problem: the GPU saturates at meshes of approximately 7,000 nodes, beyond which the speedup over serial implementations is roughly 5x. GPU parallelization is therefore viable for the reconstruction of electrical impedance tomography images.
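The skeleton below shows the generic simulated annealing loop underlying this kind of reconstruction. The cost function, the perturbation move, and the cooling schedule are placeholders rather than the thesis' actual EIT formulation.

```python
# Generic simulated annealing loop: accept improving moves always, worsening moves
# with a Boltzmann probability that shrinks as the temperature cools.
import math, random

def simulated_annealing(x0, cost, perturb, t0=1.0, alpha=0.99, iters=10000):
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(iters):
        y = perturb(x)        # candidate solution, e.g. a modified conductivity map
        fy = cost(y)          # e.g. misfit between measured and simulated voltages
        if fy < fx or random.random() < math.exp((fx - fy) / t):
            x, fx = y, fy
        if fx < fbest:
            best, fbest = x, fx
        t *= alpha            # geometric cooling schedule
    return best, fbest
```

In the EIT setting each cost evaluation requires solving the forward problem, which is why evaluating candidates is expensive and worth offloading to the GPU.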
354

Junção de conjuntos por similaridade explorando paralelismo multinível em GPUs / Set similarity joins exploring multilevel parallelism on GPUs

Ribeiro Junior, Sidney 29 August 2017 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / Set similarity join is an important operation for information retrieval, near-duplicate detection, data analysis, and related tasks. State-of-the-art similarity join algorithms use a technique known as prefix filtering to reduce the number of pairs that must be fully compared, by discarding dissimilar pairs in advance. However, prefix filtering is only effective when looking for very similar data. An alternative for speeding up the similarity join when prefix filtering is not efficient is to exploit parallelism. In this work, three multilevel fine-grained parallel algorithms were developed for many-core architectures (such as modern graphics processing units) to solve the similarity join problem. On standard real text databases, the proposed algorithms achieved speedups of up to 109x and 17x over the sequential (ppjoin) and parallel (fgssjoin) state-of-the-art solutions, respectively.
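The sketch below illustrates the prefix-filtering idea for a Jaccard threshold t: two sets can only reach similarity t if their prefixes, taken in a canonical token order, share at least one token. The token ordering and the verification step are simplified; this is not the ppjoin or fgssjoin implementation itself.

```python
# Prefix-filtering similarity join sketch: index only the prefixes, probe them to
# generate candidate pairs, and verify candidates with the exact Jaccard similarity.
from collections import defaultdict
from math import ceil

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_join(sets, t=0.8):
    # Canonical token order: rarest tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)              # inverted index over prefix tokens
    results = []
    for i, s in enumerate(ordered):
        prefix_len = len(s) - ceil(t * len(s)) + 1
        candidates = set()
        for tok in s[:prefix_len]:
            candidates.update(index[tok])  # probe before inserting self
            index[tok].append(i)
        for j in candidates:               # verification of surviving pairs
            if jaccard(s, ordered[j]) >= t:
                results.append((j, i))
    return results
```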
355

Agrupando dados e kernels de um simulador cardíaco em um ambiente multi-GPU / Coalescing data and kernels of a cardiac simulator in a multi-GPU environment

Cordeiro, Raphael Pereira 10 March 2017 (has links)
Computational modeling is a useful tool for studying many complex phenomena, such as the electrical and mechanical behavior of the heart under normal and pathological conditions, and it is important for the development of new drugs and treatments for heart disease. The high complexity of the associated biophysical processes translates into complex mathematical and computational models, which in turn yield cardiac simulators that demand considerable computational power. Therefore, most state-of-the-art cardiac simulators are implemented to run on parallel architectures. In this work, a new coalesced data and kernel scheme is evaluated, whose objective is to reduce the execution cost of cardiac simulations that run on multi-GPU environments. The new scheme was tested on an important part of the simulator, the solution of the systems of ordinary differential equations (ODEs). The results show that the proposed scheme is very effective: the execution time for solving the ODE systems on the multi-GPU environment was cut in half compared with a scheme that does not implement the proposed data and kernel coalescing. As a result, the total execution time of the cardiac simulations was reduced by up to 25%.
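The idea behind batching the per-cell ODE work can be pictured with the vectorized sketch below: state variables are stored contiguously across cells so that one update advances every cell at once, which on a GPU corresponds to coalesced memory accesses. The two-variable FitzHugh-Nagumo-like model is a stand-in, not the cardiac cell model used in the thesis.

```python
# Structure-of-arrays layout: v[k] and w[k] are the state of cell k, and one
# vectorized explicit Euler step advances all cells simultaneously.
import numpy as np

def advance_cells(v, w, dt=0.01, steps=100):
    for _ in range(steps):
        dv = v - v ** 3 / 3.0 - w          # FitzHugh-Nagumo-like right-hand side
        dw = 0.08 * (v + 0.7 - 0.8 * w)
        v = v + dt * dv                    # one Euler step for every cell at once
        w = w + dt * dw
    return v, w

n_cells = 1_000_000
v, w = advance_cells(np.full(n_cells, -1.0), np.zeros(n_cells))
```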
356

A Haptic Device Interface for Medical Simulations using OpenCL / Ett haptiskt gränssnitt för medicinska simuleringar med OpenCL

Machwirth, Mattias January 2013 (has links)
The project evaluates how well a haptic device can be used to interact with a visualization of volumetric data. Since the interface to the haptic device requires explicit surface descriptions, triangles had to be constructed from the volumetric data, using the marching cubes algorithm. The triangles produced by marching cubes are then transmitted to the haptic device to enable force feedback. Marching cubes is well suited to parallelization, and it was executed using OpenCL; graphs in the report show that the parallel version ran almost 70 times faster than the sequential CPU implementation of the same algorithm. With further development, the project could give medical students the opportunity to practice difficult procedures on a realistic and accurate simulation instead of on a real patient.
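A CPU-side sketch of the triangle-extraction step is shown below, using scikit-image's marching cubes rather than the project's OpenCL kernel; the synthetic sphere volume and the iso-level are assumptions for illustration.

```python
# Marching cubes turns a scalar volume into an explicit triangle mesh that a
# haptic interface (or any renderer) can consume.
import numpy as np
from skimage import measure

# Synthetic volume: distance from the origin, sampled on a 64^3 grid.
x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
volume = np.sqrt(x**2 + y**2 + z**2)

verts, faces, normals, values = measure.marching_cubes(volume, level=0.5)
print(f"{len(verts)} vertices, {len(faces)} triangles")  # mesh handed to the haptic API
```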
357

Approche de conception haut-niveau pour l'accélération matérielle de calcul haute performance en finance / High-level approach for hardware acceleration of high-performance computing in finance

Mena morales, Valentin 12 July 2017 (has links)
The need for resources in high-performance computing (HPC) is generally met by scaling up server farms, to the detriment of the energy consumption of such a solution. Accelerating HPC applications on heterogeneous platforms, such as FPGAs or GPUs, offers a better architectural compromise, since such platforms can reduce the energy consumption of a deployed system. However, this heterogeneous acceleration requires a change of programming paradigm, which translates into an increased level of programming complexity for software experts. This is most notably the case for developers in quantitative finance: applications in this field are constantly evolving and increasing in complexity to stay competitive and to comply with legislative changes, which puts even more pressure on the programmability of acceleration solutions. In this context, the use of high-level development and design flows, such as high-level synthesis (HLS) for programming FPGAs, is not enough. A domain-specific approach can help to reach performance requirements without impairing the programmability of accelerated applications. This thesis proposes a high-level design approach that relies on OpenCL as a heterogeneous programming standard, and more precisely on Altera's recent implementation of OpenCL for FPGAs. Four main contributions are made: (1) an initial study of the integration of hardware computing cores into a software library for quantitative finance (QuantLib); (2) an exploration of different architectures and their respective performance, as well as the design of a dedicated architecture for the pricing of American options and for implied volatility computation, based on a high-level design flow; (3) a detailed characterization of an Altera OpenCL platform, from elementary operators, memory accesses, and control overlays up to the communication links it is made of; (4) a compilation flow specific to the quantitative finance domain, relying on this characterization and on a description of the considered financial applications (option pricing).
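To make the computational pattern concrete, here is a minimal Monte Carlo pricer for a European call under geometric Brownian motion, a much simpler relative of the American-option and implied-volatility kernels targeted by the thesis. All parameter values are assumptions.

```python
# Monte Carlo pricing of a European call: simulate terminal prices, average the
# discounted payoff. Each path is independent of the others.
import numpy as np

def mc_european_call(s0=100.0, k=105.0, r=0.02, sigma=0.2, t=1.0, n_paths=1_000_000):
    z = np.random.standard_normal(n_paths)
    st = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)  # terminal prices
    payoff = np.maximum(st - k, 0.0)                                     # call payoff
    return np.exp(-r * t) * payoff.mean()                                # discounted expectation

print(round(mc_european_call(), 4))
```

The independence of the simulated paths is what makes this class of workload map naturally onto deep FPGA pipelines or wide GPU thread grids.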
358

PROGRAMAÇÃO PARALELA HÍBRIDA PARA CPU E GPU: UMA AVALIAÇÃO DO OPENACC FRENTE A OPENMP E CUDA / HYBRID PARALLEL PROGRAMMING FOR CPU AND GPU: AN EVALUATION OF OPENACC AS RELATED TO OPENMP AND CUDA

Sulzbach, Maurício 22 August 2014 (has links)
As CPU and GPU architectures have advanced, the number of parallel programming APIs for both devices has grown in recent years. While OpenMP is used to write parallel programs for the CPU, CUDA and OpenACC are employed for parallel processing on the GPU. For GPU programming, CUDA presents a function-based model that makes source code lengthy and error-prone and leads to low development productivity. OpenACC emerged to address these problems and to serve as an alternative to CUDA. Similar to OpenMP, this API provides directives that ease the development of parallel applications, but for execution on the GPU. To further increase performance and take advantage of the parallelism of both the CPU and the GPU, it is possible to develop hybrid algorithms that split the processing between the two devices. In that sense, the main objective of this work is to verify whether the advantages that OpenACC introduces also carry over to hybrid programming with OpenMP, compared with the OpenMP + CUDA model. A second objective is to identify aspects of the two programming models that could limit performance or application development. To accomplish these goals, this work presents the development of three hybrid parallel algorithms based on the Rodinia benchmark algorithms RNG, Hotspot, and SRAD, using the hybrid models OpenMP + CUDA and OpenMP + OpenACC. In these algorithms, the CPU part of the code is programmed with OpenMP, while CUDA and OpenACC are responsible for the parallel processing on the GPU. After executing the hybrid algorithms, their performance, efficiency, and the division of processing between the two devices were analyzed. The runs showed that, with both proposed programming models, it was possible to outperform a parallel application written with a single API and executed on only one of the devices. In addition, in the hybrid algorithms RNG and Hotspot, CUDA's performance was superior to that of OpenACC, while in the SRAD algorithm OpenACC was faster than CUDA.
359

Métaheuristiques pour l'optimisation combinatoire sur processeurs graphiques (GPU) / Metaheuristics for combinatorial optimization on Graphics Processing Unit (GPU)

Delevacq, Audrey 04 February 2013 (has links)
Several combinatorial optimization problems are NP-hard and can only be solved optimally by exact algorithms for small instances. Metaheuristics have proved effective at solving many of these problems by finding approximate solutions in a reasonable time. However, for large instances they may require considerable computation time and memory to explore the search space efficiently. Therefore, interest in deploying them on high-performance computing architectures has increased over the past years. Existing parallelization approaches generally follow the message-passing and shared-memory computing paradigms, which are suitable for traditional architectures based on microprocessors, also called CPUs (Central Processing Units). However, research in the field of parallel computing is evolving rapidly and new architectures are emerging, including hardware accelerators that offload some tasks from the CPU. Among them, graphics processors, or GPUs (Graphics Processing Units), have a massively parallel architecture with great potential, but they also raise new algorithmic and programming challenges. Indeed, existing parallelization models for metaheuristics are generally unsuited to GPU-like computing environments; some works have addressed this subject, but without providing a comprehensive and fundamental view of it. The general purpose of this thesis is to propose a framework for the effective implementation of metaheuristics on parallel architectures based on GPUs. It begins with a state of the art describing existing works on the GPU parallelization of metaheuristics and general classifications of parallel metaheuristics. An original taxonomy is then designed to classify the identified implementations and to formalize GPU parallelization strategies within a coherent methodological framework. This thesis also aims to validate the taxonomy by exploiting its main components to propose original parallelization strategies specifically tailored to GPU architectures. Several effective implementations based on the Ant Colony Optimization and Iterated Local Search metaheuristics are thus proposed for solving the Travelling Salesman Problem. A structured and thorough experimental study is conducted to evaluate and compare the performance of the approaches, both in terms of solution quality and of computing time reduction.
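As a point of reference, the sketch below shows the serial structure of an Iterated Local Search for the TSP (2-opt local search, double-bridge perturbation, improvement-only acceptance), one of the two metaheuristics the thesis parallelizes on GPUs. The instance data and the iteration budget are assumptions.

```python
# Iterated Local Search for the TSP: repeatedly perturb the best tour and re-run
# 2-opt local search, keeping the perturbed result only if it improves the tour.
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt(tour, dist):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 2):
            for j in range(i + 1, len(tour) - 1):
                a, b, c, d = tour[i - 1], tour[i], tour[j], tour[j + 1]
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour[i:j + 1] = reversed(tour[i:j + 1])  # reverse the segment
                    improved = True
    return tour

def double_bridge(tour):
    # Perturbation: a 4-opt move that 2-opt cannot undo in a single step.
    i, j, k = sorted(random.sample(range(1, len(tour)), 3))
    return tour[:i] + tour[j:k] + tour[i:j] + tour[k:]

def iterated_local_search(dist, iters=50):
    best = two_opt(list(range(len(dist))), dist)
    for _ in range(iters):
        candidate = two_opt(double_bridge(best), dist)
        if tour_length(candidate, dist) < tour_length(best, dist):
            best = candidate           # accept only improving restarts
    return best
```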
360

Efficient Execution Of AMR Computations On GPU Systems

Raghavan, Hari K 11 1900 (has links) (PDF)
Adaptive Mesh Refinement (AMR) is a method that dynamically varies the spatio-temporal resolution of localized mesh regions in numerical simulations, based on the strength of the solution features. By discretizing only localized regions of interest at high resolution, into rectangular mesh units called patches, AMR provides low computational cost and a high degree of accuracy. General-purpose graphics processing units (GPGPUs), with their support for fine-grained parallelism, offer an attractive option for obtaining high performance for AMR applications, since the data-parallel computations of AMR's finite difference schemes can be performed efficiently on them. This research addresses the challenges of, and develops techniques for, the efficient execution of AMR applications with uniform and non-uniform patches on GPUs. In the first part of the thesis, we optimize an AMR model with uniform patches and develop strategies for continuous online visualization of time-evolving data for AMR applications executed on GPUs. In-situ visualization plays an important role in analyzing the time-evolving characteristics of the domain structures, and continuous visualization of the output data across time steps allows a better study of the underlying domain and of the model used to simulate it. We reorder the meshes for computation on the GPU based on user input about the subdomain to be visualized, which makes the data available for visualization at a faster rate. We then execute the visualization steps and the fix-up operations on the coarse meshes asynchronously on the CPUs while the GPU advances the solution. Experiments on Tesla S1070 and Fermi C2070 clusters show that our strategies yield up to 60% improvement in response time and 16% improvement in the frame visualization rate over the existing strategy of performing fix-ups and visualization at the end of the time steps. The second part of the thesis deals with adaptive strategies for the efficient execution of block-structured AMR applications with non-uniform patches on GPUs. Most AMR approaches use patches of uniform size over regions of interest. Since this leads to over-refinement, some efforts have focused on forming patches of non-uniform dimensions to improve computational efficiency, because the dimensions of a patch can then be tuned to the geometry of a region of interest. While effective hybrid execution strategies exist for applications with uniform patches, our work considers the efficient execution of non-uniform patches with different workloads. Our techniques include a geometric bin-packing method to load-balance GPU computations and reduce thread idling, adaptive determination of the amount of work to maximize asynchronism between CPU and GPU executions using a knapsack formulation, and the scheduling of communications for multi-GPU executions. We test our strategies on synthetic inputs as well as on traces from real applications. Our experiments on Tesla S1070 and Fermi C2070 clusters, with both single-GPU and multi-GPU executions, show that our strategies yield up to 69% improvement in performance over existing strategies: bin-packing-based load balancing gives performance gains of up to 39%, kernel optimizations give an improvement of up to 20%, and our strategies for adaptive asynchronism between CPU and GPU executions give performance improvements of up to 17% over default static asynchronous executions.
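A minimal sketch of the load-balancing idea is given below, using a first-fit-decreasing bin-packing heuristic to group patch workloads under a capacity limit. The bin capacity and the workload numbers are illustrative assumptions; the thesis uses a geometric bin-packing formulation tied to patch dimensions.

```python
# First-fit decreasing: place the largest workloads first, each into the first bin
# that still has room, opening a new bin only when none fits. Grouping non-uniform
# patch workloads this way reduces thread idling on the GPU.
def first_fit_decreasing(workloads, capacity):
    bins = []                                   # each bin: (remaining_capacity, items)
    for w in sorted(workloads, reverse=True):
        for i, (remaining, items) in enumerate(bins):
            if w <= remaining:
                bins[i] = (remaining - w, items + [w])
                break
        else:
            bins.append((capacity - w, [w]))    # open a new bin for this workload
    return [items for _, items in bins]

patches = [512, 96, 256, 128, 64, 384, 192, 448]   # e.g. cells per patch (assumed)
for b in first_fit_decreasing(patches, capacity=512):
    print(b, "->", sum(b))
```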
