41

Mapping cohesive fracture and fragmentation simulations to GPUs

Monteiro, Andrei Alhadeff 11 February 2016 (has links)
A GPU-based computational framework is presented to deal with dynamic failure events simulated by means of cohesive zone elements. We employ a novel topological data structure, simplified relative to CPU implementations and specialized for meshes with triangles or tetrahedra, designed to run efficiently and minimize memory requirements on the GPU. We present a parallel, adaptive and distributed explicit dynamics code that implements an extrinsic cohesive zone formulation, where elements are inserted on the fly, where and when needed. The main challenge in implementing such a framework resides in being able to dynamically adapt the mesh in a consistent way, inserting cohesive elements on fractured facets and inserting or removing bulk elements and nodes in the adaptive mesh modification case. We present strategies to refine and coarsen the mesh to handle dynamic mesh modification simulations on the GPU. We use a reduced-scale version of the experimental specimen in the adaptive fracture simulations to demonstrate the impact of variation in floating-point operations on the final fracture pattern. A novel strategy to duplicate ghost nodes when distributing the simulation across compute nodes containing one GPU each is also presented. Results from parallel simulations show an increase in performance when adopting strategies such as distributing different jobs amongst threads for the same element and launching many threads per element. To avoid concurrency when accessing shared entities, we employ graph coloring for non-adaptive meshes and nodal traversal for the adaptive case. Experiments show that GPU efficiency increases with the number of nodes and bulk elements.
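To make the concurrency-avoidance idea concrete, the following is a minimal, hypothetical CUDA sketch of the graph-coloring pattern: elements are grouped by color so that no two elements in the same kernel launch share a node, letting each launch scatter nodal forces without atomics. The mesh layout, the placeholder force value, and all names are illustrative assumptions, not the thesis's code.

```cuda
#include <cuda_runtime.h>

// Force accumulation for one color class: elements of the same color share
// no nodes, so the scatter below needs no atomic operations.
__global__ void accumulateForces(const int* elemNodes, int first, int count,
                                 float* nodeForce)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= count) return;
    int base = 3 * (first + e);              // 3 node ids per triangle
    float f = 1.0f;                          // placeholder for the real element force
    for (int i = 0; i < 3; ++i)
        nodeForce[elemNodes[base + i]] += f; // race-free within one color
}

// Host loop: one launch per color; launches on the same stream serialize,
// so no two concurrently running threads ever touch the same node.
void assembleByColor(const int* d_elemNodes, float* d_nodeForce,
                     const int* colorOffset, int numColors)
{
    for (int c = 0; c < numColors; ++c) {
        int first = colorOffset[c];
        int count = colorOffset[c + 1] - first;
        int threads = 128, blocks = (count + threads - 1) / threads;
        accumulateForces<<<blocks, threads>>>(d_elemNodes, first, count,
                                              d_nodeForce);
    }
    cudaDeviceSynchronize();
}
```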
42

Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms

Ren, Bin 09 September 2014 (has links)
No description available.
43

Accelerated sampling of energy landscapes

Mantell, Rosemary Genevieve January 2017 (has links)
In this project, various computational energy landscape methods were accelerated using graphics processing units (GPUs). Basin-hopping global optimisation was treated using a version of the limited-memory BFGS algorithm adapted for CUDA, in combination with GPU-acceleration of the potential calculation. The Lennard-Jones potential was implemented using CUDA, and an interface to the GPU-accelerated AMBER potential was constructed. These results were then extended to form the basis of a GPU-accelerated version of hybrid eigenvector-following. The doubly-nudged elastic band method was also accelerated using an interface to the potential calculation on GPU. Additionally, a local rigid body framework was adapted for GPU hardware. Tests were performed for eight biomolecules represented using the AMBER potential, ranging in size from 81 to 22,811 atoms, and the effects of minimiser history size and local rigidification on the overall efficiency were analysed. Improvements relative to CPU performance of up to two orders of magnitude were obtained for the largest systems. These methods have been successfully applied to both biological systems and atomic clusters. An existing interface between a code for free energy basin-hopping and the SuiteSparse package for sparse Cholesky factorisation was refined, validated and tested. Tests were performed for both Lennard-Jones clusters and selected biomolecules represented using the AMBER potential. Significant acceleration of the vibrational frequency calculations was achieved, with negligible loss of accuracy, relative to the standard diagonalisation procedure. For the larger systems, exploiting sparsity reduces the computational cost by factors of 10 to 30. The acceleration of these computational energy landscape methods opens up the possibility of investigating much larger and more complex systems than previously accessible. A wide array of new applications is now computationally feasible.
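As a flavour of the potential evaluation being offloaded, here is a hedged CUDA sketch of a Lennard-Jones energy kernel in reduced units (epsilon = sigma = 1). It is not the thesis's implementation (which also interfaces with the GPU-accelerated AMBER potential); the all-pairs O(N²) loop, the absence of a cutoff or periodic boundaries, and all names are simplifying assumptions. A reduction over the per-thread partial sums then yields the total energy fed to the L-BFGS minimiser.

```cuda
#include <cuda_runtime.h>

// Each thread i sums its pair energies with j > i so every pair is
// counted exactly once; partial[i] holds that thread's contribution.
__global__ void ljEnergy(const double3* pos, int n, double* partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double e = 0.0;
    for (int j = i + 1; j < n; ++j) {
        double dx = pos[i].x - pos[j].x;
        double dy = pos[i].y - pos[j].y;
        double dz = pos[i].z - pos[j].z;
        double r2 = dx*dx + dy*dy + dz*dz;
        double inv6 = 1.0 / (r2 * r2 * r2);   // (sigma/r)^6 in reduced units
        e += 4.0 * inv6 * (inv6 - 1.0);       // 4(r^-12 - r^-6)
    }
    partial[i] = e;   // host- or device-side reduction follows
}
```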
44

CUDA parallelization of the Aho-Corasick algorithm using the GPU memory hierarchy and a new compaction of the State Transition Table

Silva Júnior, José Bonifácio da 21 June 2017 (has links)
An Intrusion Detection System (IDS) needs to compare the contents of all packets arriving at the network interface against a set of signatures indicating possible attacks, a task that consumes much CPU processing time. To alleviate this problem, researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to the GPU. This dissertation parallelizes the Brute Force and Aho-Corasick string matching algorithms and proposes a new compaction of the State Transition Table of the Aho-Corasick algorithm, making it possible to place the table in shared memory and accelerate string comparison. The two algorithms were parallelized using the NVIDIA CUDA platform and executed across the GPU memories to allow a comparative analysis of the performance of those memories. Initially, the AC algorithm proved faster than the Brute Force algorithm and was therefore selected for optimization. The compacted AC algorithm, executed in parallel in shared memory, achieved a performance gain of 15% over the other GPU memories and ran 48 times faster than its serial version when tested with real network packets. When tested with synthetic (less random) data, the gain reached 73% and the parallel algorithm ran 56 times faster than its serial version. Compaction in shared memory is thus a suitable solution for accelerating IDSs that require fast pattern matching.
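A rough CUDA sketch of the shared-memory lookup pattern evaluated above follows. The dense next-state layout shown here is an illustrative assumption, not the dissertation's compaction scheme (whose point is precisely to shrink the table so realistic automata fit in the small shared memory), and the chunk overlap a real IDS needs to catch boundary-crossing matches is only noted in a comment. The kernel would be launched with numStates * ALPHABET * sizeof(int) bytes of dynamic shared memory.

```cuda
#include <cuda_runtime.h>

#define ALPHABET 256

__global__ void acSearch(const unsigned char* text, int textLen,
                         const int* d_next,   // next[state * ALPHABET + byte]
                         const int* d_match,  // nonzero if state emits a pattern
                         int numStates, int* hits)
{
    // Stage the transition table into shared memory cooperatively.
    extern __shared__ int s_next[];
    for (int k = threadIdx.x; k < numStates * ALPHABET; k += blockDim.x)
        s_next[k] = d_next[k];
    __syncthreads();

    // Each thread scans one contiguous slice of the input. In a real IDS the
    // slices must overlap by the longest pattern length so matches that cross
    // a boundary are not missed; omitted here for brevity.
    int total = gridDim.x * blockDim.x;
    int chunk = (textLen + total - 1) / total;
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;

    int state = 0;
    for (int p = begin; p < begin + chunk && p < textLen; ++p) {
        state = s_next[state * ALPHABET + text[p]];
        if (d_match[state]) atomicAdd(hits, 1);
    }
}
```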
45

American option pricing and computation of the portfolio Value at Risk on a heterogeneous GPU-CPU cluster

Benguigui, Michaël 27 August 2015 (has links)
The research work described in this thesis aims at speeding up the pricing of complex financial instruments, such as an American option on a basket of realistic size (e.g. 40 underlying assets), by leveraging the parallel processing power of Graphics Processing Units. To this end, we start from previous work that distributed the pricing algorithm of J. Picazo, based on Monte Carlo simulation and machine learning. We propose an adaptation of this distributed algorithm to a single GPU, which halves the computation time of the earlier version distributed over a 64-core CPU cluster when pricing a 40-asset basket American option. Still, pricing this realistic-size option requires several hours. We therefore extend this first contribution to target a cluster of heterogeneous devices, mixing GPUs and CPUs programmed in OpenCL. This drastically accelerates the pricing time, even though the training of the various classification methods we experimented with (AdaBoost, SVM) remains centralized and thus constitutes a bottleneck. To remedy this, we evaluate a distributable classification method based on Random Forests, which makes our approach scalable. The last part reuses these two contributions to compute the Value at Risk of a portfolio of options on a heterogeneous hybrid cluster.
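For illustration only, the Monte Carlo primitive underlying such pricers, simulating terminal prices of a basket under geometric Brownian motion, might look like the CUDA kernel below. Picazo's algorithm adds a classification phase on top of such simulations to estimate the exercise boundary; the absence of asset correlations and all names here are assumptions of this sketch.

```cuda
#include <curand_kernel.h>

// One thread simulates one Monte Carlo path for all assets in the basket.
__global__ void gbmPaths(float* ST, const float* S0, const float* mu,
                         const float* sigma, int nAssets, int nPaths,
                         int nSteps, float dt, unsigned long long seed)
{
    int path = blockIdx.x * blockDim.x + threadIdx.x;
    if (path >= nPaths) return;

    curandState rng;
    curand_init(seed, path, 0, &rng);   // independent substream per path

    for (int a = 0; a < nAssets; ++a) {
        float s     = S0[a];
        float drift = (mu[a] - 0.5f * sigma[a] * sigma[a]) * dt;
        float vol   = sigma[a] * sqrtf(dt);
        for (int t = 0; t < nSteps; ++t)
            s *= expf(drift + vol * curand_normal(&rng));
        ST[path * nAssets + a] = s;     // terminal price of asset a on this path
    }
}
```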
46

Contributions to parallel stochastic simulation: Application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte-Carlo simulations

Passerat-Palmbach, Jonathan 11 October 2013 (has links) (PDF)
The race for computing power intensifies every day in the simulation community. A few years ago, scientists started to harness the computing power of Graphics Processing Units (GPUs) to parallelize their simulations. As with any parallel architecture, not only must the simulation model implementation be ported to the new platform; all the supporting tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major elements of the implementation is the source of pseudorandom numbers. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it must be done with caution so as not to introduce bias into the simulation results. This problem, known as pseudorandom stream distribution, has been studied for as long as parallel architectures have been available. While the literature is full of solutions for pseudorandom stream distribution on CPU-based parallel platforms, the young GPU programming community cannot yet draw on the same experience. In this thesis, we study how to correctly distribute pseudorandom streams on GPU. From the existing solutions, we identified a need for good software engineering practice coupled with sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to GPU, and put this advice into practice in a software library called ShoveRand. This library is used in a stochastic polymer folding model that we have implemented in C++/CUDA. Pseudorandom stream distribution on manycore architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications using pseudorandom numbers and task frameworks. Finally, we share a reflection on how to choose the right parallel platform for a given application. We propose to automatically build prototypes of the parallel application running on a wide set of architectures. This approach relies on existing software engineering tools from the Java and Scala communities, most of them generating OpenCL source code from a high-level abstraction layer.
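Below is a minimal sketch of the "one independent substream per thread" discipline the thesis advocates, expressed here with cuRAND's sequence parameter. ShoveRand wraps such choices behind a C++ template interface; the kernel below is an illustrative assumption, not ShoveRand's actual API.

```cuda
#include <curand_kernel.h>

__global__ void draw(float* out, int nPerThread, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Same seed everywhere, but a distinct sequence id per thread: cuRAND
    // then guarantees non-overlapping substreams, which avoids the
    // inter-thread correlations that bias parallel stochastic simulations.
    curandState rng;
    curand_init(seed, /*sequence=*/tid, /*offset=*/0, &rng);

    for (int k = 0; k < nPerThread; ++k)
        out[tid * nPerThread + k] = curand_uniform(&rng);
}
```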
47

Accelerating sparse machine learning inference

Ashish Gondimalla (14214179) 17 May 2024 (has links)
Convolutional neural networks (CNNs) have become important workloads due to their impressive accuracy in tasks like image classification and recognition. Convolution operations are compute intensive, and this cost increases profoundly with newer and better CNN models. However, convolutions come with characteristics such as sparsity which can be exploited. In this dissertation, we propose three different works to capture sparsity for faster performance and reduced energy. The first work is an accelerator design called SparTen for improving two-sided sparsity (i.e., sparsity in both filters and feature maps) convolutions with fine-grained sparsity. SparTen identifies the efficient inner join as the key primitive for hardware acceleration of sparse convolution. In addition, SparTen proposes load balancing schemes for higher compute unit utilization. SparTen performs 4.7x, 1.8x and 3x better than a dense architecture, a one-sided architecture and SCNN, the previous state-of-the-art accelerator, respectively. The second work, BARISTA, scales up SparTen (and SparTen-like proposals) to a large-scale implementation with as many compute units as recent dense accelerators (e.g., Google's Tensor Processing Unit) to achieve the full speedups afforded by sparsity. However, at such large scales, buffering, on-chip bandwidth and compute utilization are highly intertwined: optimizing for one factor strains another and may invalidate some optimizations proposed in small-scale implementations. BARISTA proposes novel techniques to balance the three factors in large-scale accelerators. BARISTA performs 5.4x, 2.2x, 1.7x and 2.5x better than dense, one-sided, naively scaled two-sided and iso-area two-sided architectures, respectively. The last work, EUREKA, builds an efficient tensor core to execute dense, structured and unstructured sparsity without losing efficiency. EUREKA achieves this by proposing novel techniques to improve compute utilization by slightly tweaking operand stationarity. EUREKA achieves speedups of 5x and 2.5x, along with energy reductions of 3.2x and 1.7x, over dense and structured sparse execution respectively, while incurring area and power overheads of only 6% and 11.5%, respectively, over Ampere.
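To make the inner-join primitive concrete: when both operands are stored as a nonzero bitmask plus a packed array of values, the positions contributing to a sparse dot product are exactly the AND of the two masks. The CUDA device function below is a hypothetical software analogue of that idea for one 64-entry chunk; it is not SparTen's hardware, and the storage format and names are assumptions of this sketch.

```cuda
#include <cstdint>

// Sparse dot product of two 64-entry chunks stored as (bitmask, packed
// values): only positions set in both masks contribute.
__device__ float sparseInnerJoin(uint64_t maskA, const float* valsA,
                                 uint64_t maskB, const float* valsB)
{
    uint64_t both = maskA & maskB;   // joint nonzero positions
    float acc = 0.0f;
    while (both) {
        int pos = __ffsll(both) - 1;             // next joint nonzero (0-indexed)
        // Index into each packed array = number of earlier nonzeros in its mask.
        int ia = __popcll(maskA & ((1ULL << pos) - 1));
        int ib = __popcll(maskB & ((1ULL << pos) - 1));
        acc += valsA[ia] * valsB[ib];
        both &= both - 1;                        // clear the bit just handled
    }
    return acc;
}
```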
