131

[en] MANY-CORE FRAGMENTATION SIMULATION / [pt] IMPLEMENTAÇÃO DE SIMULAÇÃO DE FRAGMENTAÇÃO EM ARQUITETURA DE MULTIPROCESSADORES

ANDREI ALHADEFF MONTEIRO 24 January 2017 (has links)
[en] A GPU-based computational framework is presented to deal with dynamic failure events simulated by means of cohesive zone elements. The work is divided into two parts. In the first part, we deal with pre-processing of the information and verify the effectiveness of dynamic insertion of cohesive elements in large meshes. To this effect, we employ a simplified topological data structure specialized for triangles. In the second part, we present an explicit dynamics code that implements an extrinsic cohesive zone formulation where the elements are inserted on-the-fly, when needed and where needed. The main challenge in implementing a GPU-based computational framework using an extrinsic cohesive zone formulation resides in being able to dynamically adapt the mesh in a consistent way, inserting cohesive elements on fractured facets. To handle that, we extend the conventional data structure used in finite element codes (based on element incidence) and store, for each element, references to the adjacent elements. To avoid concurrency when accessing shared entities, we employ the conventional strategy of graph coloring: in a pre-processing phase, each node of the dual graph (bulk element of the mesh) is assigned a color different from the colors of its adjacent nodes, so that elements of the same color can be processed in parallel without concurrency. All the procedures needed for the insertion of cohesive elements along fractured facets and for computing nodal properties are performed by threads assigned to triangles, invoking one kernel per color. Computations on existing cohesive elements are also performed based on the adjacent bulk elements.
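As a rough illustration of the one-kernel-per-color launch pattern described in this abstract (not code from the thesis; every array and function name below is invented for the example), a CUDA sketch could look like the following. Elements of one color share no nodes, so each color can be processed by its own kernel launch without atomic operations.

// Minimal sketch: one kernel launch per color so that elements sharing
// nodes are never processed concurrently. All names are illustrative.
#include <cuda_runtime.h>

__global__ void process_color(const int *elem_color, int color,
                              float *node_mass, const int *elem_nodes,
                              const float *elem_mass, int n_elems) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elems || elem_color[e] != color) return;
    // Each triangle scatters a third of its mass to its three nodes.
    // Same-color elements share no nodes, so no atomics are needed.
    for (int i = 0; i < 3; ++i) {
        int n = elem_nodes[3 * e + i];
        node_mass[n] += elem_mass[e] / 3.0f;
    }
}

void run_all_colors(int n_colors, int n_elems, const int *d_elem_color,
                    float *d_node_mass, const int *d_elem_nodes,
                    const float *d_elem_mass) {
    int block = 256, grid = (n_elems + block - 1) / block;
    for (int c = 0; c < n_colors; ++c) {
        // For simplicity every element is inspected on every launch and
        // non-matching colors return early.
        process_color<<<grid, block>>>(d_elem_color, c, d_node_mass,
                                       d_elem_nodes, d_elem_mass, n_elems);
        cudaDeviceSynchronize();  // optional on a single stream: launches already serialize
    }
}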
132

[en] NEURONAL CIRCUIT SPECIFICATION LANGUAGE AND TOOLS FOR MODELLING THE VIRTUAL FLY BRAIN / [pt] LINGUAGEM DE ESPECIFICAÇÃO DE CIRCUITO NEURONAL E FERRAMENTAS PARA MODELAGEM DO CÉREBRO VIRTUAL DA MOSCA DA FRUTA

DANIEL SALLES CHEVITARESE 03 May 2017 (has links)
[en] The brain of the fruit fly Drosophila melanogaster is an attractive model system for studying the logic of neural circuit function because it implements complex sensory-driven behavior with a nervous system comprising a number of neural components that is five orders of magnitude smaller than that of vertebrates. Analysis of the fly's connectome, or neural connectivity map, using the extensive toolbox of genetic manipulation techniques developed for Drosophila, has revealed that its brain comprises about 40 distinct modular subdivisions called Local Processing Units (LPUs), each of which is characterized by a unique internal information processing circuitry. LPUs can be regarded as the functional building blocks of the fly brain, since almost all identified LPUs have been found to correspond to anatomical regions of the fly brain associated with specific functional subsystems such as sensation and locomotion. We can therefore emulate the entire fly brain by integrating its constituent LPUs. Although our knowledge of the internal circuitry of many LPUs is far from complete, analyses of the LPUs that make up the fly's olfactory and vision systems suggest the existence of repeated canonical sub-circuits that are integral to the information processing functions provided by each LPU. The development of plausible LPU models therefore requires the ability to specify and instantiate sub-circuits without explicit reference to their constituent neurons and internal connections.
To this end, this work presents a framework to model and specify brain circuits, providing a neural circuit specification language called CircuitML, a Python API to handle CircuitML files, and an optimized connector to Neurokernel for the simulation of those LPUs on GPUs. CircuitML has been designed as an extension to NeuroML (NML), an XML-based neural model description language, and provides constructs for defining sub-circuits composed of neural primitives. Sub-circuits are endowed with interface ports that enable their connection to other sub-circuits via neural connectivity patterns.
133

Méthodes de génération automatique de code appliquées à l’algèbre linéaire numérique dans le calcul haute performance / Automatic code generation methods applied to numerical linear algebra in high performance computing

Masliah, Ian 26 September 2016 (has links)
Parallelism in today's computer architectures is ubiquitous, whether in supercomputers, workstations or portable devices such as smartphones. Exploiting these systems efficiently for a specific application requires a multidisciplinary effort spanning Domain Specific Languages (DSLs), code generation and optimization techniques, and application-specific numerical algorithms.
In this PhD thesis, we present a high-level programming method that takes into account the features of heterogeneous architectures and the properties of matrices to build a generic dense linear algebra solver. Our programming model supports both implicit and explicit data transfers to and from General-Purpose Graphics Processing Units (GPGPUs) and Integrated Graphics Processors (IGPs). As GPUs have become an asset in high performance computing, incorporating their use in general solvers is an important issue. Recent architectures such as IGPs also require further knowledge to be programmed efficiently. Our methodology aims at simplifying development on parallel architectures through the use of high-level programming techniques. As an example, we developed a least-squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries; it achieves performance similar to other mixed-precision algorithms. We extend our approach to a new multistage programming model that alleviates the interoperability problems between the CPU and GPU programming models. Our multistage approach is used to automatically generate GPU code for CPU-based element-wise expressions and parallel skeletons while allowing for type-safe program generation. We illustrate that this work can be applied to recent architectures and algorithms. The resulting code has been incorporated into a C++ library called NT2. Finally, we investigate how to apply high-level programming techniques to batched computations and tensor contractions. We start by explaining how to design a simple data container using modern C++14 programming techniques. Then, we study the issues around batched computations, memory locality and code vectorization to implement a highly optimized matrix-matrix product for small sizes using SIMD instructions. By combining a high-level programming approach and advanced parallel programming techniques, we show that we can outperform state-of-the-art numerical libraries.
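The batched small-matrix products mentioned at the end of this abstract can be sketched in CUDA as one thread block per matrix, with the operands staged in shared memory. The thesis itself generates CPU SIMD code through NT2, so the kernel below is only an assumed GPU analogue of the batched-GEMM idea, not the thesis implementation.

// Illustrative CUDA sketch of a batched product of small square matrices:
// one thread block per matrix, operands staged in shared memory.
#include <cuda_runtime.h>

template <int N>  // small, compile-time matrix size, e.g. N = 8
__global__ void batched_gemm(const float *A, const float *B, float *C,
                             int batch) {
    __shared__ float sA[N][N], sB[N][N];
    int m = blockIdx.x;                  // matrix index within the batch
    if (m >= batch) return;
    int row = threadIdx.y, col = threadIdx.x;
    sA[row][col] = A[m * N * N + row * N + col];
    sB[row][col] = B[m * N * N + row * N + col];
    __syncthreads();
    float acc = 0.0f;
    for (int k = 0; k < N; ++k) acc += sA[row][k] * sB[k][col];
    C[m * N * N + row * N + col] = acc;
}

// Launch example: batched_gemm<8><<<batch, dim3(8, 8)>>>(dA, dB, dC, batch);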
134

Context-aware automated refactoring for unified memory allocation in NVIDIA CUDA programs

Nejadfard, Kian 25 June 2021 (has links)
No description available.
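Since this record carries no abstract, the following is only a generic sketch of the kind of refactoring the title refers to: replacing explicit device allocation and host-device copies with CUDA unified memory (cudaMallocManaged). It is illustrative and not taken from the thesis.

// Sketch: the same computation written with explicit transfers and with
// unified memory. Illustrative only; error handling omitted.
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void before(float *host, int n) {              // explicit transfers
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

void after(int n) {                            // unified memory
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f; // CPU writes the same pointer
    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                    // ensure GPU is done before CPU reads
    cudaFree(data);
}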
135

Traitement de données multi-spectrales par calcul intensif et applications chez l'homme en imagerie par résonnance magnétique nucléaire / Processing of multi-spectral data by high performance computing and its applications on human nuclear magnetic resonance imaging

Angeletti, Mélodie 21 February 2019 (has links)
Functional magnetic resonance imaging (fMRI), being a non-invasive technique for studying the brain, has been employed to understand the cerebral mechanisms underlying food intake. Using liquid stimuli to simulate food intake adds difficulties which are not present in fMRI studies with visual stimuli. This PhD thesis aims to propose a robust method to analyse food-stimulated fMRI data. To correct the data for swallowing movements, we propose to censor the data based solely on the measured signal. We have also improved the normalization step between subjects to reduce signal loss. The main contribution of this thesis is an implementation of Ward's algorithm without prior data reduction, so that clustering the whole brain becomes feasible in a few hours. Because computing the Euclidean distance between every pair of voxel signals is the main part of Ward's algorithm, we have developed a cache-aware algorithm for this distance computation and parallelized it for three architectures: shared-memory, distributed-memory, and NVIDIA GPUs. Once Ward's algorithm has been applied, it is possible to explore the clustering at multiple scales. Several criteria are considered in order to evaluate the quality of the clusters. For a given number of clusters, we propose either to compute connectivity maps between clusters or to identify the clusters responding to the stimulation using the Pearson correlation coefficient.
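The pairwise Euclidean distance computation that dominates Ward's algorithm can be sketched on the GPU with one thread per voxel pair. The kernel below illustrates that step only; it is not the cache-aware or multi-backend code developed in the thesis, and its array names are assumptions.

// Sketch of pairwise squared Euclidean distances between voxel time series:
// one thread per (i, j) pair, filling the upper triangle of the matrix.
#include <cuda_runtime.h>

__global__ void pairwise_dist(const float *signals, // n_voxels x n_time, row-major
                              float *dist,          // n_voxels x n_voxels
                              int n_voxels, int n_time) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_voxels || j >= n_voxels || j <= i) return; // upper triangle only
    float d = 0.0f;
    for (int t = 0; t < n_time; ++t) {
        float diff = signals[i * n_time + t] - signals[j * n_time + t];
        d += diff * diff;
    }
    dist[i * n_voxels + j] = d;
}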
136

Trace-based Performance Analysis for Hardware Accelerators

Juckeland, Guido 05 February 2013 (has links)
This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has also led to an interest in hybrid computing architectures. High-end computers, workstations, and mobile devices are starting to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by hardware accelerator vendors cover the situation of one host using one device very well, yet they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications so that they also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.
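The generic performance API proposed in the thesis is implemented on top of NVIDIA CUPTI. The outline below shows only the underlying CUPTI activity-record mechanism (register buffer callbacks, enable kernel records, flush), not the thesis's own API; error handling is omitted and the exact kernel-record struct name varies between CUPTI releases, so treat it as a sketch.

// Minimal sketch of collecting kernel activity records with NVIDIA CUPTI.
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
    *size = 64 * 1024;                       // CUPTI fills this buffer with records
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;                      // no per-buffer record limit
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL) {
            // Struct name is version-dependent (Kernel4 here as an example).
            CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *)record;
            printf("kernel %s: %llu ns\n", k->name,
                   (unsigned long long)(k->end - k->start));
        }
    }
    free(buffer);
}

int main() {
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);
    // ... launch the application's CUDA kernels here ...
    cuptiActivityFlushAll(0);                // drain buffered activity records
    return 0;
}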
137

[pt] MAPEAMENTO DE SIMULAÇÃO DE FRATURA E FRAGMENTAÇÃO COESIVA PARA GPUS / [en] MAPPING COHESIVE FRACTURE AND FRAGMENTATION SIMULATIONS TO GPUS

ANDREI ALHADEFF MONTEIRO 11 February 2016 (has links)
[en] A GPU-based computational framework is presented to deal with dynamic failure events simulated by means of cohesive zone elements. We employ a novel topological data structure, simplified relative to the CPU implementation and specialized for meshes with triangles or tetrahedra, designed to run efficiently and minimize memory requirements on the GPU. We present a parallel, adaptive and distributed explicit dynamics code that implements an extrinsic cohesive zone formulation where the elements are inserted on-the-fly, when needed and where needed. The main challenge in implementing a GPU-based computational framework using an extrinsic cohesive zone formulation resides in being able to dynamically adapt the mesh, in a consistent way, by inserting cohesive elements on fractured facets and inserting or removing bulk elements and nodes in the adaptive mesh modification case. We present a strategy to refine and coarsen the mesh to handle dynamic mesh modification simulations on the GPU. We use a reduced-scale version of the experimental specimen in the adaptive fracture simulations to demonstrate the impact of variation in floating-point operations on the final fracture pattern. A novel strategy to duplicate ghost nodes when distributing the simulation over different compute nodes containing one GPU each is also presented. Results from parallel simulations show an increase in performance when adopting strategies such as distributing different jobs amongst threads for the same element and launching many threads per element. To avoid concurrency when accessing shared entities, we employ graph coloring for non-adaptive meshes and nodal traversal for the adaptive case. Experiments show that GPU efficiency increases with the number of nodes and bulk elements.
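One of the strategies mentioned above, launching many threads per element, can be sketched as a 2D thread block in which one index selects the element and the other selects one of its facets. The kernel below is illustrative only; its names and the facet-stress criterion are assumptions, not the thesis code.

// Sketch of the "many threads per element" launch pattern: threadIdx.y picks
// a triangle, threadIdx.x picks one of its three facets.
#include <cuda_runtime.h>

__global__ void check_facets(const float *facet_stress, int *facet_fractured,
                             float strength, int n_elems) {
    int e = blockIdx.x * blockDim.y + threadIdx.y;  // element index
    int f = threadIdx.x;                            // facet 0..2 of that triangle
    if (e >= n_elems) return;
    int id = 3 * e + f;
    // Each facet is examined by its own thread; a cohesive element would be
    // inserted later on facets flagged here.
    if (facet_stress[id] > strength) facet_fractured[id] = 1;
}

// Launch example: check_facets<<<(n_elems + 63) / 64, dim3(3, 64)>>>(d_stress, d_flag, s, n_elems);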
138

Towards an Efficient Spectral Element Solver for Poisson’s Equation on Heterogeneous Platforms / Mot en effektiv spektrala element-lösare för Poissons ekvation på heterogena plattformar

Nylund, Jonas January 2022 (has links)
Neko is a project at KTH to refactor the widely used fluid dynamics solver Nek5000 to support modern hardware. Many aspects of the solver need adapting for use on GPUs, and one such part is the main communication kernel, the Gather-Scatter (GS) routine. To avoid race conditions in this kernel, atomic operations are used, which can be inefficient. To avoid the use of atomics, elements were grouped in such a way that when multiple writes to the same address are necessary, they always come in blocks. This way, each block can be assigned to a single thread and handled sequentially, avoiding the need for atomic operations altogether. Within the scope of the thesis, a Poisson solver was also ported from the CPU to Nvidia GPUs. To optimise the Poisson solver, a batched matrix multiplication kernel was developed to efficiently perform small matrix multiplications in bulk and better utilise the GPU. Optimisations using shared memory and kernel unification were also made. The performance of the different implementations was tested on two systems using a GTX1660 and dual Nvidia A100s, respectively. The results show only small differences in performance between the two versions of the GS kernels when considering computational cost alone, and in a multi-rank setup the communication time completely overwhelms any potential difference. The shared-memory matrix multiplication kernel yielded around a 20% performance boost for the Poisson solver, and both versions vastly outperformed cuBLAS. The unified kernel also had a large positive impact on performance, yielding up to a 50% increase in throughput.
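A minimal sketch of the two gather-scatter strategies compared in this thesis is given below: an atomicAdd per contribution versus contributions pre-sorted into per-destination blocks so that a single thread sums each block sequentially and writes once. The CSR-like offset layout and all names are assumptions made for the example, not Neko's implementation.

// Two gather-scatter variants: atomic accumulation vs. grouped blocks.
#include <cuda_runtime.h>

__global__ void gs_atomic(const float *src, const int *dst_idx,
                          float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dst[dst_idx[i]], src[i]);   // may serialize on shared addresses
}

__global__ void gs_blocked(const float *src_sorted, const int *block_offset,
                           float *dst, int n_dst) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per destination
    if (d >= n_dst) return;
    float sum = 0.0f;
    for (int i = block_offset[d]; i < block_offset[d + 1]; ++i)
        sum += src_sorted[i];                         // all writes to dst[d] come in one block
    dst[d] += sum;                                    // single non-atomic write
}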
139

COMPARISON OF THE PERFORMANCE OF NVIDIA ACCELERATORS WITH SIMD AND ASSOCIATIVE PROCESSORS ON REAL-TIME APPLICATIONS

Shaker, Alfred M. 27 July 2017 (has links)
No description available.
140

Parallelized QC-LDPC Decoder on a GPU : An evaluation targeting LDPC codes adhering to the 5G standard / Paralleliserad QC-LDPC-avkodare på en GPU : En utvärdering av LDPC-koder som följer 5G-standarden

Hedlund, Olivia January 2024 (has links)
Over the last ten years, there has been steady growth in mobile network data traffic. The evolution leading to the development of 5G stands as a testament to the increased demand for high-speed networks. Channel coding plays a pivotal role in 5G networks, making it possible to recover messages from errors introduced when they are sent through the network. Channel decoding is, however, a time-consuming task for a receiver, and optimizing this process could therefore have a significant impact on receiver processing time. Low-Density Parity-Check (LDPC) codes are one of the channel coding schemes used in the 5G standard. These codes can benefit from parallel processing, making Graphics Processing Units (GPUs), with their parallel computation abilities, a possible platform for effective LDPC decoding. In this thesis, our goal is to evaluate a GPU as a platform for 5G LDPC decoding. The LDPC codes adhering to the 5G standard belong to the Quasi-Cyclic LDPC (QC-LDPC) subclass. Optimizations targeting this subclass, as well as other optimization techniques, are implemented in our thesis project to promote fast execution times. A GPU-based decoder is evaluated against a Central Processing Unit (CPU)-based decoder, written in Compute Unified Device Architecture (CUDA) and C++ respectively. The functionally equivalent decoders implement the layered offset Min-Sum Algorithm (MSA) with early termination to decode messages. Execution times for the decoders were measured while varying message size, Signal-to-Noise Ratio (SNR) and maximum iterations. Additionally, we evaluated the decoders with and without early termination, and also evaluated the GPU-based decoder when integrated into a MATLAB 5G channel simulator used by Tietoevry. The results from the experiments showed that the GPU-based decoder achieved up to 4.3 times faster execution than the CPU-based decoder for message sizes ranging from 3000 to 12000 bits. The GPU-based decoder, however, had a higher baseline execution time, making the CPU-based decoder faster for smaller message sizes. It was also concluded that the benefit of including early termination in the decoder generally outweighs the cost of the additional processing time.
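The check-node update of the offset Min-Sum Algorithm used by both decoders can be sketched in CUDA with one thread per check node: each thread finds the two smallest input magnitudes and the overall sign, then emits sign times max(min - offset, 0) on every edge, using the second minimum on the edge that supplied the first. The flat fixed-degree memory layout below is an assumed simplification, not the thesis implementation.

// Sketch of an offset min-sum check-node update, one thread per check node.
#include <cuda_runtime.h>
#include <math.h>

__global__ void min_sum_check(const float *v2c,   // variable-to-check LLRs, deg per check
                              float *c2v,         // check-to-variable messages (output)
                              int n_checks, int deg, float offset) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_checks) return;
    float min1 = INFINITY, min2 = INFINITY, sign = 1.0f;
    int arg_min = -1;
    for (int k = 0; k < deg; ++k) {
        float m = v2c[c * deg + k];
        sign *= (m < 0.0f) ? -1.0f : 1.0f;        // running product of signs
        float a = fabsf(m);
        if (a < min1) { min2 = min1; min1 = a; arg_min = k; }
        else if (a < min2) { min2 = a; }
    }
    for (int k = 0; k < deg; ++k) {
        // Exclude the receiving edge: use the second minimum where edge k gave the first.
        float mag = fmaxf(((k == arg_min) ? min2 : min1) - offset, 0.0f);
        float s = sign * ((v2c[c * deg + k] < 0.0f) ? -1.0f : 1.0f);
        c2v[c * deg + k] = s * mag;
    }
}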
