Global ETD Search

91	Optimizing All-to-All and Allgather Communications on GPGPU Clusters Singh, Ashish Kumar 25 June 2012 (has links) No description available. Computer Science GPGPU HPC Infiniband Collective Communications MPI
92	Applying Source Level Auto-Vectorization to Aparapi Java Albert, Frank Curtis 19 June 2014 (has links) Ever since chip manufacturers hit the power wall preventing them from increasing processor clock speed, there has been an increased push towards parallelism for performance improvements. This parallelism comes in the form of both data parallel single instruction multiple data (SIMD) instructions, as well as parallel compute cores in both central processing units (CPUs) and graphics processing units (GPUs). While these hardware enhancements offer potential performance enhancements, programs must be re-written to take advantage of them in order to see any performance improvement Some lower level languages that compile directly to machine code already take advantage of the data parallel SIMD instructions, but often higher level interpreted languages do not. Java, one of the most popular programming languages in the world, still does not include support for these SIMD instructions. In this thesis, we present a vector library that implements all of the major SIMD instructions in functions that are accessible to Java through JNI function calls. This brings the benefits of general purpose SIMD functionality to Java. This thesis also works with the data parallel Aparapi Java extension to bring these SIMD performance improvements to programmers who use the extension without any additional effort on their part. Aparapi already provides programmers with an API that allows programmers to declare certain sections of their code parallel. These parallel sections are then run on OpenCL capable hardware with a fallback path in the Java thread pool to ensure code reliability. This work takes advantage of the knowledge of independence of the parallel sections of code to automatically modify the Java thread pool fallback path to include the vectorization library through the use of an auto-vectorization tool created for this work. When the code is not vectorizable the auto-vectorizer tool is still able to offer performance improvements over the default fallback path through an improved looped implementation that executes the same code but with less overhead. Experiments conducted by this work illustrate that for all 10 benchmarks tested the auto-vectorization tool was able to produce an implementation that was able to beat the default Aparapi fallback path. In addition it was found that this improved fallback path even outperformed the GPU implementation for several of the benchmarks tested. / Master of Science Auto-Vectorization Aparapi Java GPGPU Computing SIMD Parallelism Threaded
93	Massively Parallel Hidden Markov Models for Wireless Applications Hymel, Shawn 03 January 2012 (has links) Cognitive radio is a growing field in communications which allows a radio to automatically configure its transmission or reception properties in order to reduce interference, provide better quality of service, or allow for more users in a given spectrum. Such processes require several complex features that are currently being utilized in cognitive radio. Two such features, spectrum sensing and identification, have been implemented in numerous ways, however, they generally suffer from high computational complexity. Additionally, Hidden Markov Models (HMMs) are a widely used mathematical modeling tool used in various fields of engineering and sciences. In electrical and computer engineering, it is used in several areas, including speech recognition, handwriting recognition, artificial intelligence, queuing theory, and are used to model fading in communication channels. The research presented in this thesis proposes a new approach to spectrum identification using a parallel implementation of Hidden Markov Models. Algorithms involving HMMs are usually implemented in the traditional serial manner, which have prohibitively long runtimes. In this work, we study their use in parallel implementations and compare our approach to traditional serial implementations. Timing and power measurements are taken and used to show that the parallel implementation can achieve well over 100Ã speedup in certain situations. To demonstrate the utility of this new parallel algorithm using graphics processing units (GPUs), a new method for signal identification is proposed for both serial and parallel implementations using HMMs. The method achieved high recognition at -10 dB Eb/N0. HMMs can benefit from parallel implementation in certain circumstances, specifically, in models that have many states or when multiple models are used in conjunction. / Master of Science Hidden Markov Models CUDA GPU GPGPU Parallel Processing Signal Recognition
94	On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL Sathre, Paul Daniel 12 June 2013 (has links) The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor). The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, there exist sufficient differences that make the translation non-trivial, providing practical limitations to both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of the sample body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support for the translator, followed by an evaluation that demonstrated the performance of our automatically-translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device. / Master of Science Source Translation Clang CUDA OpenCL GPU GPGPU Compilers
95	Characterization and Exploitation of GPU Memory Systems Lee, Kenneth Sydney 25 October 2012 (has links) Graphics Processing Units (GPUs) are workhorses of modern performance due to their ability to achieve massive speedups on parallel applications. The massive number of threads that can be run concurrently on these systems allow applications which have data-parallel computations to achieve better performance when compared to traditional CPU systems. However, the GPU is not perfect for all types of computation. The massively parallel SIMT architecture of the GPU can still be constraining in terms of achievable performance. GPU-based systems will typically only be able to achieve between 40%-60% of their peak performance. One of the major problems affecting this effeciency is the GPU memory system, which is tailored to the needs of graphics workloads instead of general-purpose computation. This thesis intends to show the importance of memory optimizations for GPU systems. In particular, this work addresses problems of data transfer and global atomic memory contention. Using the novel AMD Fusion architecture, we gain overall performance improvements over discrete GPU systems for data-intensive applications. The fused architecture systems offer an interesting trade off by increasing data transfer rates at the cost of some raw computational power. We characterize the performance of different memory paths that are possible because of the shared memory space present on the fused architecture. In addition, we provide a theoretical model which can be used to correctly predict the comparative performance of memory movement techniques for a given data-intensive application and system. In terms of global atomic memory contention, we show improvements in scalability and performance for global synchronization primitives by avoiding contentious global atomic memory accesses. In general, this work shows the importance of understanding the memory system of the GPU architecture to achieve better application performance. / Master of Science Data Transfer Performance Modeling GPGPU APU GPU Memory Systems
96	Throughput-oriented analytical models for performance estimation on programmable hardware accelerators / Analyse de performance potentielle d'une simulation de QCD sur réseau sur processeur Cell et GPU Lai, Junjie 15 February 2013 (has links) Durant cette thèse, nous avons principalement travaillé sur deux sujets liés à l'analyse de la performance GPU (Graphics Processing Unit - Processeur graphique). Dans un premier temps, nous avons développé une méthode analytique et un outil d'estimation temporel (TEG) pour prédire les performances d'applications CUDA s’exécutant sur des GPUs de la famille GT200. Cet outil peut prédire les performances avec une précision approchant celle des outils précis au cycle près. Dans un second temps, nous avons développé une approche pour estimer la borne supérieure des performances d'une application GPU, en se basant sur l'analyse de l'application et de son code assembleur. Avec cette borne, nous connaissons la marge d'optimisation restante, et nous pouvons décider des efforts d'optimisation à fournir. Grâce à cette analyse, nous pouvons aussi comprendre quels paramètres sont critiques à la performance. / In this thesis work, we have mainly worked on two topics of GPU performance analysis. First, we have developed an analytical method and a timing estimation tool (TEG) to predict CUDA application's performance for GT200 generation GPUs. TEG can predict GPU applications' performance in cycle-approximate level. Second, we have developed an approach to estimate GPU applications' performance upper bound based on application analysis and assembly code level benchmarking. With the performance upper bound of an application, we know how much optimization space is left and can decide the optimization effort. Also with the analysis we can understand which parameters are critical to the performance. GPGPU Multi-coeurs Processeurs graphiques GPGPU CUDA Fermi GPU Kepler GPU Performance upper bound Performance Prediction Performance Analysis
97	Accélération matérielle pour l’imagerie sismique : modélisation, migration et interprétation / Hardware acceleration for seismic imaging : modeling, migration and interpretation Abdelkhalek, Rached 20 December 2013 (has links) La donnée sismique depuis sa conception (modélisation d’acquisitions sismiques), dans sa phase de traitement (prétraitement et migration) et jusqu’à son exploitation pour en extraire les informations géologiques pertinentes nécessaires à l’identification et l’exploitation optimale des réservoirs d’hydrocarbures (interprétation), génère un volume important de calculs. Nous montrons dans ce travail de thèse qu’à chacune de ces étapes l’utilisation de technologies accélératrices de type GPGPU permet de réduire radicalement les temps de calcul tout en restant dans une enveloppe de consommation électrique raisonnable. Nous présentons et analysons les éléments sous-jacents à ces performances. L’importance de l’utilisation de motifs d’accès mémoire adéquats est particulièrement mise en exergue étant donné que l’accès à la mémoire représente le principal goulot d’étranglement pour les algorithmes abordés. Nous reportons des facteurs d’accélération de l’ordre de 40 pour la modélisation sismique par résolution de l’équation d’onde par différences finies (brique de base pour la modélisation et l’imagerie sismique) et entre 8 et 113 pour le calcul d’attributs sismiques. Nous démontrons que l’utilisation d’accélérateurs matériels élargit considérablement le champ du possible, aussi bien en imagerie sismique (modélisation de nouveaux types d’acquisitions à grande échelle) qu’en interprétation (calcul d’attributs complexes sur station de travail, paramétrage interactif des calculs, etc.). / During the seismic imaging workflow, from seismic modeling to interpretation, processingseismic data requires a massive amount of computation. We show in this work that, at eachstage of this workflow, hardware accelerators such as GPUs may help reducing the time requiredto process seismic data while staying at reasonable energy consumption levels.In this work, the key programming considerations needed to achieve good performance are describedand discussed. The importance of adapted in-memory data access patterns is particularlyemphasised since data access is the main bottleneck for the considered algorithms. When usingGPUs, speedup ratios of 40× are achieved for FDTD seismic modeling, and 8× up to 113× forseismic attribute computation compared to CPUs. Calcul haute performance GPGPU FPGA Imagerie sismique RTM Attributs sismiques High Performance Computing GPGPU FPGA Seismic Imaging RTM Seismic Attributes
98	[en] SOLVING LARGE SYSTEMS OF LINEAR EQUATIONS ON MULTI-GPU CLUSTERS USING THE CONJUGATE GRADIENT METHOD IN OPENCLTM / [pt] RESOLUÇÃO DE SISTEMAS DE EQUAÇÕES LINEARES DE GRANDE PORTE EM CLUSTERS MULTI-GPU UTILIZANDO O MÉTODO DO GRADIENTE CONJUGADO EM OPENCLTM ANDRE LUIS CAVALCANTI BUENO 27 September 2013 (has links) [pt] Sistemas de equações lineares esparsos e de grande porte aparecem como resultado da modelagem de vários problemas nas engenharias. Dada sua importância, muitos trabalhos estudam métodos para a resolução desses sistemas. Esta dissertação explora o potencial computacional de múltiplas GPUs, utilizando a tecnologia OpenCL, com a finalidade de resolver sistemas de equações lineares de grande porte. Na metodologia proposta, o método do gradiente conjugado é subdivido em kernels que são resolvidos por múltiplas GPUs. Para tal, se fez necessário compreender como a arquitetura das GPUs se relaciona com a tecnologia OpenCL a fim de obter um melhor desempenho. / [en] The process of modeling problems in the engineering fields tends to produce substantiously large systems of sparse linear equations. Extensive research has been done to devise methods to solve these systems. This thesis explores the computational potential of multiple GPUs, through the use of the OpenCL tecnology, aiming to tackle the solution of large systems of sparse linear equations. In the proposed methodology, the conjugate gradient method is subdivided into kernels, which are delegated to multiple GPUs. In order to achieve an efficient method, it was necessary to understand how the GPUs’ architecture communicates with OpenCL. [pt] GPGPU [en] GPGPU [pt] COMPUTACAO DE ALTO DESEMPENHO [en] HIGH PERFORMANCE COMPUTING [pt] METODO DO GRADIENTE CONJUGADO [pt] MULTI-GPU [pt] OPENCL
99	Optimisation semi-infinie sur GPU pour le contrôle corps-complet de robots / GPU-based Semi-Infinite Optimization for Whole-Body Robot Control Chrétien, Benjamin 08 July 2016 (has links) Un robot humanoïde est un système complexe doté de nombreux degrés de liberté, et dont le comportement est sujet aux équations non linéaires du mouvement. Par conséquent, la planification de mouvement pour un tel système est une tâche difficile d'un point de vue calculatoire. Dans ce mémoire, nous avons pour objectif de développer une méthode permettant d'utiliser la puissance de calcul des GPUs dans le contexte de la planification de mouvement corps-complet basée sur de l'optimisation. Nous montrons dans un premier temps les propriétés du problème d'optimisation, et des pistes d'étude pour la parallélisation de ce dernier. Ensuite, nous présentons notre approche du calcul de la dynamique, adaptée aux architectures de calcul parallèle. Cela nous permet de proposer une implémentation de notre problème de planification de mouvement sur GPU: contraintes et gradients sont calculés en parallèle, tandis que la résolution du problème même se déroule sur le CPU. Nous proposons en outre une nouvelle paramétrisation des forces de contact adaptée à notre problème d'optimisation. Enfin, nous étudions l'extension de notre travail au contrôle prédictif. / A humanoid robot is a complex system with numerous degrees of freedom, whose behavior is subject to the nonlinear equations of motion. As a result, planning its motion is a difficult task from a computational perspective.In this thesis, we aim at developing a method that can leverage the computing power of GPUs in the context of optimization-based whole-body motion planning. We first exhibit the properties of the optimization problem, and show that several avenues can be exploited in the context of parallel computing. Then, we present our approach of the dynamics computation, suitable for highly-parallel processing architectures. Next, we propose a many-core GPU implementation of the motion planning problem. Our approach computes the constraints and their gradients in parallel, and feeds the result to a nonlinear optimization solver running on the CPU. Because each constraint and its gradient can be evaluated independently for each time interval, we end up with a highly parallelizable problem that can take advantage of GPUs. We also propose a new parametrization of contact forces adapted to our optimization problem. Finally, we investigate the extension of our work to model predictive control. Planification de mouvements Robotique humanoïde Calcul parallèle Gpgpu Optimisation non linéaire Motion planning Humanoid robotics Parallel computing Gpgpu Nonlinear optimization
100	Utilização de técnicas de GPGPU em sistema de vídeo-avatar. / Use of GPGPU techniques in a video-avatar system. Tsuda, Fernando 01 December 2011 (has links) Este trabalho apresenta os resultados da pesquisa e da aplicação de técnicas de GPGPU (General-Purpose computation on Graphics Processing Units) sobre o sistema de vídeo-avatar com realidade aumentada denominado AVMix. Com o aumento da demanda por gráficos tridimensionais interativos em tempo real cada vez mais próximos da realidade, as GPUs (Graphics Processing Units) evoluíram até o estado atual, como um hardware com alto poder computacional que permite o processamento de algoritmos paralelamente sobre um grande volume de dados. Desta forma, É possível usar esta capacidade para aumentar o desempenho de algoritmos usados em diversas áreas, tais como a área de processamento de imagens e visão computacional. A partir das pesquisas de trabalhos semelhantes, definiu-se o uso da arquitetura CUDA (Computer Unified Device Architecture) da Nvidia, que facilita a implementação dos programas executados na GPU e ao mesmo tempo flexibiliza o seu uso, expondo ao programador o detalhamento de alguns recursos de hardware, como por exemplo a quantidade de processadores alocados e os diferentes tipos de memória. Após a reimplementação das rotinas críticas ao desempenho do sistema AVMix (mapa de profundidade, segmentação e interação), os resultados mostram viabilidade do uso da GPU para o processamento de algoritmos paralelos e a importância da avaliação do algoritmo a ser implementado em relação a complexidade do cálculo e ao volume de dados transferidos entre a GPU e a memória principal do computador. / This work presents the results of research and application of GPGPU (General-Purpose computation on Graphics Processing Units) techniques on the video-avatar system with augmented reality called AVMix. With increasing demand for interactive three-dimensional graphics rendered in real-time and closer to reality, GPUs (Graphics Processing Units) evolved to the present state as a high-powered computing hardware enabled to process parallel algorithms over a large data set. This way, it is possible to use this capability to increase the performance of algorithms used in several areas, such as image processing and computer vision. From the research of similar work, it is possible to define the use of CUDA (Computer Unified Device Architecture) from Nvidia, which facilitates the implementation of the programs that run on GPU and at the same time flexibilize its use, exposing to the programmer some details of hardware such as the number of processors allocated and the different types of memory. Following the reimplementation of critical performance routines of AVMix system (depth map, segmentation and interaction), the results show the viability of using the GPU to process parallel algorithms in this application and the importance of evaluating the algorithm to be implemented, considering the complexity of the calculation and the volume of data transferred between the GPU and the computer\'s main memory. Augmented reality CUDA CUDA Depth map GPGPU GPGPU Mapa de profundidade Parallel processing Processamento paralelo Realidade aumentada Video-avatar Vídeo-avatar

Search results