121

Exploring Performance Portability for Accelerators via High-level Parallel Patterns

Hou, Kaixi 27 August 2018 (has links)
Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units), and Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, these gains are closely followed by two main problems: (1) a complete redesign of existing code might be required when moving to a new parallel platform, a nightmare for developers; and (2) parallel code that executes efficiently on one platform might be inefficient, or even non-executable, on another, causing portability issues. To handle these problems, this dissertation proposes a general approach using parallel patterns, an effective abstraction layer that eases the generation of efficient parallel code for given algorithms across architectures. From algorithms to parallel patterns, we exploit domain expertise to analyze the computational and communication patterns in the core computations and represent them in a DSL (Domain-Specific Language) or as algorithmic skeletons. This preserves the essential information, such as data dependencies and types, for subsequent parallelization and optimization. From parallel patterns to actual code, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what the optimal instruction sequences are, how the implementation should change to match different architectures, and so on. We demonstrate our approach on several important computational kernels, including sort (and segmented sort), sequence alignment, and stencils, across various parallel platforms (CPUs, GPUs, and Intel Xeon Phi). / Ph. D.
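As a rough, CPU-side illustration of the pattern idea (plain NumPy, not the dissertation's code generators), the segmented sort named above can be phrased as a single high-level primitive — a lexicographic sort keyed on (segment id, value) — that a backend could then lower to SIMD, GPU, or Xeon Phi code:

```python
import numpy as np

def segmented_sort(values, seg_ids):
    """Sort values independently inside each contiguous segment using one
    vectorized primitive: a lexicographic sort keyed first on the segment
    id, then on the value itself."""
    order = np.lexsort((values, seg_ids))      # seg_ids is the primary key
    return values[order], seg_ids[order]

vals = np.array([3, 1, 2, 9, 7, 8])
segs = np.array([0, 0, 0, 1, 1, 1])
sorted_vals, sorted_segs = segmented_sort(vals, segs)
```

Expressed this way, the whole operation is one data-parallel sort, which is exactly the kind of pattern an automation framework can retarget per architecture.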
122

CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures

Martinez Arroyo, Gabriel Ernesto 02 September 2011 (has links)
The use of graphics processing units (GPUs) in high-performance parallel computing continues to become steadily more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and run-time system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is almost straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator built on a novel design and a clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA Runtime API, and we have successfully translated several applications from the CUDA SDK and the Rodinia benchmark suite. CU2CL's translation times are reasonable, allowing many applications to be translated at once. The number of manual changes required after running our translator on CUDA source is minimal, with some applications compiling and working with no changes at all. The performance of applications automatically translated by CU2CL is on par with their manually ported counterparts. / Master of Science
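CU2CL itself rewrites the Clang AST; purely as a toy sketch of the surface-level correspondence between the two APIs (the API names below are real, but this rename pass ignores the argument restructuring a real translator must perform, since the signatures do not match one-to-one):

```python
import re

# Toy table of CUDA Runtime API names and the OpenCL entry points a
# translator would map them to. The real CU2CL works on the Clang AST and
# also rewrites arguments; this is only a name-level illustration.
API_MAP = {
    "cudaMalloc": "clCreateBuffer",
    "cudaFree": "clReleaseMemObject",
    "cudaMemcpy": "clEnqueueReadBuffer",  # or clEnqueueWriteBuffer, by direction
}

def translate_names(cuda_src):
    """Replace whole-word CUDA API identifiers with OpenCL counterparts."""
    pattern = re.compile(r"\b(" + "|".join(API_MAP) + r")\b")
    return pattern.sub(lambda m: API_MAP[m.group(1)], cuda_src)

src = "cudaMalloc(&d_a, n); cudaMemcpy(h_a, d_a, n, cudaMemcpyDeviceToHost); cudaFree(d_a);"
out = translate_names(src)
```

Note that the word-boundary match leaves enum names like `cudaMemcpyDeviceToHost` untouched — one hint of why a purely textual approach breaks down and an AST-based design is needed.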
123

Power Analysis and Prediction for Heterogeneous Computation

Dutta, Bishwajit 12 February 2018 (has links)
Power, performance, and cost dictate the procurement and operation of high-performance computing (HPC) systems. These systems use graphics processing units (GPUs) for a performance boost. In order to identify systems that are inexpensive to acquire and inexpensive to operate, it is important to systematically compare such systems with respect to their power, performance, and energy characteristics on the end-use applications. Additionally, the chosen systems must often achieve performance objectives without exceeding their respective power budgets, a task usually borne by a software-based power-management system. Accurately predicting the power consumption of an application at different DVFS levels (or, more generally, different processor configurations) is paramount for the efficient functioning of such a management system. This thesis applies the state of the art in green-computing research to optimize the total cost of acquisition and ownership of heterogeneous computing systems. To achieve this, we take a two-fold approach. First, we explore the issue of greener device selection by characterizing device power and performance. For this, we explore previously untapped opportunities arising from a special type of graphics processor — the low-power integrated GPU — which is commonly available in commodity systems. We compare the greenness (power, energy, and energy-delay product (EDP)) of the integrated GPU against a CPU running at different frequencies for the specific application domain of scientific visualization. Second, we explore the problem of predicting the power consumption of a GPU at different DVFS states via machine-learning techniques.
Specifically, we perform statistically rigorous experiments to uncover the strengths and weaknesses of eight different machine-learning techniques (namely ZeroR, simple linear regression, KNN, bagging, random forest, SMO regression, decision trees, and neural networks) in predicting GPU power consumption at different frequencies. Our study shows that a support-vector-machine-aided regression model (i.e., SMO regression) achieves the highest accuracy, with a mean absolute error (MAE) of 4.5%. We also observe that the random forest method produces the most consistent results, with a reasonable overall MAE of 7.4%. Our results further show that different models operate best in distinct regions of the application space. We therefore develop a novel ensemble technique that draws on the best characteristics of the various algorithms, reducing the MAE to 3.5% and the maximum error from 20% (for SMO regression) to 11%. / MS
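The prediction-and-ensemble workflow can be sketched on synthetic data (NumPy only; the thesis uses measured GPU power data and eight learners, whereas this toy fabricates power as a function of frequency and utilization and combines just two simple learners with an averaging ensemble):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated stand-in for GPU power measurements: power grows with core
# frequency and utilization, plus measurement noise.
n = 400
freq = rng.uniform(0.5, 1.5, n)                 # GHz
util = rng.uniform(0.0, 1.0, n)                 # fraction of cores busy
power = 30 + 45 * freq + 60 * util + rng.normal(0, 2, n)   # watts

X = np.c_[np.ones(n), freq, util]
train, test = slice(0, 300), slice(300, n)

# Learner 1: ordinary least-squares linear regression.
w, *_ = np.linalg.lstsq(X[train], power[train], rcond=None)
pred_lin = X[test] @ w

# Learner 2: k-nearest-neighbours regression on the (freq, util) features.
def knn_predict(Xtr, ytr, Xte, k=5):
    d = ((Xte[:, None, 1:] - Xtr[None, :, 1:]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]
    return ytr[idx].mean(axis=1)

pred_knn = knn_predict(X[train], power[train], X[test])

mae = lambda p: np.abs(p - power[test]).mean()
pred_ens = 0.5 * (pred_lin + pred_knn)          # naive ensemble: average
```

The thesis's ensemble is smarter than a plain average — it picks the model that does best per region of the application space — but the shape of the pipeline (train several learners, score by MAE, combine) is the same.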
124

Sparse Matrix Belief Propagation

Bixler, Reid Morris 11 May 2018 (has links)
We propose sparse-matrix belief propagation, which executes loopy belief propagation in Markov random fields by replacing indexing over graph neighborhoods with sparse-matrix operations. This abstraction allows for seamless integration with optimized sparse linear algebra libraries, including those that perform matrix and tensor operations on modern hardware such as graphics processing units (GPUs). The sparse-matrix abstraction allows belief propagation to be implemented in a high-level language (e.g., Python) that is also able to leverage the power of GPU parallelization. We demonstrate sparse-matrix belief propagation by implementing it in a modern deep learning framework (PyTorch), measuring the resulting massive improvement in running time, and facilitating future integration into deep learning models. / Master of Science
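The core trick can be sketched without a GPU: the sum of incoming log-messages at every node becomes a single sparse matrix product, with no per-neighborhood indexing. A minimal NumPy/SciPy version on a three-node chain — where loopy BP is exact and can be checked against brute-force marginals — might look like this (the thesis does the same with PyTorch tensors on GPUs):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.special import logsumexp

# Tiny pairwise MRF: a 3-node chain 0 - 1 - 2 with binary states.
unary = np.log(np.array([[0.7, 0.3],
                         [0.5, 0.5],
                         [0.2, 0.8]]))          # log phi_i(x_i)
pair = np.log(np.array([[0.9, 0.1],
                        [0.1, 0.9]]))           # log psi(x_i, x_j), shared

# Directed edges; rev[e] indexes the opposite direction of edge e.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
rev = [1, 0, 3, 2]
N, E, K = 3, len(edges), 2

# Sparse "target" incidence: T[t, e] = 1 if edge e points into node t.
rows = [t for (_, t) in edges]
T = csr_matrix((np.ones(E), (rows, range(E))), shape=(N, E))

logM = np.zeros((E, K))                         # log-messages, init uniform
for _ in range(20):
    S = T @ logM                                # summed incoming log-messages per node
    new = np.empty_like(logM)
    for e, (s, t) in enumerate(edges):
        # product of messages into s, excluding the one coming back from t
        h = unary[s] + S[s] - logM[rev[e]]
        new[e] = logsumexp(pair + h[:, None], axis=0)
    logM = new - logsumexp(new, axis=1, keepdims=True)

B = unary + T @ logM                            # log-beliefs
B = np.exp(B - logsumexp(B, axis=1, keepdims=True))
```

The per-edge loop here is what the sparse-matrix formulation also vectorizes in the full method; only the neighborhood aggregation (`T @ logM`) is shown as a sparse product.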
125

Efficient Algorithms for Data Analytics in Geophysical Imaging

Kump, Joseph Lee 14 June 2021 (has links)
Modern sensing systems such as distributed acoustic sensing (DAS) can produce massive quantities of geophysical data, often in remote locations. This presents significant challenges for data storage and efficient analysis. To address this, we have designed and implemented efficient algorithms for two commonly utilized techniques in geophysical imaging: cross-correlation and multichannel analysis of surface waves (MASW). Our cross-correlation algorithms operate directly in the wavelet domain on compressed data, without requiring a reconstruction of the original signal, reducing memory costs and improving scalability. Meanwhile, our MASW implementations make use of MPI parallelism and GPUs, presenting a novel problem formulation for the GPU. / Master of Science / Modern sensor designs make it easier to collect large quantities of seismic vibration data. While this data can provide valuable insight, it is difficult to effectively store and analyze such a high data volume. We propose new, general-purpose algorithms that enable speedy use of two common methods in geophysical modeling and data analytics: cross-correlation, which provides a measure of similarity between signals, and multichannel analysis of surface waves, which is a seismic imaging technique. Our algorithms take advantage of hardware and software typically available on modern computers, and of the mathematical properties of these two methods.
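As a baseline sketch only — this is the classical FFT trick for fast cross-correlation, not the thesis's wavelet-domain-on-compressed-data algorithm — transform-domain correlation replaces O(n²) sliding dot products with O(n log n) transforms:

```python
import numpy as np

def xcorr_fft(a, b):
    """Full cross-correlation of two 1-D signals via the FFT.
    Zero-padding to at least len(a)+len(b)-1 avoids circular wraparound;
    the output is reordered to match np.correlate(a, b, mode='full'),
    i.e. lags from -(len(b)-1) to len(a)-1."""
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()            # next power of two
    A = np.fft.rfft(a, nfft)
    B = np.fft.rfft(b, nfft)
    c = np.fft.irfft(A * np.conj(B), nfft)      # circular cross-correlation
    return np.concatenate([c[-(len(b) - 1):], c[:len(a)]])

rng = np.random.default_rng(0)
a, b = rng.standard_normal(256), rng.standard_normal(256)
ref = np.correlate(a, b, mode="full")           # O(n^2) reference
fast = xcorr_fft(a, b)                          # O(n log n)
```

The wavelet-domain algorithms of the thesis push the same idea further: the correlation is computed in the compressed transform domain, so the original signals never need to be reconstructed.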
126

Multi-GPU Load Balancing for Simulation and Rendering

Hagan, Robert Douglas 04 August 2011 (has links)
GPU computing can significantly improve performance by taking advantage of the massive parallelism of GPUs for data-parallel applications. Computation in visualization applications is well suited to parallelization on the GPU, which can improve their performance and interactivity. If used effectively, multiple GPUs can lead to a significant speedup over a single GPU. However, the use of multiple GPUs requires memory management, scheduling, and load balancing to ensure that a program takes full advantage of the available processors. This work presents methods for data-driven and dynamic multi-GPU load balancing using a pipelined approach and a framework for use with different applications. Data-driven load balancing can improve utilization by taking into account past performance for different combinations of input parameters. The dynamic load-balancing method, based on buffer fullness, can adjust to workload changes at runtime to gain an additional performance improvement. This work provides a load-balancing framework that accounts for the differing characteristics of applications, and the implementation of a multi-GPU data structure allows these load-balancing methods to be used within the framework. The effectiveness of the framework is demonstrated with performance results from interactive visualization, which show a significant speedup due to load balancing. / Master of Science
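A minimal sketch of the dynamic idea (the `rebalance` helper below is hypothetical — the thesis monitors buffer fullness on real GPUs, while this toy redistributes workload shares from measured per-device times):

```python
def rebalance(shares, times, rate=0.5):
    """Move each device's share of the workload toward the share implied
    by its measured speed (work done per unit time); `rate` damps the
    adjustment so the balancer does not oscillate."""
    speeds = [s / t for s, t in zip(shares, times)]   # work units per second
    total = sum(speeds)
    target = [v / total for v in speeds]
    new = [s + rate * (t - s) for s, t in zip(shares, target)]
    norm = sum(new)
    return [s / norm for s in new]

# Simulate two devices where device 0 is 3x faster than device 1:
# shares should converge to roughly 0.75 / 0.25.
shares = [0.5, 0.5]
for _ in range(12):
    times = [shares[0] / 3.0, shares[1] / 1.0]   # time = work / speed
    shares = rebalance(shares, times)
```

Runtime feedback like this is what lets a dynamic balancer track workload changes that a static, data-driven assignment would miss.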
127

InP nanowires: absorption spectrum calculation via the k.p method

Campos, Tiago de 25 July 2013 (has links)
In recent years, advances in growth techniques have allowed the fabrication of high-quality single nanostructures with quantum confinement along the lateral directions. These quasi-one-dimensional structures, known as nanowires (NWs), have vast technological applications, such as biological and chemical nanosensors, photodetectors, and lasers. Applications involving NWs require an understanding of their optical and electronic properties, and therefore a deeper theoretical study is needed. The aim of this work is to provide theoretical calculations of the optical absorption of InP NWs, comparing the results for the zincblende (ZB) and wurtzite (WZ) crystal phases in their equivalent growth directions. We use a formulation of the k.p method that describes both crystal phases with the same Hamiltonian, the envelope function approximation, and the plane-wave expansion. The absorption power was calculated for transitions between the valence and conduction bands using Fermi's golden rule. Although the k.p method demands less computational effort than its ab initio counterparts, the matrices involved in the calculations can exceed a billion elements. To handle these matrices, we implemented an iterative eigensolver, LOBPCG, using the processing power available in current GPUs. The new solver showed considerable gains over direct diagonalization methods when tested on systems with confinement in a single direction; the lack of an adequate preconditioner, however, limits its use for NWs. The absorption calculations for ZB NWs showed an anisotropy of more than 90% in the absorption spectrum, while WZ NWs exhibited two distinct anisotropy regimes, governed by the appearance of an optically forbidden state at the top of the valence band. In summary, the results obtained with the theoretical model proposed in this study reproduce the optical properties reported in the literature, including the optically forbidden state observed in other WZ systems with strong quantum confinement.
128

Algorithms on the GPU for Visualization and Computations on Unstructured Grids

Buatois, Luc 16 May 2008 (has links)
De nombreux domaines utilisent à présent de nouveaux types de grilles composées de polyèdres arbitraires, autrement dit des grilles fortement non-structurées. La problématique de cette thèse concerne la définition de nouveaux outils de visualisation et de calcul sur de telles grilles. Pour la visualisation, cela pose à la fois le problème du stockage et de l'adaptativité des algorithmes à une géométrie et une topologie variables. Pour le calcul, cela pose le problème de la résolution de grands systèmes linéaires creux non-structurés. Pour aborder ces problèmes, l'augmentation incessante de la puissance de calcul parallèle des processeurs graphiques nous fournit de nouveaux outils. Toutefois, l'utilisation de ces GPU nécessite de définir de nouveaux algorithmes adaptés aux modèles de programmation parallèle qui leur sont spécifiques. Nos contributions sont les suivantes : (1) Une méthode générique de visualisation tirant partie de la puissance de calcul des GPU pour extraire des isosurfaces à partir de grandes grilles fortement non-structurées. (2) Une méthode de classification de cellules qui permet d'accélérer l'extraction d'isosurfaces grâce à une pré-sélection des seules cellules intersectées. (3) Un algorithme d'interpolation temporelle d'isosurfaces. Celui-ci permet de visualiser de manière continue dans le temps l'évolution d'isosurfaces. (4) Un algorithme massivement parallèle de résolution de grands systèmes linéaires non-structurés creux sur le GPU. L'originalité de celui-ci concerne son adaptation à des matrices de motif arbitraire, ce qui le rend applicable à n'importe quel système creux, dont ceux issus de maillages fortement non-structurés / This thesis proposes new tools for visualization and computation on strongly unstructured grids. Visualization of such grids that have variable geometry and topology, poses the problem of how to store data and how algorithms could handle such variability. 
Doing computations on such grids poses the problem of solving large sparse unstructured linear systems. The ever-growing parallel power of GPUs makes them more and more valuable for handling theses tasks. However, using GPUs calls for defining new algorithms highly adapted to their specific programming model. Most recent algorithms for Geometry Processing or Computational Fluid Dynamics (CFD) are using new types of grids made of arbitrary polyhedra, in other words strongly unstructured grids. In case of CFD simulations, these grids can be mapped with scalar or vector fields representing physical properties (for example : density, porosity, permeability). Our contributions are: (1) An efficient generic visualization method that uses GPU's power to accelerate isosurface extraction for large unstructured grids. (2) An adaptative cell classification method that accelerates isosurface extraction by pre-selecting only intersected cells. (3) An efficient algorithm for temporal interpolation of isosurfaces. This algrithm helps to visualize in a continuous maner the evolution of isosurfaces through time. (4) A massively parallel algorithm for solving large sparse unstructured linear systems on the GPU. Its originality comes from its adaptation to sparse matrices with random pattern, which enables to solve any sparse linear system, thus the ones that come from strongly unstructured grids
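Contribution (4) targets sparse systems with arbitrary patterns; as a CPU-side sketch of the problem setting (SciPy's conjugate gradient on a randomly patterned symmetric positive-definite matrix — an illustration of the problem class, not the thesis's GPU solver):

```python
import numpy as np
from scipy.sparse import random as sprandom, identity
from scipy.sparse.linalg import cg

# Random-pattern SPD sparse system, standing in for one assembled on a
# strongly unstructured grid (no banded or block structure assumed).
rng = np.random.default_rng(2)
n = 500
A = sprandom(n, n, density=0.01, random_state=2, format="csr")
A = A @ A.T + identity(n)                      # make it symmetric positive definite
b = rng.standard_normal(n)

# Conjugate gradient touches A only through sparse matrix-vector products,
# which is why arbitrary-pattern solvers map well onto the GPU.
x, info = cg(A, b, maxiter=1000)
```

The point of the sketch: nothing in the solver depends on the sparsity pattern, so a GPU implementation only needs a fast, pattern-agnostic sparse matrix-vector product — the hard part the thesis addresses.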
129

GPU Acceleration of Robotic-System Services Focused on Real-Time Processing of 3D Point Clouds

Christino, Leonardo Milhomem Franco 03 February 2016 (has links)
This master's project, abbreviated as GPUServices, fits in the context of research and development of methods for processing three-dimensional sensor data in mobile robotics. Such methods are called services in this project; they include 3D point-cloud preprocessing algorithms with data segmentation, the separation and identification of planar zones (ground, roads), and the detection of elements of interest (curbs, obstacles). Due to the large amount of data to be processed in a short time, these services use GPU parallel processing to perform partial or complete processing of the data. The target application area is to provide services for an ADAS system — autonomous, intelligent vehicles — which must approach real-time processing because of the autonomous-driving context. The services are divided into stages according to the project methodology, always striving for acceleration through the inherent parallelism: a pre-project stage organizes a development environment able to coordinate all the technologies used, exploit parallelism, and integrate with the system already used by the autonomous car; the first service intelligently extracts the data from the sensor used in the project (a Velodyne multi-beam laser sensor), which proves necessary because of numerous reading errors and the raw receiving format, and provides the data in a matrix structure; the second service, in cooperation with the first, corrects the spatial destabilization of the sensor caused by a mounting base that is not perfectly parallel to the ground and by the vehicle's suspension; the third service separates the environment into semantic zones, such as the ground plane and the regions below and above the ground; the fourth service, similar to the previous one, performs a pre-segmentation of the street curbs; the fifth service segments the objects in the environment, separating them into blobs; and the sixth service uses all the previous ones to detect and segment the street curbs. The data received from the sensor form a 3D point cloud with great potential for exploiting parallelism based on the locality of the information. Its major difficulty, however, is the high data rate received from the sensor (around 700,000 points/sec), which is the motivation of this project: to use the full potential of the sensor efficiently through the parallelism of GPU programming, thereby providing data-processing services to the user and making the implementation of ADAS systems easier and/or faster.
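The third service (separating ground, below-ground, and above-ground zones) can be caricatured in a few lines of NumPy — a least-squares plane fit to the lowest points plus residual thresholds, a deliberately simplified stand-in for the actual GPU service:

```python
import numpy as np

def split_ground(points, tol=0.05):
    """Separate an N x 3 point cloud into ground / above / below zones:
    fit a plane z = a*x + b*y + c to the lowest points (coarse ground
    candidates), then threshold every point's residual against it."""
    z = points[:, 2]
    low = points[z <= np.percentile(z, 40)]      # coarse ground candidates
    A = np.c_[low[:, :2], np.ones(len(low))]
    coef, *_ = np.linalg.lstsq(A, low[:, 2], rcond=None)
    resid = z - np.c_[points[:, :2], np.ones(len(points))] @ coef
    return np.abs(resid) <= tol, resid > tol, resid < -tol

# Synthetic scene: a slightly noisy ground plane plus box-like obstacles.
rng = np.random.default_rng(3)
ground_pts = np.c_[rng.uniform(-10, 10, (300, 2)), rng.normal(0, 0.005, 300)]
obstacles = np.c_[rng.uniform(-10, 10, (60, 2)), rng.uniform(0.5, 1.5, 60)]
cloud = np.vstack([ground_pts, obstacles])
ground, above, below = split_ground(cloud)
```

Every step here — percentile, least squares, residual thresholding — is a data-parallel operation over the whole cloud, which is what makes the real service amenable to the GPU at 700,000 points/sec.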
130

Accelerating Computations in Theoretical Chemistry: The Example of Graphics Processors

Rubez, Gaëtan 06 December 2018 (has links)
Nous nous intéressons à l'utilisation de la technologie manycore des cartes graphiques dans le cadre de la Chimie théorique. Nous soutenons la nécessité pour ce domaine d'être capable de tirer profit de cette technologie. Nous montrons la faisabilité et les limites de l'utilisation de cartes graphiques en Chimie théorique par le portage sur GPU de deux méthodes de calcul en modélisation moléculaire. Ces deux méthodes n’intégrerons ultérieurement au programme de docking moléculaire AlgoGen. L'accélération et la performance énergétique ont été examinées au cours de ce travail.Le premier programme NCIplot implémente la méthodologie NCI qui permet de détecter et de caractériser les interactions non-covalentes dans un système chimique. L'approche NCI se révèle être idéale pour l'utilisation de cartes graphiques comme notre analyse et nos résultats le montrent. Le meilleur portage que nous avons obtenu, a permis de constater des facteurs d'accélération allant jusqu'à 100 fois plus vite par rapport au programme NCIplot. Nous diffusons actuellement librement notre portage GPU : cuNCI.Le second travail de portage sur GPU se base sur GAMESS qui est un logiciel complexe de portée internationale implémentant de nombreuses méthodes quantiques. Nous nous sommes intéressés à la méthode combinée DFTB/FMO/PCM pour le calcul quantique de l'énergie potentielle d'un complexe. Nous sommes intervenus dans la partie du programme calculant l'effet du solvant. Ce cas s'avère moins favorable à l'utilisation de cartes graphiques, cependant nous avons su obtenir une accélération. / In this research work we are interested in the use of the manycore technology of graphics cards in the framework of approaches coming from the field of Theoretical Chemistry. We support the need for Theoretical Chemistry to be able to take advantage of the use of graphics cards. 
We show the feasibility as well as the limits of the use of graphics cards in the framework of the theoretical chemistry through two usage of GPU on different approaches.We first base our research work on the GPU implementation of the NCIplot program. The NCIplot program has been distributed since 2011 by Julia CONTRERAS-GARCIA implementing the NCI methodology published in 2010. The NCI approach is proving to be an ideal candidate for the use of graphics cards as demonstrated by our analysis of the NCIplot program, as well as the performance achieved by our GPU implementations. Our best implementation (VHY) shows an acceleration factors up to 100 times faster than the NCIplot program. We are currently freely distributing this implementation in the cuNCI program.The second GPU accelerated work is based on the software GAMESS-US, a free competitor of GAUSSIAN. GAMESS is an international software that implements many quantum methods. We were interested in the simultaneous use of DTFB, FMO and PCM methods. The frame is less favorable to the use of graphics cards however we have been able to accelerate the part carried by two K20X graphics cards.
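The quantity at the heart of the NCI methodology is the reduced density gradient s(r); as a small NumPy sketch of its grid evaluation (a toy analytic density, not a real molecular one, stands in for the input):

```python
import numpy as np

def reduced_density_gradient(rho, spacing):
    """s(r) = |grad rho| / (2 (3 pi^2)^(1/3) rho^(4/3)): the dimensionless
    quantity NCI evaluates on a 3-D grid. Regions of low s at low density
    flag non-covalent interactions."""
    grads = np.gradient(rho, spacing)           # finite-difference gradient per axis
    gnorm = np.sqrt(sum(g * g for g in grads))
    c = 2.0 * (3.0 * np.pi ** 2) ** (1.0 / 3.0)
    return gnorm / (c * rho ** (4.0 / 3.0))

# Toy density decaying along x only, rho = exp(-x), on a small 3-D grid;
# the analytic s for this density is exp(x/3) / (2 (3 pi^2)^(1/3)).
h = 0.05
x = np.arange(0.0, 2.0, h)
rho = np.exp(-x)[:, None, None] * np.ones((1, 8, 8))
s = reduced_density_gradient(rho, h)
```

Since s is computed pointwise and the gradient is a stencil, every grid point is independent — the embarrassingly parallel structure behind the 100x GPU speedups reported above.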
