Global ETD Search

411	OPTIMIZATIONS FOR N-BODY PROBLEMS ON HETEROGENOUS SYSTEMS Jianqiao Liu (6636020) 14 May 2019 N-body problems, such as simulating the motion of stars in a galaxy and evaluating spatial statistics through the n-point correlation function, are widely studied. Naive approaches to n-body problems are typically O(n^2) algorithms. Tree codes take advantage of the fact that a group of bodies can be skipped or approximated as a single unit if it is far enough from the body being evaluated. This reduces the complexity from O(n^2) to O(n log n). However, tree codes rely on pointer chasing and contain many branch instructions. These are highly irregular and thus prevent tree codes from being easily parallelized. GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires the code to be carefully structured to deal with the limitations of the SIMT execution model. This dissertation focuses on optimizations for n-body problems on heterogeneous systems. A general inspector-executor based framework is proposed to automatically schedule GPU threads to achieve high performance. Essentially, the framework lets the GPU execute part of the tree code and profile thread behavior, then has the CPU reorganize these threads to minimize divergence before executing the remaining portion of the traversals on the GPU. We apply this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases our hybrid approach delivers better performance even than GPU code that uses hand-tuned, application-specific scheduling. For large-scale input, ChaNGa is the best-of-breed n-body platform. It uses an asymptotically efficient tree traversal strategy known as a dual-tree walk to quickly provide an accurate simulation result. On GPUs, ChaNGa uses a hybrid strategy where the CPU performs the tree walk to determine which bodies interact while the GPU performs the force computation. In this dissertation, we show that a highly optimized single-tree walk approach is able to achieve better GPU performance by significantly accelerating the tree walk and reducing CPU/GPU communication. Our experiments show that this new design can achieve an 8.25x speedup over baseline ChaNGa using a one-node, one-process-per-node configuration. We also point out that ChaNGa's implementation does not satisfy the inclusion condition, so the GPU-centric remote tree walk does not perform well. Computer Engineering optimization algorithms GPU Heterogeneous system N-body problems Tree traversal
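The idea behind tree codes in record 411 (skip or approximate a distant group of bodies as one unit) can be illustrated with a small sketch. This is not the dissertation's code: it is a one-dimensional toy in Python, and the names `Node`, `build`, `potential`, and the opening parameter `theta` are assumptions chosen for illustration. A node is approximated by its centre of mass when its extent is small relative to its distance from the evaluation point.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: float                      # left edge of the interval covered by this node
    hi: float                      # right edge of the interval
    mass: float                    # total mass of the bodies below this node
    com: float                     # centre of mass of those bodies
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(bodies: List[Tuple[float, float]]) -> Node:
    """Build a binary tree over (position, mass) pairs sorted by position."""
    mass = sum(m for _, m in bodies)
    com = sum(x * m for x, m in bodies) / mass
    node = Node(bodies[0][0], bodies[-1][0], mass, com)
    if len(bodies) > 1:
        mid = len(bodies) // 2
        node.left = build(bodies[:mid])
        node.right = build(bodies[mid:])
    return node


def potential(node: Node, x: float, theta: float) -> float:
    """Sum of m / |x - x_j|, approximating far-away groups by their centre of mass."""
    dist = abs(x - node.com)
    if node.left is None:
        # Leaf holding a single body; skip the body located exactly at x.
        return node.mass / dist if dist > 0 else 0.0
    if node.hi - node.lo < theta * dist:
        # Opening criterion met: treat the whole group as one body.
        return node.mass / dist
    return potential(node.left, x, theta) + potential(node.right, x, theta)
```

With `theta = 0` no group is ever approximated and the traversal reduces to the direct sum; larger values of `theta` trade accuracy for fewer visited nodes. The data-dependent branch on the opening criterion is also the kind of irregular control flow the abstract identifies as an obstacle to parallelization.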
412	Using Graphical Processors to Implement Radio Base Station Control Plane Functions / Implementera radiobasstationers kontrollplans funktioner med grafikprocessor Ringman, Noak January 2019 Today more devices are being connected to the Internet via mobile networks. With more devices in mobile networks, the workload on radio base stations increases. Radio base stations must be energy efficient and cheap, which makes high-performance central processing units (CPUs) a poor choice for meeting the increasing workload. An alternative could be a graphics processing unit (GPU), which has a different hardware architecture that is more suitable for data-parallel problems. This thesis investigated the parallelisation possibilities in the user-equipment handling part of radio base stations, with the aim of using a GPU to take advantage of the parallelism. The investigation found a mix of pipeline and data parallelism in user-equipment handling, a form of parallelism suitable for execution on a GPU. The tasks which handle user-equipment were divided into smaller communication-free sub-tasks. Sub-task batches of user-equipment were collected and offloaded to a GPU. A peak throughput gain of 62.2 times over the single-threaded CPU was achieved, but at the cost of latency increasing by more than an order of magnitude. For all workloads, the latency of the GPU implementations was at least 1.24 times higher than that of the CPU implementations. A radio base station with many more user-equipment than the ones existing today was simulated. For this radio base station, a gain of 14.0 times over the single-threaded CPU was achieved, while the latency increased by 2.4 times. To really make use of a GPU implementation, the number of user-equipment (the load) must be higher than in radio base stations existing today. Mobile networks Radio base stations Control plane GPU GPGPU CUDA OpenCL Computer Engineering Datorteknik
413	Implementations of the FFT algorithm on GPU Sreehari, Ambuluri January 2012 The fast Fourier transform (FFT) plays an important role in digital signal processing (DSP) applications, and its implementation involves a large number of computations. Many DSP designers have been working on implementations of FFT algorithms on different devices, such as the central processing unit (CPU), field-programmable gate array (FPGA), and graphics processing unit (GPU), in order to accelerate performance. We selected the GPU for our implementations of the FFT algorithm because its hardware has a highly parallel structure, consisting of many hundreds of small parallel processing units. Such a parallel device can be programmed with the parallel programming language CUDA (Compute Unified Device Architecture). In this thesis, we propose different implementations of the FFT algorithm on NVIDIA GPUs using the CUDA programming language. We study and analyze the different approaches and use different techniques to accelerate the computation of the FFT. We also discuss the results and compare the different approaches and techniques. Finally, we compare our best results with the CUFFT library, a library specifically for computing the FFT on NVIDIA GPUs. GPU CUDA FFT Annan elektroteknik och elektronik
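For readers unfamiliar with the algorithm named in record 413, the following is a minimal sketch of a recursive radix-2 FFT checked against a naive O(n^2) discrete Fourier transform. It is written in Python for brevity and is not one of the thesis's CUDA implementations; the function names `dft` and `fft` are assumptions for illustration, and the input length is assumed to be a power of two.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2])            # transform of even-indexed samples
    odd = fft(x[1::2])             # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine one even and one odd term with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The butterflies inside the loop are independent of one another, which is what makes each stage of the transform a natural fit for many parallel GPU threads.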
414	GigaVoxels : un pipeline de rendu basé Voxel pour l'exploration efficace de scènes larges et détaillées / GigaVoxels : a Voxel-Based Rendering Pipeline For Efficient Exploration Of Large and Detailed Scenes Crassin, Cyril 12 July 2011 In this thesis, we present a new approach to efficiently render large scenes and detailed objects in real time. Our approach is based on a new volumetric pre-filtered geometry representation and an associated voxel-based approximate cone tracing that allows accurate, high-performance rendering with high-quality filtering of highly detailed geometry. To make this voxel representation a standard real-time rendering primitive, we propose a new GPU-based approach designed entirely to scale to the rendering of very large volumetric datasets. Our system achieves real-time rendering performance for several billion voxels. Our data structure exploits the fact that in CG scenes, details are often concentrated on the interface between free space and clusters of density, and shows that volumetric models might become a valuable alternative as a rendering primitive for real-time applications. In this spirit, we allow a quality/performance trade-off and exploit temporal coherence. Our solution is based on a hierarchical data representation that adapts to the current view and to occlusion information, coupled with an efficient ray-casting rendering algorithm. We introduce a new GPU cache mechanism that provides very efficient paging of data in video memory and is implemented as a very efficient data-parallel process. This cache is coupled with a data production pipeline able to dynamically load data from main memory or produce voxels directly on the GPU. One key element of our method is to guide data production and caching in video memory directly from data requests and usage information emitted during rendering. We demonstrate our approach with several applications. We also show how our pre-filtered geometry model and approximate cone tracing can be used to very efficiently achieve blurry effects and real-time indirect lighting. Voxels Rendu Matériel graphique Visibilité Scènes volumiques Voxels Rendering GPU Visibility Volume scenes 510
415	The Study of Energy Consumption of Acceleration Structures for Dynamic CPU and GPU Ray Tracing Chang, Chen Hao Jason 08 January 2007 Battery life has been the slowest growing resource on mobile systems for several decades. Although much work has been done on designing new chips and peripherals that use less energy, there has not been much work on reducing energy consumption by removing energy-intensive tasks from graphics algorithms. In our work, we focus on the energy consumption of the ray tracing task because it is a resource-intensive, global-illumination algorithm. We focus our effort on ray tracing dynamic scenes, and thus concentrate on identifying the major elements that determine the energy consumption of acceleration structures. We believe acceleration structures are critical in reducing energy consumption because they need to be built inexpensively, but must also be complex enough to boost rendering speed. We conducted tests on a Pentium 1.6 GHz laptop with a GeForce Go 6800 GPU. In our experiments, we investigated various elements that modify the acceleration structure build algorithm, and we compared the energy usage of CPU and GPU rendering with different acceleration structures. Furthermore, the energy per frame when ray tracing dynamic scenes was gathered and compared to identify the acceleration structure that provides the best balance between building energy consumption and rendering energy consumption. We found the bounding volume hierarchy to be the best acceleration structure when rendering dynamic scenes with the GPU on our test system. A bounding volume hierarchy is not the least expensive structure to build, but it can be rendered cheaply on the GPU while introducing acceptable energy overhead when rebuilding. In addition, we found the fastest algorithm was also the least expensive in terms of energy consumption. We propose an energy model based on this finding. 
Battery Energy Ray Tracing GPU Dynamic Scene Acceleration Structure Computer graphics Energy consumption Ray tracing
416	Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems Charara, Ali 24 May 2018 Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covariance matrices are employed in climate/weather modeling for the maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise produced by the adaptive optics instruments and atmospheric turbulence. The structure of these covariance matrices is dense, symmetric, positive-definite, and often data-sparse, therefore, hierarchically of low-rank. This thesis investigates the performance limit of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, and in the context of the aforementioned applications. We employ recursive formulations of some of the basic linear algebra subroutines (BLAS) to accelerate the covariance matrix computation further, while reducing data traffic across the memory subsystems layers. However, dealing with large data sets (i.e., covariance matrices of billions in size) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank data format (TLR), a new compressed data structure and layout, which is valuable in exploiting data sparsity by approximating the operator. The TLR compressed data structure allows approximating the original problem up to user-defined numerical accuracy. This comes at the expense of dealing with tasks with much lower arithmetic intensities than traditional dense computations. In fact, this thesis consolidates the two trends of dense and data-sparse linear algebra for HPC. 
Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix algorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce the overhead of the API. The reported performance of the dense and TLR-Cholesky factorizations shows many-fold speedups against state-of-the-art implementations on various systems equipped with GPUs. Additionally, the TLR implementation gives the user flexibility to select the desired accuracy. This trade-off between performance and accuracy is currently a well-established leading trend in the convergence of the third and fourth paradigms, i.e., HPC and Big Data, when moving forward with the exascale software roadmap. data sparse Hierarchical covariance matrix GPU tile low-rank Dense Linear Algebra
417	Analise dos efeitos de falhas transientes no conjunto de banco de registradores em unidades gráficas de processamento / Evaluation of transient fault effect in the register files of graphics processing units Nedel, Werner Mauricio January 2015 Graphics Processing Units (GPUs) are specialized massively parallel devices that are widely used because of their high processing capability at relatively low operating cost. Their ability to manipulate large blocks of memory simultaneously makes them suitable for compute-intensive problems such as air traffic control, academic research, and image processing. The term General-Purpose Graphics Processing Unit (GPGPU) designates the use of GPUs in applications commonly handled by Central Processing Units (CPUs). The rapid proliferation of GPUs, together with the advent of a user-friendly programming model, has led programmers to use such devices in safety-critical applications, such as automotive, space, and medical systems. This growing use of GPUs has pushed developers to explore their parallel architecture and to propose new implementations of such devices. The FLEXible GRaphIcs Processor (FlexGrip) is an example of a GPGPU implemented on a Field Programmable Gate Array (FPGA); it is fully compatible with programs compiled for GPUs and has the advantage that the architecture can be customized to the user's needs. The increasing demand for computational power has pushed GPUs to be built in cutting-edge technology, down to a 28nm fabrication process for the latest NVIDIA devices, with operating clock frequencies up to 1GHz. The increases in operating frequency and transistor density, combined with the reduction of supply voltages, have made transistors more susceptible to faults caused by radiation. The programming model adopted by GPUs makes constant accesses to memories and registers, making these devices sensitive to transient perturbations in their stored values. These perturbations are called Single Event Upsets (SEUs), or bit-flips, and may cause errors in the final result of the application. The main goal of this work is to present a model for injecting SEU-type transient faults into the main register files of the FlexGrip GPGPU and to study its behavior in the presence of SEUs across a range of applications. The distribution of computational resources of the GPU and its impact on GPU reliability is also explored, as well as the characterization of the errors observed in the fault injection campaigns. The results can indicate efficient GPU configurations for achieving reliability in the presence of SEUs. Microeletrônica Processamento : Imagem Simulação computacional GPU Parallel processing High performance Fault tolerance
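The notion of a Single Event Upset in record 417 can be made concrete with a small software model. The sketch below is an assumption-laden illustration in Python, not the fault injection model used on FlexGrip: the register file is a plain list of integers, and `flip_bit`, `inject_seu`, and `run_campaign` are hypothetical helper names. Each trial flips one random bit in one random register and checks whether the output of a "kernel" function differs from the fault-free (golden) result.

```python
import random
from typing import Callable, List, Sequence, Tuple


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with a single bit inverted, truncated to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_seu(registers: List[int], rng: random.Random, width: int = 32) -> Tuple[int, int]:
    """Flip one random bit of one random register in place; return (index, bit)."""
    index = rng.randrange(len(registers))
    bit = rng.randrange(width)
    registers[index] = flip_bit(registers[index], bit, width)
    return index, bit


def run_campaign(
    kernel: Callable[[List[int]], object],
    inputs: Sequence[int],
    trials: int,
    seed: int = 0,
) -> int:
    """Count how many single-bit faults change the kernel's output."""
    rng = random.Random(seed)
    golden = kernel(list(inputs))          # fault-free reference result
    errors = 0
    for _ in range(trials):
        regs = list(inputs)                # fresh copy for every injection
        inject_seu(regs, rng)
        if kernel(regs) != golden:
            errors += 1
    return errors
```

For example, `run_campaign(sum, [1, 2, 3, 4], 100)` reports an error for every injection, because any single-bit change alters the sum, whereas a kernel that reads only one register masks faults injected into the others. Comparing such error counts across configurations is the general shape of the reliability analysis the abstract describes.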
418	Exploring the dynamic radio sky with many-core high-performance computing Malenta, Mateusz January 2018 As new radio telescopes and processing facilities are being built, the amount of data that has to be processed is growing continuously. This poses significant challenges, especially if real-time processing is required, which is important for surveys looking for poorly understood objects, such as Fast Radio Bursts, where quick detection and localisation can enable rapid follow-up observations at different frequencies. With the data rates increasing all the time, new processing techniques using the newest hardware, such as GPUs, have to be developed. A new pipeline, called PAFINDER, has been developed to process data taken with a phased array feed, which can generate up to 36 beams on the sky, with data rates of 25 GBps per beam. With the majority of work done on GPUs, the pipeline reaches real-time performance when generating filterbank files used for offline processing. Full real-time processing, including single-pulse searches, has also been implemented and has been shown to perform well under favourable conditions. The pipeline was successfully used to record and process data containing observations of RRAT J1819-1458 and positions on the sky where 3 FRBs have been observed previously, including the repeating FRB121102. Detailed examination of J1819-1458 single-pulse detections revealed a complex emission environment with pulses coming from three different rotation phase bands and a number of multi-component emissions. No new FRBs and no repeated bursts from FRB121102 have been detected. The GMRT High Resolution Southern Sky survey observes the sky at high galactic latitudes, searching for new pulsars and FRBs. 127 hours of data have been searched for the presence of any new bursts, with the help of a new pipeline developed for this survey. 
No new FRBs have been found, possibly because of severe RFI contamination that was not fully removed, despite new techniques being developed and combined with existing solutions to mitigate its effects. Using the best estimates on the total amount of data that has been processed correctly, obtained using new single-pulse simulation software, no detections were found to be consistent with the expected rates for standard candle FRBs with a flat or positive spectrum. 500
419	Implementações de algoritmos paralelos da subsequência máxima e da submatriz máxima em GPU / Implementations of parallel maximum subsequence and maximum submatrix algorithms on GPU Luz, Cleber Silva Ferreira da January 2013 Advisor: Siang Wun Song / Dissertation (master's) - Universidade Federal do ABC. Graduate Program in Computer Science, 2013 CUDA GPU Subsequência máxima submatriz máxima
420	Uma biblioteca para desenvolvimento de aplicações CUDA em aglomerados de GPUS / A library for developing CUDA applications on GPU clusters Morais Junior, Aderbal de January 2013 Advisor: Raphael Yokoingawa de Camargo / Dissertation (master's) - Universidade Federal do ABC. Graduate Program in Computer Science, 2013 CUDA MPI GPU CLUSTER AGLOMERADO
