Global ETD Search

121	GPU Volume Voxelization : Exploration of the performance characteristics of different GPU-based implementations Glukhov, Grigory, Soltan, Aleksandra January 2019 In recent years, voxel-based modelling has been reintroduced to computer game development thanks to massive improvements in graphics hardware. Nevertheless, polygons continue to be the default building block of 3D objects, which creates a need to transform polygon meshes into voxel-based models; this process is known as voxelization. Efficient voxelization algorithms take advantage of the flexibility and control offered by modern, programmable GPU pipelines. However, the variability in possible approaches raises the question of how different GPU-based implementations affect voxelization performance. This thesis explores the impact of GPU-based improvements by comparing four different implementations of a solid voxelization algorithm. The implementations include a naive transition from the CPU to the GPU, a non-branching execution path approach, data pre-processing, and a combination of the two previous approaches. Benchmarking experiments run on four standard polygonal models and three graphics cards (NVIDIA and AMD) provide runtime and memory usage data for each implementation. A comparative analysis is performed on the basis of this data to determine the performance impact of the GPU-based adjustments to the voxelization algorithm implementation. Results indicate that the non-branching execution path approach yields clear improvements over the naive implementation, while data pre-processing has inconsistent performance and a large initial performance cost; the combination of the two improvements, unsurprisingly, leads to mixed results. Therefore, the final recommendation is to use the non-branching execution path technique for GPU-based improvements. voxelization GPU GPGPU SIMT thread divergence Vulkan API Computer and Information Sciences
122	Register Caching for Energy Efficient GPGPU Tensor Core Computing / Registrera cachelagring för energieffektiv GPGPU Tensor Core Computing Qian, Qiran January 2023 The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its adoption of the Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hides the latencies exposed in single-thread execution. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA's Tensor Core (TC), in order to deliver substantially higher throughput for dense matrix computations compared with traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching, borrowed from the realm of CPUs. Register caching captures temporal locality among register operands to reduce energy consumption within a two-level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction places intensive operand-delivery traffic on the register file banks. In this study, we delve into the RC design trade-offs in GPGPUs. We undertake a comprehensive exploration of the design space, encompassing a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved.
Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for the destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations. Computer Architecture GPGPU Tensor Core GEMM Energy Efficiency Register File Cache Instruction Scheduling Computer and Information Sciences
123	Simulation de fluides, approche lagrangienne Wattez, Adrien January 2014 With the widespread use of computer graphics in the entertainment industry, demand for increasingly realistic fluid simulation scenes has grown strongly over the last two decades. We present a number of elements relevant to simulating fluids, focused mainly on the Lagrangian approach (particle methods). The aim of this work is therefore to study and develop techniques for reproducing fluid behaviour based on the particle view of the fluid. Algorithms from recent years provide a significant performance gain, making it possible to simulate incompressible fluids in real time. The use of piecewise constant kernels, a new numerical tool, within so-called Lagrangian fluid simulations is also discussed. With the continuing growth in computing power and new advances such as GPGPU programming, we also show how to obtain an efficient neighbour search that greatly increases computational performance. SPH Fluid simulation Lagrangian method Navier-Stokes equations GPGPU CUDA Piecewise constant kernels
124	Improving Visualisation of Large Multi-Variate Datasets: New Hardware-Based Compression Algorithms and Rendering Techniques Chernoglazov, Alexander Igorevich January 2012 Spectral computed tomography (CT) is a novel medical imaging technique that involves simultaneously counting photons at several energy levels of the x-ray spectrum to obtain a single multi-variate dataset. Visualisation of such data poses significant challenges due to its extremely large size and the need for interactive performance for scientific and medical end-users. This thesis explores the properties of spectral CT datasets and presents two algorithms for GPU-accelerated real-time rendering from compressed spectral CT data formats. In addition, we describe an optimised implementation of a volume raycasting algorithm on modern GPU hardware, tailored to the visualisation of spectral CT data. visualisation compression volume rendering gpu gpgpu cuda spectral ct ct rendering optimisation
125	ModPET: Novel Applications of Scintillation Cameras to Preclinical PET Moore, Stephen K. January 2011 We have designed, developed, and assessed a novel preclinical positron emission tomography (PET) imaging system named ModPET. The system was developed using modular gamma cameras, originally developed for SPECT applications at the Center for Gamma Ray Imaging (CGRI), but configured for PET imaging by enabling coincidence timing. A pair of cameras is mounted on a flexible system gantry that also allows for acquisition of optical images such that PET images can be registered to an anatomical reference. Data is acquired in a super list-mode form where raw PMT signals and event times are accumulated in event lists for each camera. Event parameter estimation of position and energy is carried out with maximum likelihood methods using careful camera calibrations accomplished with collimated beams of 511-keV photons and a new iterative mean-detector-response-function processing routine. Intrinsic lateral spatial resolution for 511-keV photons was found to be approximately 1.6 mm in each direction. Lists of coincidence pairs are found by comparing event times in the two independent camera lists. A timing window of 30 nanoseconds is used. By bringing the 4.5 inch square cameras in close proximity, with a 32-mm separation for mouse imaging, a solid angle coverage of ∼75% partially compensates for the relatively low stopping power in the 5-mm-thick NaI crystals to give a measured sensitivity of up to 0.7%. An NECR analysis yields 11,000 pairs per second with 84 μCi of activity. A list-mode MLEM reconstruction algorithm was developed to reconstruct objects in an 88 x 88 x 30 mm field of view. Tomographic resolution tests with a phantom suggest a lateral resolution of 1.5 mm and a slightly degraded resolution of 2.5 mm in the direction normal to the camera faces. The system can also be configured to provide (99m)Tc planar scintigraphy images.
Selected biological studies of inflammation, apoptosis, tumor metabolism, and bone osteogenic activity are presented. list-mode likelihood reconstruction maximum-likelihood PET preclinical Biomedical Engineering energy estimation GPGPU
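The coincidence step described in the ModPET abstract (comparing event times in the two camera lists with a 30-nanosecond window) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `find_coincidences`, the requirement that both lists are already sorted by time, the greedy one-to-one pairing, and the sample timestamps are all hypothetical.

```python
def find_coincidences(times_a, times_b, window_ns=30.0):
    """Pair events from two time-sorted lists (timestamps in ns).

    Returns index pairs (i, j) where |times_a[i] - times_b[j]| <= window_ns.
    Each event is used at most once (greedy pairing, an assumption here).
    """
    pairs = []
    i = j = 0
    while i < len(times_a) and j < len(times_b):
        dt = times_a[i] - times_b[j]
        if abs(dt) <= window_ns:
            pairs.append((i, j))  # coincidence found: consume both events
            i += 1
            j += 1
        elif dt < 0:
            i += 1  # event in A is too early to match B[j] or any later B event
        else:
            j += 1  # event in B is too early to match A[i] or any later A event
    return pairs


# Hypothetical timestamps (ns) for the two cameras.
camera_a = [100.0, 250.0, 900.0, 1500.0]
camera_b = [110.0, 600.0, 880.0, 1540.0]
coincident = find_coincidences(camera_a, camera_b)
```

Because both lists are walked once with two cursors, the cost is linear in the total number of events, which matters when list-mode data holds millions of events per camera.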
126	Dissecting genetic interactions in complex traits Hemani, Gibran January 2012 Of central importance in the dissection of the components that govern complex traits is understanding the architecture of natural genetic variation. Genetic interaction, or epistasis, constitutes one aspect of this, but epistatic analysis has been largely avoided in genome wide association studies because of statistical and computational difficulties. This thesis explores both issues in the context of two-locus interactions. Initially, through simulation and deterministic calculations it was demonstrated that not only can epistasis maintain deleterious mutations at intermediate frequencies when under selection, but that it may also have a role in the maintenance of additive variance. Based on the epistatic patterns that are evolutionarily persistent, and the frequencies at which they are maintained, it was shown that exhaustive two dimensional search strategies are the most powerful approaches for uncovering both additive variance and the other genetic variance components that are co-precipitated. However, while these simulations demonstrate encouraging statistical benefits, two dimensional searches are often computationally prohibitive, particularly with the marker densities and sample sizes that are typical of genome wide association studies. To address this issue different software implementations were developed to parallelise the two dimensional triangular search grid across various types of high performance computing hardware. Of these, particularly effective was using the massively-multi-core architecture of consumer level graphics cards. While the performance will continue to improve as hardware improves, at the time of testing the speed was 2-3 orders of magnitude faster than CPU based software solutions that are in current use.
Not only does this software enable epistatic scans to be performed routinely at minimal cost, but it is now feasible to empirically explore the false discovery rates introduced by the high dimensionality of multiple testing. Through permutation analysis it was shown that the significance threshold for epistatic searches is a function of both marker density and population sample size, and that because of the correlation structure that exists between tests the threshold estimates currently used are overly stringent. Although the relaxed threshold estimates constitute an improvement in the power of two dimensional searches, detection is still most likely limited to relatively large genetic effects. Through direct calculation it was shown that, in contrast to the additive case where the decay of estimated genetic variance was proportional to falling linkage disequilibrium between causal variants and observed markers, for epistasis this decay was exponential. One way to rescue poorly captured causal variants is to parameterise association tests using haplotypes rather than single markers. A novel statistical method that uses a regularised parameter selection procedure on two locus haplotypes was developed, and through extensive simulations it can be shown that it delivers a substantial gain in power over single marker based tests. Ultimately, this thesis seeks to demonstrate that many of the obstacles in epistatic analysis can be ameliorated, and with the current abundance of genomic data gathered by the scientific community direct search may be a viable method to qualify the importance of epistasis. 572.8
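The "two dimensional triangular search grid" mentioned in this abstract refers to testing every unordered pair of markers once, which gives n(n-1)/2 tests for n markers. One common way to spread such a grid over many parallel workers is to give each worker a linear index and map it back to a marker pair. The sketch below illustrates that idea only; the function names `n_pairs` and `pair_from_index` and the marker counts are hypothetical, and the thesis's actual work decomposition is not described in the abstract.

```python
import math


def n_pairs(n_markers):
    """Number of unordered marker pairs in the triangular grid."""
    return n_markers * (n_markers - 1) // 2


def pair_from_index(k):
    """Map a linear index k >= 0 to a pair (i, j) with 0 <= j < i.

    Pairs are ordered (1,0), (2,0), (2,1), (3,0), ...; i is the largest
    integer with i*(i-1)/2 <= k, computed exactly with integer sqrt.
    """
    i = (1 + math.isqrt(8 * k + 1)) // 2
    j = k - i * (i - 1) // 2
    return i, j


# Enumerate the full grid for a toy example with 5 markers (10 pairs).
toy_pairs = [pair_from_index(k) for k in range(n_pairs(5))]
```

With this mapping each worker needs only its own index to find its pair, so no shared list of pairs has to be stored. For a hypothetical scan of 300,000 markers, `n_pairs` gives 44,999,850,000 tests, which indicates why exhaustive scans are computationally demanding.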
127	Calcul en n-dimensions sur GPU Bergeron, Arnaud 04 1900 The source code of the library developed accompanies this deposit in the state it was in at the time. A more up-to-date version can be found on GitHub (http://github.com/abergeron). / Scientific computing on GPUs (graphics processing units) is on the rise, specifically in machine learning. This thesis presents the implementation of an efficient multidimensional array on the GPU. We begin with a review of what currently implements similar functionality and the disadvantages of a fragmented approach. We focus on packages that have a Python interface. We explain techniques to optimize execution such as order reduction and automatic asynchronous computation. Finally, we present the functionality of the module developed for this thesis. Scientific computing Python GPGPU
128	Dynamická simulace tuhých těles na programovatelných GPU / Dynamic simulation of rigid bodies using programmable GPUs Cséfalvay, Szabolcs January 2011 The goal of this work is to create a program which simulates the dynamics of rigid bodies and their systems using GPGPU with an emphasis on speed and stability. The result is a physics engine that uses the CUDA architecture. It runs entirely on the GPU and handles collision detection, collision response and different forces like friction, gravity, contact forces, etc. It supports spheres, rods (which are similar to cylinders), springs, boxes and planes. It is also possible to construct compound objects by connecting basic primitives.
129	Using Graphical Processors to Implement Radio Base Station Control Plane Functions / Implementera radiobasstationers kontrollplans funktioner med grafikprocessor Ringman, Noak January 2019 Today more devices are being connected to the Internet via mobile networks. With more devices in mobile networks, the workload on radio base stations increases. Radio base stations must be energy efficient and cheap, which makes high-performance central processing units (CPUs) a poor choice for meeting the increasing workload. An alternative could be a graphics processing unit (GPU), which has a different hardware architecture more suitable for data-parallel problems. This thesis investigated the parallelisation possibilities in the user-equipment handling part of radio base stations, with the aim of using a GPU to take advantage of the parallelism. The investigation found a mix of pipeline and data parallelism in user-equipment handling, a form of parallelism suitable for GPU execution. The tasks which handle user equipment were divided into smaller communication-free sub-tasks. Sub-task batches of user equipment were collected and offloaded to a GPU. A peak throughput gain of 62.2 times over the single-threaded CPU was achieved, but latency increased by more than an order of magnitude. For all workloads, latency was at least 1.24 times higher for the GPU implementations than for the CPU implementations. A radio base station with far more user equipment than those existing today was simulated. For this radio base station, a gain of 14.0 times over the single-threaded CPU was achieved, while the latency increased by 2.4 times. To really make use of a GPU implementation, the number of user equipment (the load) must be higher than in radio base stations existing today. Mobile networks Radio base stations Control plane GPU GPGPU CUDA OpenCL Computer Engineering
130	Quelques applications de la programmation des processeurs graphiques à la simulation neuronale et à la vision par ordinateur Chariot, Alexandre 16 December 2008 Driven largely by the video game industry, research and development of hardware for generating synthetic images, such as graphics cards (or GPUs, Graphics Processing Units), has grown tremendously in recent years. The increase in power and ... [MATH] Mathematics GPU GPGPU Parallel programming Neural networks Computer vision Stereo vision Interest points Matching
