About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
151

Simulating Wireless Communication : Development of a graphical model for ray tracing

Hamrin Svanborg, Ossian, Eltonson, Niclas January 2023 (has links)
Within mobile networks, 5G is the latest standard and is expected to enable new applications. It offers a wide range of available frequencies, but the higher frequencies, which offer the best capacity, have a short range. This can be counteracted by placing more connection points with short distances between them. Placing connection points usually requires a person to be on site to test placements and ensure minimal blocking of radio signals in an area, which can be both costly and time-consuming. This project aims to develop a ray tracing program to simulate and visualize how a ray is affected when it collides with different materials on its way from a connection point/transmitter to a receiver. The visualization is done in the Unity graphics engine, and the simulation is done in a ray tracing engine based on the ray tracing API OptiX. To achieve this goal, previous works were investigated, and technical and theoretical studies of ray tracing and material interaction were conducted. In addition, different software programs relevant to the project were evaluated, including OptiX and Unity. The result of this work is a ray tracing program where the effect on a ray's path when reflecting off and refracting through objects of certain materials can be calculated, which can be used to determine whether a connection can be established, and the result can be visualized in Unity.
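The thesis itself builds on OptiX and Unity; as a rough illustration of the geometry such a tracer must evaluate at every material hit, the sketch below computes reflected and refracted ray directions from the reflection formula and Snell's law. This is a minimal sketch in CUDA-style C++; the function names and the relative refractive index parameter eta are illustrative assumptions, not details taken from the thesis.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Reflected direction: r = d - 2(d.n)n, with d the incoming direction and
// n the unit surface normal (both assumed normalized).
__host__ __device__ float3 reflect_dir(float3 d, float3 n) {
    float dn = d.x*n.x + d.y*n.y + d.z*n.z;
    return make_float3(d.x - 2.f*dn*n.x, d.y - 2.f*dn*n.y, d.z - 2.f*dn*n.z);
}

// Refracted direction via Snell's law; eta = n1/n2 is the relative
// refractive index. Returns false on total internal reflection.
__host__ __device__ bool refract_dir(float3 d, float3 n, float eta, float3* out) {
    float cosi = -(d.x*n.x + d.y*n.y + d.z*n.z);   // cosine of incidence angle
    float k = 1.f - eta*eta*(1.f - cosi*cosi);     // squared cosine of exit angle
    if (k < 0.f) return false;                     // total internal reflection
    float c = eta*cosi - sqrtf(k);
    *out = make_float3(eta*d.x + c*n.x, eta*d.y + c*n.y, eta*d.z + c*n.z);
    return true;
}
```

A full link-budget simulation would additionally attenuate the ray's power at each interaction (e.g., with Fresnel coefficients per material), since that attenuation is what ultimately decides whether a connection can be established.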
152

Performance Evaluation of a Signal Processing Algorithm with General-Purpose Computing on a Graphics Processing Unit

Appelgren, Filip, Ekelund, Måns January 2019 (has links)
Graphics Processing Units (GPUs) are increasingly being used for general-purpose programming instead of their traditional graphical tasks, because their raw computational power in some cases gives them an advantage over the traditionally used Central Processing Unit (CPU). This thesis therefore sets out to identify the performance of a GPU in a correlation algorithm, and which parameters have the greatest effect on GPU performance. Performance was measured quantitatively, using a clock library in C++ to time the algorithm as the problem size increased. The initial problem size was set to 2^8 and increased exponentially to 2^21. The results show that smaller sample sizes perform better on the serial CPU implementation, but that the parallel GPU implementations start outperforming the CPU between problem sizes of 2^9 and 2^10. It became apparent that GPUs benefit from larger problem sizes, mainly because of the memory overhead costs involved in allocating and transferring data. Further, the algorithm under evaluation is not well suited to a parallelized implementation due to a high amount of branching: such logic can lead to warp divergence, which can drastically lower performance. Keeping logic to a minimum and minimizing the number of memory transfers are vital in order to reach high performance with a GPU.
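A minimal sketch of the kind of measurement loop the thesis describes: timing a kernel over exponentially growing problem sizes with std::chrono. The cross-correlation kernel and all names here are illustrative assumptions, not the authors' code; note the cudaDeviceSynchronize, without which the host clock would miss the asynchronous kernel's runtime.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: one cross-correlation lag per thread,
// r[k] = sum_i x[i] * y[i + k].
__global__ void xcorr(const float* x, const float* y, float* r, int n, int lags) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= lags) return;
    float acc = 0.f;
    for (int i = 0; i + k < n; ++i) acc += x[i] * y[i + k];
    r[k] = acc;
}

int main() {
    for (int e = 8; e <= 21; ++e) {                // problem sizes 2^8 .. 2^21
        int n = 1 << e, lags = 256;
        float *x, *y, *r;                          // contents irrelevant; timing only
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMalloc(&r, lags * sizeof(float));
        auto t0 = std::chrono::steady_clock::now();
        xcorr<<<(lags + 255) / 256, 256>>>(x, y, r, n, lags);
        cudaDeviceSynchronize();                   // wait so the host clock sees the kernel
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("2^%d: %lld us\n", e, us);
        cudaFree(x); cudaFree(y); cudaFree(r);
    }
    return 0;
}
```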
153

Optimization of American option pricing through GPU computing

Greinsmark, Hadar, Lindström, Erik January 2017 (has links)
Over the last decades, the market for financial derivatives has grown dramatically to values of global importance. With the digital automation of the markets, programs able to efficiently value financial derivatives have become key to market competitiveness and have thus garnered considerable interest. This report explores the potential efficiency gains of employing modern GPU computing technology to price financial options, using the binomial option pricing model. The model is implemented on both CPU and GPU hardware, and the results are compared in terms of computational efficiency. According to this thesis, GPU computing can considerably improve option pricing runtimes.
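A hedged sketch of how one step of the binomial model parallelizes on a GPU: the tree is rolled back one time step per kernel launch, with one thread per node. In the standard Cox-Ross-Rubinstein setup, p = (e^(r dt) - d) / (u - d) and disc = e^(-r dt); the kernel and parameter names below are illustrative, not the thesis's implementation.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One backward-induction step of the binomial model for an American put.
// next[] holds option values at time step t+1 (t+2 nodes); curr[] receives
// values at step t (t+1 nodes). Launched once per time step, latest first.
__global__ void binomial_step(const float* next, float* curr, int t,
                              float S0, float K, float u, float d,
                              float p, float disc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // node index, 0..t
    if (i > t) return;
    float cont = disc * (p * next[i + 1] + (1.f - p) * next[i]); // continuation value
    float S = S0 * powf(u, (float)i) * powf(d, (float)(t - i));  // asset price at node
    float exercise = fmaxf(K - S, 0.f);                          // early-exercise payoff (put)
    curr[i] = fmaxf(cont, exercise);                             // American: max of the two
}
```

The host would initialize the leaf payoffs at expiry, then ping-pong the curr/next buffers from t = N-1 down to t = 0, where curr[0] is the option price.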
154

Evaluating GPU for Radio Algorithms : Assessing Performance and Power Consumption of GPUs in Wireless Radio Communication Systems

André, Albin January 2023 (has links)
This thesis evaluates the viability of a Graphics Processing Unit (GPU) for the signal processing tasks associated with radio base stations in cellular networks. The development of Application Specific Integrated Circuits (ASICs) is lengthy and highly expensive, but the resulting circuits are efficient in terms of power consumption. Implementations of interpolation, decimation, and digital predistortion algorithms were developed using Nvidia's parallel programming platform CUDA on an Nvidia RTX A4000 graphics card, and their performance was tested in terms of throughput, latency, and energy consumption. It was found that the GPU implementations could not compete with ASIC solutions in power efficiency, and that latency was too high for real-time signal processing applications like interpolation and decimation, especially because of the large sample buffers needed to keep the GPU occupied.
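As an illustration of one of the named tasks, the sketch below interpolates a signal by an integer factor L using a zero-stuffed FIR filter, one output sample per thread. This is a textbook polyphase-style formulation under assumed names, filter length, and data layout; it is not the thesis's kernel.

```cuda
#include <cuda_runtime.h>

// Interpolate by factor L: conceptually insert L-1 zeros between input
// samples, then FIR-filter with h[NTAPS]. Each thread produces one output
// sample; only taps aligned with real (non-stuffed) samples contribute.
#define NTAPS 64
__global__ void fir_interp(const float* x, int nx, const float* h,
                           float* y, int ny, int L) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;   // output sample index
    if (m >= ny) return;
    float acc = 0.f;
    for (int k = m % L; k < NTAPS; k += L) {         // k = m (mod L) hits real samples
        int i = (m - k) / L;                         // corresponding input index
        if (i >= 0 && i < nx) acc += h[k] * x[i];
    }
    y[m] = L * acc;                                  // gain compensation for zero-stuffing
}
```

Decimation is the mirror image: filter first, then keep every L-th output, which maps equally naturally onto one thread per retained sample.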
155

Optimizing Systems for Deep Learning Applications

Albahar, Hadeel Ahmad 01 March 2023 (has links)
Modern systems for Machine Learning (ML) workloads support heterogeneous workloads and resources. However, existing resource managers in these systems do not differentiate between heterogeneous GPU resources. Moreover, users are usually unaware of the type and amount of GPU resources that would be appropriate and sufficient for their ML jobs. In this thesis, we analyze the performance of ML training and inference jobs and identify the ML model and GPU characteristics that impact this performance. We then propose ML-based prediction models to accurately determine appropriate and sufficient resource requirements, improving job latency and GPU utilization in the cluster. / Doctor of Philosophy / Every day we interact with software applications in areas such as social media, e-commerce, healthcare, and finance. These applications rely on different computing systems, as well as artificial intelligence, to deliver users the best service and experience. In this dissertation, we present optimizations that improve the performance of these artificial intelligence applications while at the same time improving the performance and utilization of the systems and the heterogeneous resources they run on. We propose machine learning models that learn from historical data on application performance, together with application and resource characteristics, to predict the necessary and sufficient resource requirements for these applications, ensuring optimal performance for both the application and the underlying system.
156

Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems

Xiao, Shucai 03 July 2013 (has links)
Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications. Using SIMD (Single Instruction, Multiple Data) components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the delivery of programming models such as the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks become even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to limitations in the current programming models: (1) there are no natively supported intra- and inter-GPU cooperative mechanisms; (2) current programming models only support the utilization of locally installed GPUs; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication. To reduce the mapping effort and to better utilize GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators. We do so through the transparent virtualization of GPUs, which enables applications to view all GPUs in the system as if they were installed locally, so that every GPU in the system can be used as a local GPU. Moreover, GPU virtualization is a key capability for supporting the notion of "GPU as a service." Specifically, we propose the Virtual OpenCL (VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three respects: (1) reducing the data transfer overhead between the local and remote nodes; (2) introducing GPU synchronization to reduce the overhead of switching back and forth between host and device when multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) supporting live virtual GPU migration for quick system maintenance and load rebalancing across GPUs. With these optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) the performance improvement from each of our optimization strategies; (2) the overhead of using remote GPUs, measured on several microbenchmark suites as well as a few real-world applications; and (3) the overhead and the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost. / Ph. D.
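The "GPU synchronization" optimization above replaces repeated kernel launches, used purely as global barriers, with a device-side barrier. The sketch below shows the modern CUDA analogue of that idea, a cooperative-groups grid-wide sync. This is an illustration only, not the VOCL mechanism itself (which predates this API); it requires compiling with -rdc=true and launching via cudaLaunchCooperativeKernel.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Two dependent phases in one kernel: phase 2 may read any element that
// phase 1 wrote, so all blocks must cross a global barrier in between.
// Without a device-side barrier this would require two kernel launches.
__global__ void two_phase(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] = data[i] * 2.f;          // phase 1
    grid.sync();                                  // grid-wide barrier replaces a relaunch
    if (i > 0 && i < n) data[i] += data[i - 1];   // phase 2 reads a neighbor's phase-1 result
}
// Launch note: cooperative kernels must be started with
// cudaLaunchCooperativeKernel(...) and all blocks must be co-resident.
```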
157

Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures

Panwar, Lokendra Singh 21 October 2014 (has links)
Today, heterogeneous computing has truly reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new challenges for the scientific community, such as selecting the best processor for an application, devising effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approaches to address some of these significant issues. First, we present a fully automated approach to project the relative performance of an OpenCL program across different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run a given kernel faster; use cases include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs. We then present our approach to accelerating a seismology modeling application based on the finite difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the computational complexities involved in porting such applications to GPUs and present our strategy for efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive hardware-specific optimizations on the GPU. The performance evaluation of our approach delivers a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. Finally, we study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, in a seismology modeling application and discuss the lessons learned. / Master of Science
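For context, the computational core of an FDM seismology code is a stencil update like the one sketched below: one explicit time step of the 2D acoustic wave equation, one thread per grid point. The grid layout and names are illustrative assumptions, not the thesis's actual kernel.

```cuda
#include <cuda_runtime.h>

// One explicit time step of the 2D acoustic wave equation,
// u_tt = c^2 (u_xx + u_yy), with central differences in space and a
// leapfrog step in time. prev/curr/next are nx*ny grids;
// r2 = (c*dt/dx)^2 is the squared Courant ratio.
__global__ void fdm_step(const float* prev, const float* curr, float* next,
                         int nx, int ny, float r2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;  // skip boundary
    int i = y * nx + x;
    float lap = curr[i - 1] + curr[i + 1] + curr[i - nx] + curr[i + nx]
              - 4.f * curr[i];                                  // 5-point Laplacian
    next[i] = 2.f * curr[i] - prev[i] + r2 * lap;               // leapfrog update
}
```

In the multi-GPU, MPI-distributed setting the thesis studies, each rank would own a subgrid and exchange one-cell halos with its neighbors between time steps.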
158

Real-time Visualization of Massive 3D Models on GPU Parallel Architectures

Peng, Chao 24 April 2013 (has links)
Real-time rendering of massive 3D models has been recognized as a challenging task due to the limited computational power and memory available in a workstation. Most existing acceleration techniques, such as mesh simplification algorithms with hierarchical data structures, suffer from their inherently sequential execution. As data complexity increases due to fundamental advances in modeling and simulation technologies, 3D models become complex and require gigabytes of storage. Consequently, visualizing such large datasets becomes a computationally intensive process for which sequential solutions are unable to satisfy the demands of real-time rendering. Recently, the Graphics Processing Unit (GPU) has been praised as a massively parallel architecture, not only for its significant performance improvements but also for its programmability for general-purpose computation. Today's GPUs allow researchers to solve problems with fine-grained parallel implementations. In this dissertation, I concentrate on the design of parallel algorithms for real-time rendering of massive 3D polygonal models on modern GPU architectures. The resulting rendering system supports high-performance visualization of 3D models composed of hundreds of millions of polygons on a single commodity workstation. / Ph. D.
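One fine-grained step such a GPU-parallel renderer might perform is choosing a level of detail per object from its projected screen-space size. The sketch below is a generic illustration of that idea under assumed names and an arbitrary pixel threshold; it is not the dissertation's algorithm.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Pick a level of detail per object: project the bounding-sphere radius to
// an approximate on-screen size and drop one detail level per octave of
// shrink. bounds = packed (x, y, z, radius); cam = camera position.
__global__ void select_lod(const float4* bounds, float3 cam, float fov_scale,
                           int* lod, int n, int max_level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 b = bounds[i];
    float dx = b.x - cam.x, dy = b.y - cam.y, dz = b.z - cam.z;
    float dist = sqrtf(dx*dx + dy*dy + dz*dz) + 1e-6f;
    float pixels = fov_scale * b.w / dist;        // approx projected size in pixels
    int level = 0;                                // level 0 = full detail
    while (pixels < 256.f && level < max_level) { // illustrative 256-pixel threshold
        pixels *= 2.f;
        ++level;
    }
    lod[i] = level;
}
```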
159

Analysis, Implementation and Evaluation of Direction Finding Algorithms using GPU Computing

Andersdotter, Regina January 2022 (has links)
Direction Finding (DF) algorithms are used by the Swedish Defence Research Agency (FOI) in the context of electronic warfare against radio. Parallelizing these algorithms on a Graphics Processing Unit (GPU) might improve performance and thereby increase military support capabilities. This thesis selects the DF algorithms Correlative Interferometer (CORR), Multiple Signal Classification (MUSIC), and Weighted Subspace Fitting (WSF), and examines through analysis, implementation, and evaluation to what extent GPU implementation of these algorithms is suitable. First, six general criteria for GPU suitability are formulated. The three algorithms are then analyzed against these criteria, showing MUSIC and WSF to be 58% suitable, closely followed by CORR at 50%. MUSIC is selected for implementation, and an open source implementation is extended into three versions: a multicore CPU version, a GPU version (with the Eigenvalue Decomposition (EVD) and the pseudo-spectrum calculation performed on the GPU), and a MIXED version (with only the pseudo-spectrum calculation on the GPU). These versions are evaluated for angle resolutions between 1° and 0.025° and CUDA block sizes between 8 and 1024. The GPU version is faster than the CPU version for angle resolutions finer than 0.1°, with a largest measured speedup of 1.4 times; block size has no large impact on total runtime. In conclusion, the overall results indicate that implementing MUSIC using GPU computing is not entirely suitable, yet somewhat beneficial at fine angle resolutions.
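The pseudo-spectrum calculation that dominates the angle sweep evaluates P(theta) = 1 / ||En^H a(theta)||^2, where a(theta) is the array steering vector and En the noise subspace from the EVD. The sketch below is a hedged illustration for a uniform linear array, one angle per thread; the array geometry, layout, and names are assumptions, not the thesis's implementation.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <math.h>

// MUSIC pseudo-spectrum over a grid of angles for an M-element uniform
// linear array. En is the M x K noise subspace (column-major); d_over_l is
// element spacing over wavelength. Each thread evaluates one angle:
// P(theta) = 1 / sum_k |En[:,k]^H a(theta)|^2; peaks mark arrival directions.
__global__ void music_spectrum(const cuFloatComplex* En, int M, int K,
                               const float* theta, float* P, int nAngles,
                               float d_over_l) {
    const float TWO_PI = 6.2831853f;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nAngles) return;
    float s = sinf(theta[t]);
    float denom = 0.f;
    for (int k = 0; k < K; ++k) {
        cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
        for (int m = 0; m < M; ++m) {
            float phase = TWO_PI * d_over_l * m * s;          // steering phase at element m
            cuFloatComplex a = make_cuFloatComplex(cosf(phase), sinf(phase));
            acc = cuCaddf(acc, cuCmulf(cuConjf(En[m + k * M]), a)); // acc += conj(En)*a
        }
        denom += cuCrealf(acc) * cuCrealf(acc) + cuCimagf(acc) * cuCimagf(acc);
    }
    P[t] = 1.f / (denom + 1e-12f);
}
```

The angle grid is what makes resolution so costly: halving the resolution doubles nAngles, which is consistent with the GPU version only paying off at fine resolutions.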
160

Comparison of Technologies for General-Purpose Computing on Graphics Processing Units

Sörman, Torbjörn January 2016 (has links)
The computational capacity of graphics cards for general-purpose computing has progressed fast over the last decade. A major reason is computationally heavy computer games, where the standards for performance and high-quality graphics constantly rise. Another reason is better-suited technologies for programming the graphics cards. Combined, the product is devices with high raw performance and the means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Technologies are primarily compared by benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance: the benchmark application found the difference in execution time between the fastest technology, CUDA, and the slowest, OpenCL, to be a factor of two. The benchmark application also found that the older technologies, OpenGL and DirectX, are competitive with CUDA and OpenCL in terms of resulting raw performance.
