781

Fully Automatic Upper Airway Segmentation and Surfacing on a GPU from Cone-beam CT Volumes

Farrell, Michael L. January 2009 (has links)
No description available.
782

Acceleration of Massive MIMO algorithms for Beyond 5G Baseband processing

Nihl, Ellen, de Bruijckere, Eek January 2023 (has links)
As the world becomes more globalised, user equipment such as smartphones and Internet of Things devices requires increasingly more data, which increases the demand for wireless data traffic. Hence, the development of next-generation networks (5G and beyond) focuses mainly on increasing the bitrate and decreasing the latency. A crucial technology for 5G and beyond is massive MIMO. In a massive MIMO system, a detector processes the received signals from multiple antennas to decode the transmitted data and extract useful information. Detection has been implemented in many ways, and one of the most widely used algorithms is Zero Forcing (ZF). This thesis presents a novel parallel design that accelerates the ZF algorithm using the Cholesky decomposition. It is implemented on a GPU, written in the CUDA programming language, and compared to existing state-of-the-art implementations in terms of latency and throughput. The implementation is also validated against a MATLAB implementation. This research demonstrates promising performance for GPUs running massive MIMO detection algorithms. Our approach achieves a speedup factor of 350 over a serial version of the implementation, and a throughput 160 times greater than a comparable GPU-based approach. It does, however, reach 2.4 times lower throughput than a solution that employed application-specific hardware. Given these promising results, we advocate continued research in this area to further optimise detection algorithms and enhance their performance on GPUs, potentially achieving even higher throughput and lower latency.
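For context, the ZF detector computes x_hat = (H^H H)^{-1} H^H y; forming the Gram matrix A = H^H H and Cholesky-factoring it as A = L L^H replaces the explicit inverse with two cheap triangular solves, and the per-subcarrier systems are independent, which is what a GPU exploits. Below is a minimal, hedged CUDA sketch of such a batched solve, simplified to real-valued K x K systems with one thread per system; the dimensions and names are illustrative assumptions, not the thesis's implementation.

```cuda
// Hypothetical sketch of batched zero-forcing detection via Cholesky,
// simplified to real-valued matrices. One thread solves one K x K system
// (one subcarrier); a real implementation would also parallelise within
// the factorisation.
#include <cuda_runtime.h>

constexpr int K = 8; // number of user streams (assumption)

__global__ void zf_cholesky_batched(const float* A,  // batch of K*K Gram matrices, A = H^T H
                                    const float* b,  // batch of K-vectors,        b = H^T y
                                    float* x,        // batch of K-vector estimates
                                    int batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= batch) return;

    float L[K][K] = {};                 // Cholesky factor, A = L * L^T
    const float* a = A + i * K * K;

    for (int r = 0; r < K; ++r) {       // standard Cholesky-Crout factorisation
        for (int c = 0; c <= r; ++c) {
            float s = a[r * K + c];
            for (int k = 0; k < c; ++k) s -= L[r][k] * L[c][k];
            L[r][c] = (r == c) ? sqrtf(s) : s / L[c][c];
        }
    }

    float ytmp[K];
    for (int r = 0; r < K; ++r) {       // forward solve  L * ytmp = b
        float s = b[i * K + r];
        for (int k = 0; k < r; ++k) s -= L[r][k] * ytmp[k];
        ytmp[r] = s / L[r][r];
    }
    for (int r = K - 1; r >= 0; --r) {  // backward solve L^T * x = ytmp
        float s = ytmp[r];
        for (int k = r + 1; k < K; ++k) s -= L[k][r] * x[i * K + k];
        x[i * K + r] = s / L[r][r];
    }
}
```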
783

EFFICIENT AND PRODUCTIVE GPU PROGRAMMING

Mengchi Zhang (13109886) 28 July 2022 (has links)
Productive programmable accelerators, like GPUs, have been developed for generations to support programming features. Ever-increasing performance improves the usability of programming features on GPUs, and these features further ease the porting of code and data structures from CPU to GPU. However, GPU programming features, such as function calls and runtime polymorphism, have not been well explored or optimized.

I identify efficient and productive GPU programming as a potential area to exploit. Although many programming paradigms are well studied and efficiently supported on CPU architectures, their performance on novel accelerators, like GPUs, has rarely been studied or evaluated. For instance, programming with functions is a commonplace paradigm that shapes software with modularity and simplifies code through reusability. A large body of work has been proposed to alleviate function-calling overhead on CPUs; however, few papers have examined its deficiencies on GPUs. On the other hand, polymorphism amplifies an object's behaviors at runtime. A body of work targets efficient polymorphism on CPUs, but no work has discussed this feature in GPU contexts.

In this dissertation, I discuss these two programming features on GPU architectures. First, I performed the first study to identify the deficiencies of GPU polymorphism. I created micro-benchmarks to evaluate virtual function overhead in controlled settings, and the first GPU polymorphic benchmark suite, ParaPoly, to investigate real-world scenarios. The micro-benchmarks indicated that virtual function overhead is usually negligible but can cause up to a 7x slowdown. Virtual functions in ParaPoly show a geometric mean of 77% overhead on GPUs compared to the functions' inlined versions. Second, I proposed two novel techniques that determine an object's type solely from its address pointer to improve GPU polymorphism. The first technique, Coordinated Object Allocation and function Lookup (COAL), is a software-only technique that uses the object's address to determine its type. The second technique, TypePointer, requires hardware modification to embed the object's type information in its address pointer. COAL achieves 80% and 6% improvements, and TypePointer achieves 90% and 12% improvements, over contemporary CUDA and our type-based SharedOA, respectively.

Considering the growth of GPU programs, function calls have become a pervasive paradigm on GPUs. I also identified the overhead of excessive register spilling caused by function calls on GPUs. To diminish this cost, I proposed a novel Massively Multithreaded Register Windowing technique with a Variable Size Register Window and Register-Conscious Warp Scheduling. Our techniques improve representative workloads by a geometric mean of 1.18x with only 1.8% hardware storage overhead.
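To make the pointer-tagging idea concrete: a 64-bit address leaves its upper bits unused on current hardware, so a type ID can be embedded in the pointer itself and a kernel can dispatch on it with a switch instead of loading a vtable pointer from memory. The following CUDA sketch illustrates that general technique under the stated bit-layout assumption; the classes, constants, and tag width are invented for illustration and are not the dissertation's COAL or TypePointer implementation.

```cuda
// Sketch: stash a type ID in the (normally unused) upper 16 bits of a
// 64-bit address so a kernel can branch on an object's type without a
// vtable load. All names and the bit layout are illustrative.
#include <cstdint>

constexpr int      TAG_SHIFT = 48;                      // assumes the top 16 bits are unused
constexpr uint64_t PTR_MASK  = (1ull << TAG_SHIFT) - 1;

__host__ __device__ inline void* tag_ptr(void* p, uint16_t type_id) {
    return reinterpret_cast<void*>(
        (reinterpret_cast<uint64_t>(p) & PTR_MASK) |
        (static_cast<uint64_t>(type_id) << TAG_SHIFT));
}

__host__ __device__ inline uint16_t ptr_type(const void* p) {
    return static_cast<uint16_t>(reinterpret_cast<uint64_t>(p) >> TAG_SHIFT);
}

__host__ __device__ inline void* strip_tag(const void* p) {
    return reinterpret_cast<void*>(reinterpret_cast<uint64_t>(p) & PTR_MASK);
}

struct Circle { float r; };
struct Square { float s; };
enum : uint16_t { TYPE_CIRCLE = 1, TYPE_SQUARE = 2 };

__global__ void area(void** objs, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    void* p = objs[i];
    switch (ptr_type(p)) {              // type comes from the pointer itself
    case TYPE_CIRCLE: {
        auto* c = static_cast<Circle*>(strip_tag(p));
        out[i] = 3.14159f * c->r * c->r; break;
    }
    case TYPE_SQUARE: {
        auto* s = static_cast<Square*>(strip_tag(p));
        out[i] = s->s * s->s; break;
    }
    }
}
```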
784

Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices

Mahmoud Khairy A. Abdallah (12894191) 04 July 2022 (has links)
Moore's law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions that achieve considerable energy efficiency while still preserving programmability in the twilight of Moore's Law.

In the HPC and Deep Learning (DL) domains, the death of single-chip GPU performance scaling will usher in a renaissance of multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technologies will enable single-package systems, composed of multiple chiplets, that continue to scale even as per-chip transistor counts do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet or discrete GPU card, and the scheduling of the threads that use that data, are critical factors in system performance and power consumption.

Aside from the supercomputer space, general-purpose compute units are still the main driver of a data center's total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline's frontend. Coupled with this hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates, which has led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.

In this dissertation, I discuss these new paradigm shifts and address the following concerns: (1) how do we overcome the non-uniform memory access overhead of next-generation multi-chiplet GPUs in the era of DL-driven workloads? (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and request similarity? and (3) how do we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?
785

Grafikkort till parallella beräkningar

Music, Sani January 2012 (has links)
This study describes how we can use graphics cards for general-purpose computing, which differs from the most common field where graphics cards are used, multimedia. The study describes and discusses present-day alternatives for using graphics cards for general operations. In this study we use and describe the Nvidia CUDA architecture. The study describes how we can use graphics cards for general operations from the point of view that we already have programming knowledge in some high-level programming language and knowledge of how a computer works. We use accelerated libraries (Thrust and cuBLAS) on the graphics card to achieve our goals, which are software development and benchmarking. The results are programs that address particular problems (matrix multiplication, sorting, binary search, vector inverting), together with the execution time and speedup for these programs. The graphics card is compared to the processor running serially and the processor running in parallel. Results show a speedup of up to approximately 50 times compared to serial implementations on the processor.
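As a concrete taste of the accelerated-library approach used in the thesis, the sketch below sorts a device vector with Thrust and then runs a batch of GPU-side binary searches, two of the benchmarked operations. Sizes are arbitrary and error handling is omitted; this is an illustration, not the thesis's test code.

```cuda
// Minimal Thrust example: GPU sort followed by a batch of binary searches.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/sequence.h>

int main() {
    thrust::device_vector<int> data(1 << 20);
    thrust::sequence(data.begin(), data.end());       // fill with 0, 1, 2, ...
    thrust::sort(data.begin(), data.end());           // runs on the GPU

    thrust::device_vector<int> queries(3);
    queries[0] = 7; queries[1] = 42; queries[2] = 1 << 19;

    thrust::device_vector<bool> found(queries.size());
    thrust::binary_search(data.begin(), data.end(),   // one search per query
                          queries.begin(), queries.end(),
                          found.begin());
    return 0;
}
```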
786

Optimización estocástica acelerada con aplicación a la ingeniería de procesos

Damiani, Lucía 09 October 2019 (has links)
Nonlinear optimization problems with a medium to large number of variables, equations, and nonlinearities usually present a significant mathematical challenge. Although many technologies exist for their formulation and resolution, the most competitive ones have expensive proprietary licenses. Moreover, even with these commercial tools, considerable additional programming effort (reformulations, decompositions, etc.) is usually required to implement and solve this type of model. This thesis proposes the development of a nonlinear optimization tool based on metaheuristics, using free software resources, to allow our group to carry out research and transfer projects without depending on the costs associated with commercial licenses. In recent years, population-based metaheuristics have gained relevance because of their efficiency, ease of programming, ability to solve a wide range of problems, and potential to be combined with other algorithms to improve performance. In this work, one of these techniques, the particle swarm optimization (PSO) algorithm, is implemented to program and solve nonlinear optimization problems. Since optimization with PSO is often computationally expensive, the algorithm was parallelized on graphics processing units (GPUs) in order to exploit the implicit parallelism of the technique and take advantage of the wide availability of these low-cost devices in modern desktop computers.
The implemented PSO, in its serial and parallel versions, was tested with benchmark functions of varying difficulty, with and without constraints, that are widely used in the optimization literature. It was also applied to more complex and larger-scale models from the chemical engineering discipline. In all cases, the optimizer delivered acceptable performance in terms of both solution quality and speedup.
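The reason PSO parallelises so well on GPUs is that each particle's velocity and position update depends only on its own state and the swarm's best-known position. A minimal CUDA sketch of that update step follows, with one thread per particle coordinate; the hyper-parameters w, c1, and c2 are the usual PSO constants, and all names are illustrative rather than taken from the thesis code.

```cuda
// Hedged sketch of the data-parallel PSO update step. The curand states
// are assumed to have been initialised with curand_init beforehand.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void pso_update(float* x, float* v,
                           const float* pbest,        // per-particle best position
                           const float* gbest,        // swarm-wide best position
                           curandState* rng,
                           int n_particles, int dim,
                           float w, float c1, float c2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per coordinate
    if (i >= n_particles * dim) return;
    int d = i % dim;

    float r1 = curand_uniform(&rng[i]);
    float r2 = curand_uniform(&rng[i]);

    v[i] = w  * v[i]
         + c1 * r1 * (pbest[i] - x[i])                // pull toward own best
         + c2 * r2 * (gbest[d] - x[i]);               // pull toward swarm best
    x[i] += v[i];
}
```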
787

Real-Time Computed Tomography-based Medical Diagnosis Using Deep Learning

Goel, Garvit 24 February 2022 (has links)
Computed tomography has been widely used in medical diagnosis to generate accurate images of the body's internal organs. However, cancer risk is associated with high X-ray dose CT scans, limiting their applicability in medical diagnosis and telemedicine applications. CT scans acquired at a low X-ray dose produce low-quality images with noise and streaking artifacts. We therefore develop a deep learning-based CT image enhancement algorithm for improving the quality of low-dose CT images. Our algorithm uses a convolutional neural network called DenseNet and Deconvolution network (DDnet) to remove noise and artifacts from the input image. To evaluate its advantages in medical diagnosis, we use DDnet to enhance chest CT scans of COVID-19 patients. We show that image enhancement can improve the accuracy of COVID-19 diagnosis (~5% improvement), using a framework consisting of AI-based tools. For training and inference of the image enhancement AI model, we use a heterogeneous computing platform to accelerate execution and decrease turnaround time. Specifically, we use multiple GPUs in a distributed setup to exploit batch-level parallelism during training. We achieve an approximately 7x speedup with 8 GPUs running in parallel compared to training DDnet on a single GPU. For inference, we implement DDnet using OpenCL and evaluate its performance on a multi-core CPU, a many-core GPU, and an FPGA. Our OpenCL implementation is at least 2x faster than an analogous PyTorch implementation on each platform and achieves comparable performance between the CPU and the FPGA, while the FPGA operates at a much lower frequency. / Master of Science / Computed tomography has been widely used in the medical diagnosis of diseases such as cancer/tumors, viral pneumonia, and, more recently, COVID-19. However, the cancer risk associated with the X-ray dose in CT scans limits the use of computed tomography in biomedical imaging. We therefore develop a deep learning-based image enhancement algorithm that can be used with low X-ray dose computed tomography scanners to generate high-quality CT images. The algorithm uses a state-of-the-art convolutional neural network for increased performance and computational efficiency. Further, we use the image enhancement algorithm to develop a framework of AI-based tools to improve the accuracy of COVID-19 diagnosis. We test and validate the framework with clinical COVID-19 data. Our framework applies to the diagnosis of COVID-19 and its variants, as well as other diseases that can be diagnosed via computed tomography. We utilize high-performance computing techniques to reduce the execution time of training and testing the AI models in our framework. We also evaluate the efficacy of training and inference of the neural network on heterogeneous computing platforms, including multi-core CPUs, many-core GPUs, and field-programmable gate arrays (FPGAs), in terms of speed and power consumption.
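To make the computational core concrete: the dominant operation in a CNN-based enhancer such as DDnet is 2D convolution, which parallelises naturally with one GPU thread per output pixel. The sketch below shows this naive data-parallel form in CUDA for a single channel with zero padding; it is an illustrative simplification, not the thesis's DDnet or OpenCL code.

```cuda
// Naive single-channel 2D convolution: one thread computes one output pixel.
#include <cuda_runtime.h>

__global__ void conv2d(const float* img, const float* kern, float* out,
                       int h, int w, int k)           // k: odd kernel width
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int r = k / 2;
    float acc = 0.0f;
    for (int dy = -r; dy <= r; ++dy)                  // accumulate the window
        for (int dx = -r; dx <= r; ++dx) {
            int ix = x + dx, iy = y + dy;
            if (ix >= 0 && ix < w && iy >= 0 && iy < h)   // zero padding
                acc += img[iy * w + ix] * kern[(dy + r) * k + (dx + r)];
        }
    out[y * w + x] = acc;
}
```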
788

Halvtoning för realtidsrendering i dataspelsutveckling : En litteraturundersökning av forskning kring dithering-algoritmer / Halftoning for realtime rendering in computer game development : A literature review of research on dithering algorithms

Engström, Erik January 2024 (has links)
This thesis explores the use of various halftoning techniques, known as dithering, in real-time rendering for computer game development. Halftoning is a method for creating the illusion of more colors in images with limited color depth by using patterns of dots. What was once a solution to an optimization problem on older computers has, in modern use, become a stylistic choice. Through a literature review, the study fills a knowledge gap around halftoning as an aesthetic tool and discusses its applicability in real-time graphics for modern computer games. A chronological survey of well-known algorithms is presented, where each algorithm is rated according to its properties. The analysis focuses on the algorithms' efficiency, image quality, coherence between frames, parameter tunability, and implementation on modern hardware, particularly GPUs. Ethical and societal aspects are discussed, along with the algorithms' potential for future research.
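Since the survey centres on GPU implementation, a concrete example helps: ordered (Bayer) dithering is the most GPU-friendly of the classic algorithms, because every pixel is thresholded independently against a small tiled matrix with no error propagation between pixels (unlike Floyd-Steinberg error diffusion, which is inherently sequential). Below is a minimal CUDA sketch of that idea for a grayscale image; it illustrates the general technique and is not code from the thesis.

```cuda
// Ordered dithering: threshold each pixel against a tiled 4x4 Bayer matrix.
#include <cuda_runtime.h>

__constant__ float bayer4[16] = {                      // normalised 4x4 Bayer matrix
     0/16.f,  8/16.f,  2/16.f, 10/16.f,
    12/16.f,  4/16.f, 14/16.f,  6/16.f,
     3/16.f, 11/16.f,  1/16.f,  9/16.f,
    15/16.f,  7/16.f, 13/16.f,  5/16.f };

__global__ void ordered_dither(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float threshold = bayer4[(y & 3) * 4 + (x & 3)];   // tile the matrix
    out[y * w + x] = (in[y * w + x] > threshold) ? 1.0f : 0.0f;
}
```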
789

Automatic Data Allocation, Buffer Management And Data Movement For Multi-GPU Machines

Ramashekar, Thejas 10 1900 (has links) (PDF)
Multi-GPU machines are being increasingly used in high-performance computing. These machines are used both as standalone workstations to run computations on medium to large data sizes (tens of gigabytes) and as nodes in CPU-multi-GPU clusters handling very large data sizes (hundreds of gigabytes to a few terabytes). Each GPU in such a machine has its own memory and does not share an address space with either the host CPU or the other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. A significant body of scientific applications that utilize multi-GPU machines contain computations inside affine loop nests, i.e., loop nests that have affine bounds and affine array access functions. These include stencils, linear-algebra kernels, dynamic programming codes, and data-mining applications. Data allocation, buffer management, and coherency handling are critical steps that need to be performed to run affine applications on multi-GPU machines. Existing works that propose to automate these steps have limitations and inefficiencies in terms of allocation sizes, exploiting reuse, transfer costs, and scalability. An automatic multi-GPU memory manager that can overcome these limitations and enable applications to achieve scalable performance is highly desired.

One technique that has been used in certain memory management contexts in the literature is that of bounding boxes. The bounding box of an array, for a given tile, is the smallest hyper-rectangle that encapsulates all the array elements accessed by that tile. In this thesis, we exploit the potential of bounding boxes for memory management far beyond their current usage in the literature. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines, which we call the Bounding Box based Memory Manager (BBMM). BBMM is a compiler-assisted runtime memory manager. At compile time, it uses static analysis techniques to identify the set of bounding boxes accessed by a computation tile. At run time, it uses bounding-box set operations such as union, intersection, difference, and subset/superset tests to compute a set of disjoint bounding boxes from the set identified at compile time. It also exploits the architectural capability of GPUs to perform fast transfers of rectangular (strided) regions of memory, and hence performs all data transfers in terms of bounding boxes.

BBMM uses these techniques to automatically allocate and manage the data required by applications (suitably tiled and parallelized for GPUs). This allows it to (1) allocate only as much data as is required (or close to it) by the computations running on each GPU; (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead; and (3) as a result, enable applications to maximize the utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments on a system with four GPUs across various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields at least 88% of the performance of hand-optimized OpenCL codes, and achieves excellent weak scaling.
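Two ingredients of BBMM lend themselves to a short illustration: hyper-rectangle set arithmetic (intersection, subset tests) and the GPU's fast strided transfer of rectangular regions, which CUDA exposes as cudaMemcpy2D. The sketch below shows both in 2D for brevity; the struct and function names are illustrative assumptions, not BBMM's actual interface.

```cuda
// Hedged sketch: 2D bounding-box arithmetic plus a single strided
// transfer of one box from a row-major host array to the GPU.
#include <cuda_runtime.h>
#include <algorithm>

struct Box {                       // [lo, hi) in each dimension
    int lo[2], hi[2];
    bool empty() const { return hi[0] <= lo[0] || hi[1] <= lo[1]; }
};

Box intersect(const Box& a, const Box& b) {
    Box r;
    for (int d = 0; d < 2; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
    }
    return r;                      // r.empty() signals disjoint boxes
}

bool contains(const Box& a, const Box& b) {  // is a a superset of b?
    return a.lo[0] <= b.lo[0] && a.lo[1] <= b.lo[1] &&
           a.hi[0] >= b.hi[0] && a.hi[1] >= b.hi[1];
}

// Copy one bounding box to the GPU in a single strided transfer
// instead of element-wise copies.
void copy_box_to_gpu(float* d_buf, const float* h_arr, int array_width,
                     const Box& b)
{
    size_t row_bytes = (b.hi[1] - b.lo[1]) * sizeof(float);
    cudaMemcpy2D(d_buf, row_bytes,                          // packed on device
                 h_arr + b.lo[0] * array_width + b.lo[1],   // box origin on host
                 array_width * sizeof(float),               // host row pitch
                 row_bytes, b.hi[0] - b.lo[0],              // width, height
                 cudaMemcpyHostToDevice);
}
```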
790

Runtime specialization for heterogeneous CPU-GPU platforms

Farooqui, Naila 27 May 2016 (has links)
Heterogeneous parallel architectures like those comprising CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, several open challenges remain: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources; (ii) the complex architectures and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance; and (iii) as such platforms become prevalent, there is a need to extend their utility from running known, regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and the architectural differences of the underlying devices.
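As a rough illustration of point (d), the host-side sketch below (valid in a .cu file) shows the skeleton of an affinity-aware work-stealing scheme: each device drains its own queue and steals from its peer only when idle, preferring tasks whose affinity favours the thief. The task fields, scoring rule, and class names are assumptions made for illustration; this is not the dissertation's runtime.

```cuda
// Hedged sketch of affinity-aware work stealing between a CPU worker
// pool and a GPU dispatcher. Simplified with a mutex-guarded deque.
#include <deque>
#include <mutex>
#include <optional>

struct Task {
    int   id;
    float gpu_affinity;   // 1.0 = strongly prefers the GPU, 0.0 = the CPU
};

class WorkQueue {
    std::deque<Task> dq_;
    std::mutex m_;
public:
    void push(Task t) { std::lock_guard<std::mutex> g(m_); dq_.push_back(t); }

    std::optional<Task> pop() {                 // owner takes from the hot end
        std::lock_guard<std::mutex> g(m_);
        if (dq_.empty()) return std::nullopt;
        Task t = dq_.back(); dq_.pop_back();
        return t;
    }

    // A thief takes from the cold end, but only if the task suits it.
    std::optional<Task> steal(bool thief_is_gpu) {
        std::lock_guard<std::mutex> g(m_);
        if (dq_.empty()) return std::nullopt;
        const Task& t = dq_.front();
        bool suits = thief_is_gpu ? t.gpu_affinity >= 0.5f
                                  : t.gpu_affinity <  0.5f;
        if (!suits) return std::nullopt;        // leave ill-suited work alone
        Task out = t; dq_.pop_front();
        return out;
    }
};

// Worker loop step: drain your own queue, then try to steal from the peer.
std::optional<Task> next_task(WorkQueue& mine, WorkQueue& peer, bool is_gpu) {
    if (auto t = mine.pop()) return t;
    return peer.steal(is_gpu);
}
```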
