GPU Parallelization of Astronomical Image Subtraction / GPU-parallelisering av astronomisk bildsubtraction

Arneving, Gustav; Wilhelmsson, Hugo. January 2024.
Astronomical image subtraction is a method for generating a difference image from two images that cover the same area of the sky but were taken at different times, in order to reveal changes over time. Because the images are taken at different times, one of them has to be convolved to match the atmospheric conditions of the other. HOTPANTS is an open-source software package for astronomical image subtraction. The problem is that HOTPANTS is written in serial C and therefore does not scale well with growing image sizes. There have been previous efforts to parallelize HOTPANTS, including P-HOTPANTS and GBAISP; however, these projects are outdated or unavailable, respectively. The latest effort, BACH, is a reimplementation of HOTPANTS in C++ in which the convolution and subtraction steps have been parallelized on a GPU using OpenCL. This thesis project, called X-BACH, is a continuation of BACH that aims to parallelize the remaining parts of the HOTPANTS algorithm using OpenCL. The results show that some parts of the HOTPANTS algorithm, excluding convolution and subtraction, are highly suitable for the GPU, while other parts are not. Some of the parts that are not suitable for the GPU are believed to be highly suitable for CPU parallelization. Overall, running on an external GPU, X-BACH achieves a relative speed of 1 to 2 compared to BACH and a relative speed of 0.8 to 2.5 compared to HOTPANTS. Running on an integrated GPU, X-BACH achieves a relative speed of 0.5 to 1.2 compared to BACH and a relative speed of 0.3 to 2 compared to HOTPANTS. Some parts of the algorithm achieve a speedup of up to 10 times when parallelized on a GPU. In terms of accuracy, X-BACH generally obtains a maximum relative error with an order of magnitude between 10⁻⁷ and 10⁻¹. However, on certain test images the algorithm has been observed to be unstable.
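To make the convolve-and-subtract step concrete, the sketch below shows a minimal OpenCL kernel of the kind BACH and X-BACH run on the GPU: each work-item produces one pixel of the difference image by convolving the template image with a PSF-matching kernel and subtracting the science image. This is an illustrative sketch only, not the BACH or X-BACH code; the kernel name, the fixed matching-kernel width KW, and the clamped border handling are assumptions.

    // Minimal OpenCL C sketch of a convolve-and-subtract kernel (illustrative
    // only; not the BACH/X-BACH implementation). One work-item computes one
    // pixel of the difference image.
    #define KW 5  // assumed width of the square PSF-matching kernel (odd)

    __kernel void convolve_subtract(__global const float *templateImg, // image to convolve
                                    __global const float *scienceImg,  // image to subtract against
                                    __global const float *matchKernel, // KW x KW matching kernel
                                    __global float *diffImg,           // output difference image
                                    const int width,
                                    const int height)
    {
        const int x = (int)get_global_id(0);
        const int y = (int)get_global_id(1);
        if (x >= width || y >= height)
            return;

        const int half = KW / 2;
        float conv = 0.0f;

        // Convolve the template with the matching kernel at pixel (x, y),
        // clamping reads at the image border.
        for (int ky = -half; ky <= half; ++ky) {
            for (int kx = -half; kx <= half; ++kx) {
                const int ix = clamp(x + kx, 0, width - 1);
                const int iy = clamp(y + ky, 0, height - 1);
                conv += templateImg[iy * width + ix] *
                        matchKernel[(ky + half) * KW + (kx + half)];
            }
        }

        // Difference image: convolved template minus science image.
        diffImg[y * width + x] = conv - scienceImg[y * width + x];
    }

The serial HOTPANTS version of this step is essentially the same two inner loops wrapped in an outer loop over all pixels; on the GPU that outer loop is replaced by the two-dimensional grid of work-items.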

Toward Highly-efficient GPU-centric Networking / Mot Högeffektiva GPU-centrerade Nätverk

Girondi, Massimo. January 2024.
Graphics Processing Units (GPUs) are emerging as the most popular accelerators for many applications, powering the core of machine learning applications and many compute-intensive workloads. GPUs have typically been considered accelerators, with Central Processing Units (CPUs) in charge of the main application logic, data movement, and network connectivity. In these architectures, the input and output data of network-based GPU-accelerated applications typically traverse the CPU and the operating system network stack multiple times, getting copied across the system's main memory. These copies increase application latency and consume expensive CPU cycles, reducing the power efficiency of systems and increasing overall response times. The inefficiencies become even more significant in latency-bound deployments or at high throughput, where copy times can easily inflate the response time of modern GPUs. The main contribution of this dissertation is a GPU-centric network architecture that allows GPUs to initiate network transfers without the intervention of CPUs. We focus on commodity hardware, using NVIDIA GPUs and Remote Direct Memory Access over Converged Ethernet (RoCE) to realize this architecture, removing the need for the highly homogeneous clusters and ad-hoc network designs that many other similar approaches require. By porting some rdma-core posting routines to the GPU runtime, we can saturate a 100-Gbps link without spending any CPU cycles, reducing the overall system response time while increasing power efficiency and improving application throughput. The second contribution is an analysis of Clockwork, a state-of-the-art inference serving system, showing the limitations imposed by controller-centric, CPU-mediated architectures. We then propose an alternative architecture for this system based on an RDMA transport, and we study the performance gains such a system would introduce. An integral component of an inference system is to account for and track user flows and distribute them across multiple worker nodes. Our third contribution aims to understand the challenges of connection-tracking applications running at 100 Gbps, in the context of a stateful load balancer running on commodity hardware.
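As a point of reference for the posting routines mentioned above, the following C sketch shows how a single RDMA WRITE work request is posted with the standard rdma-core (libibverbs) CPU-side API; the dissertation's contribution is to move this kind of posting into the GPU runtime, which is not shown here. The function name is illustrative, and the queue pair, memory region, and remote address/rkey are assumed to have been set up by the usual (omitted) connection-establishment code.

    /* Hedged sketch: posting one RDMA WRITE work request with the standard
     * rdma-core (libibverbs) CPU-side API. Illustrative only; the dissertation
     * ports this style of posting routine to the GPU runtime. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdio.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        /* Scatter-gather entry describing the local source buffer. */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        /* Work request: RDMA WRITE to the peer's registered memory. */
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,  /* request a completion entry */
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr = NULL;
        int ret = ibv_post_send(qp, &wr, &bad_wr);
        if (ret)
            fprintf(stderr, "ibv_post_send failed: %d\n", ret);
        return ret;
    }

On the CPU this call hands the work request to the NIC, which then performs the transfer without further CPU involvement; moving the posting itself onto the GPU is what removes the remaining CPU cycles from the data path.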
