Global ETD Search

81	Toward Highly-efficient GPU-centric Networking / Mot Högeffektiva GPU-centrerade Nätverk Girondi, Massimo January 2024 Graphics Processing Units (GPUs) are emerging as the most popular accelerator for many applications, powering the core of Machine Learning applications and many computing-intensive workloads. GPUs have typically been considered as accelerators, with Central Processing Units (CPUs) in charge of the main application logic, data movement, and network connectivity. In these architectures, the input and output data of network-based GPU-accelerated applications typically traverse the CPU and the Operating System network stack multiple times, getting copied across the system main memory. These copies increase application latency and require expensive CPU cycles, reducing the power efficiency of systems and increasing the overall response times. These inefficiencies matter more in latency-bound deployments, or at high throughput, where copy times could easily inflate the response time of modern GPUs. The main contribution of this dissertation is towards a GPU-centric network architecture, allowing GPUs to initiate network transfers without the intervention of CPUs. We focus on commodity hardware, using NVIDIA GPUs and Remote Direct Memory Access over Converged Ethernet (RoCE) to realize this architecture, removing the need for highly homogeneous clusters and ad-hoc designed network architectures, as required by many other similar approaches. By porting some rdma-core posting routines to the GPU runtime, we can saturate a 100-Gbps link without spending any CPU cycles, reducing the overall system response time while increasing the power efficiency and improving the application throughput. The second contribution concerns the analysis of Clockwork, a state-of-the-art inference serving system, showing the limitations imposed by controller-centric, CPU-mediated architectures. 
We then propose an alternative architecture for this system based on an RDMA transport, and we study some of the performance gains that such a system would introduce. An integral component of an inference system is to account for and track user flows, and to distribute them across multiple worker nodes. Our third contribution aims to understand the challenges of Connection Tracking applications running at 100 Gbps, in the context of a Stateful Load Balancer running on commodity hardware. / QC 20240315 Low-Latency Internet Services; Packet Processing; Network Functions Virtualization; Middle Boxes; Commodity Hardware; Multi-Hundred-Gigabit-Per-Second; Low-Level Optimization; Graphics Processing Units; Inference Serving; Remote Direct Memory Access; Internettjänster med Låg Fördröjning; Paketbearbetning; Virtualisering av Nätverksfunktioner; Mellanutrustning; Tillgänglig Datorhårdvara; Flera-Hundra-Gigabit-Per-Sekund; Lågnivå-Optimering; Grafikprocessor; Inferensserving; Remote Direct Memory Access; Communication Systems; Kommunikationssystem; Computer Systems; Datorsystem
82	High performance lattice Boltzmann solvers on massively parallel architectures with applications to building aeraulics Obrecht, Christian 11 December 2012 With the advent of low-energy buildings, the need for accurate building performance simulations has significantly increased. However, for the time being, the thermo-aeraulic effects are often taken into account through simplified or even empirical models, which fail to provide the expected accuracy. Resorting to computational fluid dynamics therefore seems unavoidable, but the required computational effort is in general prohibitive. The joint use of innovative approaches such as the lattice Boltzmann method (LBM) and massively parallel computing devices such as graphics processing units (GPUs) could help to overcome these limits. The present research work is devoted to exploring the potential of such a strategy. The lattice Boltzmann method, which is based on a discretised version of the Boltzmann equation, is an explicit approach offering numerous attractive features: accuracy, stability, ability to handle complex geometries, etc. It is therefore an interesting alternative to directly solving the Navier-Stokes equations using classic numerical analysis. From an algorithmic standpoint, the LBM is well-suited for parallel implementations. The use of graphics processors to perform general purpose computations is increasingly widespread in high performance computing. These massively parallel circuits provide so far unrivalled performance at a rather moderate cost. Yet, due to numerous hardware-induced constraints, GPU programming is quite complex, and the possible benefits in performance depend strongly on the algorithmic nature of the targeted application. For LBM, GPU implementations currently provide performance two orders of magnitude higher than a weakly optimised sequential CPU implementation. 
The present thesis consists of a collection of nine articles published in international journals and proceedings of international conferences (the last one being under review). These contributions address the issues related to single-GPU implementations of the LBM and the optimisation of memory accesses, as well as multi-GPU implementations and the modelling of inter-GPU and internode communication. In addition, we outline several extensions to the LBM, which appear essential to perform actual building thermo-aeraulic simulations. The test cases we used to validate our codes attest to the strong potential of GPU LBM solvers in practice. High performance computing; Lattice Boltzmann method; Graphics processing units; Building aeraulics
