Global ETD Search

1	Designing RDMA-based efficient Communication for GPU Remoting Bhandare, Shreya Amit 24 August 2023 The use of General Purpose Graphics Processing Units (GPGPUs) has become crucial for accelerating high-performance applications. However, the procurement, setup, and maintenance of GPUs can be costly, and their continuous energy consumption poses additional challenges. Moreover, many applications exhibit suboptimal GPU utilization. To address these concerns, GPU virtualization techniques have been proposed. Among them, GPU Remoting stands out as a promising technology that enables applications to transparently harness the computational capabilities of GPUs remotely. GVirtuS, a GPU Remoting software, facilitates transparent and hypervisor-independent access to GPGPUs within virtual machines. This research focuses on the middleware communication layer implemented in GVirtuS and presents a comprehensive redesign that leverages the power of Remote Direct Memory Access (RDMA) technology. Experimental evaluations, conducted using a matrix multiplication application, demonstrate that the newly proposed protocol achieves approximately 50% reduced execution time for data sizes ranging from 1 to 16MB, and around 12% decreased execution time for sizes ranging from 500MB up to 1GB. These findings highlight the significant performance improvements attained through the redesign of the communication layer in GVirtuS, showcasing its potential for enhancing GPU Remoting efficiency. / Master of Science / General Purpose Graphics Processing Units (GPGPUs) have become essential tools for accelerating high-performance applications. However, the acquisition and maintenance of GPUs can be expensive, and their continuous energy consumption adds to the overall costs. Additionally, many applications often underutilize the full potential of GPUs. To tackle these challenges, researchers have proposed GPU virtualization techniques. 
One such promising approach is GPU Remoting, which enables applications to seamlessly utilize GPUs remotely. GVirtuS, a GPU Remoting software, allows virtual machines to access GPGPUs transparently and independently of the underlying system. This study focuses on enhancing the communication layer in GVirtuS, which facilitates efficient interaction between virtual machines and GPUs. By leveraging an advanced technology called Remote Direct Memory Access (RDMA), we achieved significant improvements in performance. Evaluations using a matrix multiplication application showed a reduction of approximately 50% in execution time for small data sizes (1-16MB) and around 12% for larger sizes (500-800MB). These findings highlight the potential of our redesign to enhance GPU virtualization, leading to better performance and cost-efficiency in various applications. GPGPU Virtualization RDMA CUDA
2	High-Performance Network Data Transfers to GPU : A Study of Nvidia GPU Direct RDMA and GPUNetIO Gao, Yuchen January 2023 This study investigates high-performance network data transfers, focusing on Nvidia Graphics Processing Unit (GPU) Direct Remote Direct Memory Access (RDMA) and GPUNetIO. These methods have emerged as promising strategies for improving data communication between GPUs and network interfaces, but harnessing their potential requires meticulous configuration and optimization. This research aims to clarify those architectures and achieve optimal performance in this context. The study begins by analyzing the source code for both architectures, explaining their underlying principles and how they improve on the previous structures. A user-oriented testing tool is also developed to provide users with a simplified interface for conducting tests and system configuration requirements. The research methodology consists of reviewing the literature and analyzing the source code of GPUDirect RDMA and GPUNetIO. Additionally, experiments are designed to evaluate various performance aspects, ranging from Central Processing Unit (CPU)-related factors to GPU metrics and network card performance. The results indicate a significant acceleration in data copying when based on GPUDirect RDMA technology. The introduction of GPUNetIO leads to a substantial decrease in CPU utilization. Furthermore, the user interface is designed for simple deployment on hosts and easy access by users. The interface is equipped with the recommended configuration settings. / This study investigates high-performance network data transfers with a focus on Nvidia GPU Direct RDMA and GPUNetIO. GPU Direct RDMA has proven to be a promising method for improving data communication between GPUs and network interfaces, but harnessing its potential requires careful configuration and optimization. 
This research aims to clarify the complexity and difficulties of achieving optimal performance in this context. The study begins with an analysis of the source code for both architectures, explaining their underlying principles and what they have improved compared with the previous structures. In addition, a user-oriented testing tool is developed that aims to give users a simplified interface for conducting tests. The research method consists of a review of the literature and an analysis of the source code of GPUDirect RDMA and GPUNetIO. In addition, a set of experiments has been designed to evaluate various performance aspects, ranging from CPU-related factors to GPU metrics and network card performance. The results indicate a significant acceleration of data copying when it is based on GPUDirect RDMA technology. The introduction of GPUNetIO leads to a significant reduction in CPU utilization. Furthermore, the user interface is designed for simple deployment on hosts and easy access for users. The interface is equipped with recommended configuration settings. GPUDirect RDMA GPUNetIO User interface Computer and Information Sciences
3	Profile, Monitor, and Introspect Spark Jobs Using OSU INAM Kedia, Mansa January 2020 No description available. Computer Science HPC OSU INAM Apache Spark RDMA-Spark
4	Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems Biswas, Rajarshi 12 October 2018 No description available. Computer Engineering Computer Science Deep Learning TensorFlow gRPC HPC RDMA
5	Scaling RDMA RPCs with FLOCK Monga, Sumit Kumar 30 November 2021 RDMA-capable networks are gaining traction with datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features, such as remote memory operations. However, efficiently utilizing RDMA capability in a common setting of high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, which does not scale with increasing cluster size. To address that, several works forgo some RDMA features by only focusing on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA, while scaling the number of connections regardless of the cluster size. We present FLOCK, a communication framework for RDMA networks that uses hardware-provided reliable connection. Using a partially shared model, FLOCK departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements contrary to the widely held belief that connection sharing deteriorates performance. At its core, FLOCK uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. / M.S. / The Internet is one of the great discoveries of our time. It provides access to enormous knowledge sources and makes it easier to communicate seamlessly across the globe, along with countless other advantages. Over the years, the latency of services like web searches and file downloads has gone down sharply. A download that used to take minutes during the 2000s can complete within seconds in present times. 
Network speeds have been improving, facilitating a faster and smoother user experience. Another factor contributing to the improved internet experience is that service providers like Google, Amazon, and others can process user requests in a fraction of the time it used to take. Web services such as search and e-commerce are implemented using a multi-layer architecture, with each layer containing hundreds to thousands of servers. Each server runs one or more components of the web service application. In this architecture, user requests are received in the upper layer and processed by the lower layers. Servers in different layers communicate over an ultrafast network like Remote Direct Memory Access (RDMA). The implication of the multi-layer architecture is that a server has to communicate with multiple other servers in the upper and lower layers. Unfortunately, due to its inherent limitations, RDMA does not perform well when network communication takes place with a large number of servers. In this thesis, a new communication framework for RDMA networks, FLOCK, is proposed to overcome the scalability limitations of RDMA hardware. FLOCK maintains scalability when communicating with many servers, and it consistently provides better performance compared to the state of the art. Additionally, FLOCK utilizes the network bandwidth efficiently and reduces the CPU overheads incurred due to network communication. Datacenter networking Remote Direct Memory Access (RDMA) Scalability
6	Designing Scalable Storage Systems for Non-Volatile Memory Gugnani, Shashank January 2020 No description available. Computer Science
7	Adapting Remote Direct Memory Access Based File System to Parallel Input-/Output Velusamy, Vijay 13 December 2003 Traditional file access interfaces rely on ubiquitous transports that impose severe restrictions on performance and prove insufficient for adaptation to parallel Input/Output (I/O). Remote Direct Memory Access based (RDMA-based) approaches are aimed at moving data between different process address spaces with streamlined mediation and reduced involvement of the operating system, using synchronization semantics that are different from ubiquitous transports. This thesis studies the adaptability of RDMA-based transports to parallel I/O. Combining RDMA semantics with parallel I/O leads to overhead reduction by overlapping communication and computation and by bandwidth enhancement. Although parallel I/O tends to increase latency in certain cases, the use of RDMA techniques mitigates this effect. DAFS RDMA MercutIO Parallel I/O MPI-IO MPI-2
8	RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system Bhat, Adithya January 2015 No description available. Computer Science HDFS Hadoop RDMA InfiniBand Apache HDP CDH
9	Enhancing MPI with modern networking mechanisms in cluster interconnects Yu, Weikuan 12 September 2006 No description available. Computer Science InfiniBand Myrinet Quadrics MPI Parallel IO RDMA
10	Evaluating and Improving the Performance of MPI-Allreduce on QLogic HTX/PCIe InfiniBand HCA Mittenzwey, Nico 30 June 2009 This thesis analysed the QLogic InfiniPath QLE7140 HCA and its onload architecture and compared the results to the Mellanox InfiniHost III Lx HCA, which uses an offload architecture. As expected, the QLogic InfiniPath QLE7140 HCA can outperform the Mellanox InfiniHost III Lx HCA in latency and bandwidth terms on our test system in various test scenarios. The benchmarks showed that sending messages with multiple threads in parallel can increase the bandwidth greatly, while bi-directional sends cut the effective bandwidth for one HCA by up to 30%. Different all-reduce algorithms were evaluated and compared with the help of the LogGP model. The comparison showed that new all-reduce algorithms can outperform the ones already implemented in Open MPI for different scenarios. The thesis also demonstrated that one can implement multicast algorithms for InfiniBand easily by using the RDMA-CM API. InfiniBand MPI_Allreduce Network OFED Open MPI PSM RDMA-CM ddc:004 High-performance computing Parallel computer
