CRIU-RTX: Remote Thread eXecution using Checkpoint/Restore in Userspace

Scaling up application performance on a single high-end machine is becoming increasingly difficult due to the scalability challenges of processor interconnects, cache coherence protocols, and memory bandwidth. Significant prior work has addressed this problem by scaling out application threads across multiple nodes to exploit resources beyond the single-machine boundary. Prior work has also leveraged heterogeneous instruction set architecture (ISA) systems, augmenting high-end servers with power-efficient embedded boards to improve both application performance and energy efficiency, a major cost driver in datacenters. Existing approaches, however, are hard to deploy because they depend on operating system modifications or on programming models that require non-trivial application changes. We introduce CRIU-RTX, a userspace framework to scale out multi-threaded applications across multiple nodes. Integrated with HetMigrate, a prior work on migrating processes across heterogeneous-ISA systems, CRIU-RTX can suspend a subset of threads in a process and resume their execution on different nodes, including, but not limited to, heterogeneous-ISA nodes. CRIU-RTX implements distributed shared memory in userspace, allowing application threads to access distributed memory transparently without any operating system dependency. Our experimental evaluations show performance gains of 21% to 43% when scaling out applications across x86-64 servers, and energy-efficiency gains of up to 18% when scaling out across a cluster of an x86-64 server and ARM64 embedded boards. Since CRIU-RTX does not depend on operating system modifications, it can be easily deployed on a diverse set of machines, including, but not limited to, machines with different ISAs running the stock Linux operating system.

/ Master of Science / Gordon Moore predicted that the number of transistors on a chip would double every two years, an observation commonly referred to as "Moore's Law". However, this law no longer holds true, leading to a shift in computer research and development. To meet the increasing demand for faster and cheaper servers, researchers began exploring alternative computer designs. Data centers have started adopting servers with diverse architectures to improve the cost-to-performance ratio, resulting in heterogeneous environments. Distributed execution refers to running computational tasks or executing software across multiple interconnected systems or nodes. Instead of relying on a single machine or processor, the workload is distributed among a network of computers, allowing for parallel processing and improved performance. Prior work in this direction was difficult to adopt because it required customized hardware or operating systems. This thesis introduces CRIU-RTX, a userspace framework to scale out application threads without operating system dependency. We implemented a distributed shared memory system in userspace so that application threads in a scaled-out execution can access distributed memory as if they were running on the same machine. Our evaluations of CRIU-RTX show significant improvements in performance and energy efficiency.

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/115819
Date: 21 July 2023
Creators: Noor Mohamed, Mohamed Husain
Contributors: Electrical and Computer Engineering, Ravindran, Binoy, Giles, Kendall Everett, Wang, Xiaoguang
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/