1 |
CRIU-RTX: Remote Thread eXecution using Checkpoint/Restore in Userspace
Noor Mohamed, Mohamed Husain, 21 July 2023
Scaling up application performance on a single high-end machine is becoming increasingly difficult due to the scalability challenges of processor interconnects, cache coherence protocols, and memory bandwidth. Significant prior work has addressed this problem by scaling out application threads across multiple nodes to exploit resources outside the single-machine boundary. Prior work has also leveraged heterogeneous instruction set architecture (ISA) systems to improve application performance as well as energy efficiency, a major cost driver in datacenters, by augmenting high-end servers with power-efficient embedded boards. Existing approaches, however, suffer from deployability challenges because they depend on the operating system or on programming models that require non-trivial application modifications. We introduce CRIU-RTX, a userspace framework for scaling out multi-threaded applications across multiple nodes. Integrated with HetMigrate, a prior work on migrating processes across heterogeneous-ISA systems, CRIU-RTX can suspend a subset of threads in a process and resume their execution on different nodes, including, but not limited to, heterogeneous-ISA nodes. CRIU-RTX implements distributed shared memory in userspace, thereby allowing application threads to access distributed memory transparently without any operating system dependency. Our experimental evaluations show performance gains of 21% to 43% when scaling out applications across x86-64 servers, and energy-efficiency gains of up to 18% when scaling out across a cluster of an x86-64 server and ARM64 embedded boards. Since CRIU-RTX does not depend on operating system modifications, it can be easily deployed on a diverse set of machines, including, but not limited to, ISA-different machines running the stock Linux operating system. / Master of Science / Gordon Moore predicted that the number of transistors on a chip would double every two years, a prediction commonly referred to as "Moore's Law".
However, this law no longer holds true, leading to a shift in computer research and development. To meet the increasing demand for faster and cheaper servers, researchers began exploring alternative computer designs. Data centers have started adopting servers with diverse architectures to improve the cost-to-performance ratio, resulting in heterogeneous environments. Distributed execution refers to running computational tasks or executing software across multiple interconnected systems or nodes. Instead of relying on a single machine or processor, the workload is distributed among a network of computers, allowing for parallel processing and improved performance. Prior work in this direction has been difficult to adopt because it requires customized hardware or a customized operating system. This thesis introduces CRIU-RTX, a userspace framework for scaling out application threads without operating system dependency. We implemented a distributed shared memory system in userspace to allow application threads running in scaled-out execution to access distributed memory as if they were running on the same machine. Our evaluations of CRIU-RTX show significant improvements in performance and energy efficiency.
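The abstract does not describe how CRIU-RTX's distributed shared memory works internally, so the following is only a toy illustration of the general idea, not the thesis's implementation. It assumes a single-owner, page-granularity protocol in which a page migrates to whichever node touches it; the names `Node`, `ToyDSM`, and `PAGE_SIZE` are hypothetical.

```python
# Toy model of page-based distributed shared memory (DSM).
# Assumption: each page has exactly one owner at a time and is
# migrated to the accessing node on a "fault". This is NOT the
# CRIU-RTX protocol, which the abstract does not specify.

class Node:
    """A machine holding some subset of the shared pages."""
    def __init__(self, name):
        self.name = name
        self.pages = {}  # page number -> bytearray with the page contents


class ToyDSM:
    PAGE_SIZE = 4096  # hypothetical page size in bytes

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.owner = {}     # page number -> name of the owning node
        self.transfers = 0  # number of pages moved between nodes

    def _fault_in(self, node_name, page):
        """Make sure `page` is resident on `node_name`, migrating it if needed."""
        node = self.nodes[node_name]
        current = self.owner.get(page)
        if current is None:
            # First touch: allocate a zero-filled page locally.
            node.pages[page] = bytearray(self.PAGE_SIZE)
        elif current != node_name:
            # Remote page: move it from the current owner to this node.
            node.pages[page] = self.nodes[current].pages.pop(page)
            self.transfers += 1
        self.owner[page] = node_name
        return node.pages[page]

    def write(self, node_name, addr, value):
        page, offset = divmod(addr, self.PAGE_SIZE)
        self._fault_in(node_name, page)[offset] = value

    def read(self, node_name, addr):
        page, offset = divmod(addr, self.PAGE_SIZE)
        return self._fault_in(node_name, page)[offset]


dsm = ToyDSM([Node("x86"), Node("arm")])
dsm.write("x86", 5000, 7)          # page 1 is first touched on "x86"
value = dsm.read("arm", 5000)      # page 1 migrates to "arm"
```

In this sketch a thread on either node uses the same address to reach the same byte, which is the transparency property the abstract refers to; a real system would also need to handle concurrent access and read sharing.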
|
2 |
A Generalized Framework for Automatic Code Partitioning and Generation in Distributed Systems
Sairaman, Viswanath, 05 February 2010
In distributed heterogeneous systems, partitioning application software so that it can be executed in a distributed fashion is a challenge in itself. Code partitioning for distributed processing involves dividing the code into clusters and mapping those code clusters to individual processing elements interconnected through a high-speed network. Code generation is the process of converting the code partitions into individually executable code clusters and satisfying the code dependencies by adding communication primitives that send and receive data between dependent code clusters. In this work, we describe a generalized framework for automatic code partitioning and code generation for distributed heterogeneous systems. A model for system-level design and synthesis using transaction-level models has also been developed and is presented. The application programs, along with the partition primitives, are converted into independently executable concrete implementations. The process consists of two steps: first, translating the primitives of the application program into equivalent code clusters, and then scheduling the implementations of these code clusters according to the inherent data dependencies. Further, the original source code needs to be reverse engineered in order to create a metadata table describing the program elements and dependency trees. The gathered data is used along with Parallel Virtual Machine (PVM) primitives to enable communication between the partitioned programs in the distributed environment. The framework consists of profiling tools, a partitioning methodology, and architectural exploration and cost analysis tools. The partitioning algorithm is based on clustering, in which the code clusters are created to minimize communication overhead, represented as data transfers in the task graph of the code.
The proposed approach has been implemented and tested on different applications and compared with partitioning algorithms based on simulated annealing and tabu search. The objective of partitioning is to minimize the communication overhead. In terms of communication overhead reduction, the proposed approach performs comparably with simulated annealing and better than tabu search in most cases, and simulation results indicate that it is faster than both by an order of magnitude. The proposed framework for system-level design and synthesis provides an end-to-end rapid prototyping approach that aids architectural exploration and design optimization. The level of abstraction in the design phase can be fine-tuned using transaction-level models.
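As a rough illustration of clustering a task graph to reduce communication overhead, the sketch below greedily merges the tasks connected by the heaviest data transfers, subject to a cluster size limit. This is a generic heuristic chosen for illustration and is not claimed to be the algorithm from the thesis; the function names, the `max_cluster_size` parameter, and the example graph are all assumptions.

```python
# Greedy edge-merging clustering on a weighted task graph.
# Assumption: edge weights are data-transfer volumes, and the cost of a
# partition is the total weight of edges that cross cluster boundaries.
# This is an illustrative heuristic, not the thesis's algorithm.
from typing import Dict, List, Tuple

Edges = Dict[Tuple[int, int], int]


def cluster_tasks(num_tasks: int, edges: Edges, max_cluster_size: int) -> List[int]:
    """Return a cluster id for each task, merging heaviest edges first."""
    parent = list(range(num_tasks))  # union-find forest
    size = [1] * num_tasks           # cluster size, valid at each root

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Visit edges from largest to smallest transfer volume.
    for (u, v), _weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_cluster_size:
            parent[rv] = ru
            size[ru] += size[rv]

    return [find(t) for t in range(num_tasks)]


def communication_cost(assignment: List[int], edges: Edges) -> int:
    """Sum the weights of edges whose endpoints sit in different clusters."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])


# Hypothetical four-task graph: (producer, consumer) -> units of data moved.
edges = {(0, 1): 10, (1, 2): 1, (2, 3): 8, (0, 3): 2}
assignment = cluster_tasks(4, edges, max_cluster_size=2)
cost = communication_cost(assignment, edges)
```

With the example graph, tasks 0 and 1 end up together, as do tasks 2 and 3, so only the two light edges cross cluster boundaries. Each resulting cluster would then be mapped to one processing element.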
|