Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Homogeneous and Heterogeneous Large-Scale Systems

Out-of-core simulation systems often produce a massive amount of data that cannot
fit in the aggregate fast memory of the compute nodes, and they also need to
read these data back for computation. As a result, I/O data movement can be a
bottleneck in large-scale simulations. Advances in memory architecture have made
it feasible and affordable to integrate hierarchical storage media on large-scale systems,
from the traditional Parallel File Systems (PFSs), through intermediate fast
disk technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst
Buffers), up to CPU main memory and GPU High Bandwidth Memory (HBM).
However, while adding faster storage media increases I/O bandwidth, it also
pressures the CPU, which becomes responsible for managing and moving data between
these layers of storage. Simulation systems are thus vulnerable to being blocked
by I/O operations. The Multilayer Buffer System (MLBS) proposed in this research
demonstrates a general and versatile method for overlapping I/O with computation
that helps to ameliorate the strain on the processors through asynchronous access.
The main idea is to decouple I/O operations from computational phases, using
dedicated hardware resources to perform the expensive context switches. MLBS monitors
I/O traffic in each storage layer, allowing fair utilization of shared resources. By
continually prefetching up and down across all hardware layers of the memory and
storage subsystems, MLBS transforms the originally I/O-bound behavior of the evaluated
applications and shifts it closer to a memory-bound or compute-bound regime. An evaluation
on the Cray XC40 Shaheen-2 supercomputer with a representative I/O-bound
application, seismic inversion, shows that MLBS outperforms state-of-the-art
I/O systems, i.e., Lustre, Data Elevator, and DataWarp, by 6.06X, 2.23X, and 1.90X, respectively.
On the IBM-built Summit supercomputer, using 2048 compute nodes equipped
with a total of 12,288 GPUs, MLBS achieves up to a 1.4X speedup compared
to the reference PFS-based implementation. MLBS is also demonstrated on
applications from cosmology and combustion, and on classic out-of-core computational
physics and linear algebra routines.
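
As a rough illustration of the decoupling idea described above, consider the minimal
Python sketch below. This is not the MLBS implementation (which spans multiple hardware
storage layers, monitors per-layer traffic, and prefetches in both directions); it only
shows the basic pattern of giving a dedicated I/O thread ownership of reads and staging
blocks through a bounded buffer so the compute loop rarely blocks. The names read_block,
compute, and depth are hypothetical stand-ins, not identifiers from the dissertation.

    import queue
    import threading
    import time

    def prefetcher(read_block, num_blocks, buf):
        # Dedicated I/O thread: stream blocks from slow storage into the buffer.
        # buf.put() blocks when the buffer is full, giving natural backpressure.
        for i in range(num_blocks):
            buf.put(read_block(i))
        buf.put(None)  # sentinel: no more blocks

    def run(read_block, compute, num_blocks, depth=4):
        buf = queue.Queue(maxsize=depth)  # bounded staging buffer between layers
        io_thread = threading.Thread(
            target=prefetcher, args=(read_block, num_blocks, buf))
        io_thread.start()
        while True:
            block = buf.get()  # usually ready: reads have run ahead of compute
            if block is None:
                break
            compute(block)  # overlaps with the prefetch of the next block
        io_thread.join()

    if __name__ == "__main__":
        def read_block(i):  # stand-in for a slow-storage read
            time.sleep(0.01)
            return [i] * 1000
        def compute(block):  # stand-in for a compute phase
            sum(block)
        run(read_block, compute, num_blocks=32)

MLBS generalizes this pattern by chaining such staging buffers across every layer of
the storage hierarchy and, as the abstract notes, by prefetching both up and down
across those layers rather than in a single read direction.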

Identifier: oai:union.ndltd.org:kaust.edu.sa/oai:repository.kaust.edu.sa:10754/665396
Date: 30 September 2020
Creators: Alturkestani, Tariq
Contributors: Keyes, David E., Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Shihada, Basem, Moshkov, Mikhail, Sun, Xian-He
Source Sets: King Abdullah University of Science and Technology
Language: English
Detected Language: English
Type: Dissertation
Rights: 2021-10-01, At the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation will become available to the public after the expiration of the embargo on 2021-10-01.
