
Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Homogeneous and Heterogeneous Large-Scale Systems

Alturkestani, Tariq, 30 September 2020
Out-of-core simulation systems often produce a massive amount of data that cannot fit in the aggregate fast memory of the compute nodes, yet they also need to read these data back for computation. As a result, I/O data movement can be a bottleneck in large-scale simulations. Advances in memory architecture have made it feasible and affordable to integrate hierarchical storage media on large-scale systems, ranging from the traditional Parallel File Systems (PFSs), through intermediate fast disk technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst Buffers), up to CPU main memory and GPU High Bandwidth Memory (HBM). However, while adding more and faster storage media increases I/O bandwidth, it puts pressure on the CPU, which becomes responsible for managing and moving data between these storage layers. Simulation systems are thus vulnerable to being blocked by I/O operations. The Multilayer Buffer System (MLBS) proposed in this research demonstrates a general and versatile method for overlapping I/O with computation that helps ameliorate the strain on the processors through asynchronous access. The main idea is to decouple I/O operations from computational phases by using dedicated hardware resources to perform the expensive context switches. MLBS monitors I/O traffic in each storage layer, allowing fair utilization of shared resources. By continually prefetching up and down across all hardware layers of the memory and storage subsystems, MLBS transforms the original I/O-bound behavior of the evaluated applications and shifts it closer to a memory-bound or compute-bound regime. The evaluation on the Cray XC40 Shaheen-2 supercomputer for a representative I/O-bound application, seismic inversion, shows that MLBS outperforms state-of-the-art PFSs, i.e., Lustre, Data Elevator, and DataWarp, by 6.06X, 2.23X, and 1.90X, respectively. On the IBM-built Summit supercomputer, using 2048 compute nodes equipped with a total of 12288 GPUs, MLBS achieves up to 1.4X performance speedup compared to the reference PFS-based implementation. MLBS is also demonstrated on applications from cosmology and combustion, as well as classic out-of-core computational physics and linear algebra routines.
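The abstract describes MLBS's general strategy of overlapping I/O with computation by dedicating resources to asynchronous data movement. The sketch below is a minimal, single-node illustration of that double-buffering idea in C++, not the MLBS implementation itself; the input file snapshot.bin, the block size, and the compute kernel are hypothetical placeholders introduced only for illustration.

#include <cstdio>
#include <future>
#include <vector>

// Read one block of doubles starting at byte 'offset'; returns elements read.
static size_t read_block(std::FILE* f, long offset, std::vector<double>& buf) {
    std::fseek(f, offset, SEEK_SET);
    return std::fread(buf.data(), sizeof(double), buf.size(), f);
}

// Placeholder compute kernel standing in for one simulation phase.
static double compute(const std::vector<double>& buf, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) acc += buf[i];
    return acc;
}

int main() {
    const size_t block_elems = 1 << 20;               // 8 MiB blocks (assumed)
    std::FILE* f = std::fopen("snapshot.bin", "rb");  // hypothetical input file
    if (!f) return 1;

    std::vector<double> bufs[2] = {std::vector<double>(block_elems),
                                   std::vector<double>(block_elems)};
    double total = 0.0;
    long offset = 0;

    // Prime the pipeline: issue the first read on a background thread.
    auto pending = std::async(std::launch::async, read_block, f, offset,
                              std::ref(bufs[0]));

    for (int cur = 0;; cur ^= 1) {
        size_t got = pending.get();                   // wait for the in-flight read
        if (got == 0) break;                          // end of file
        offset += static_cast<long>(got * sizeof(double));

        // Start the next read into the other buffer, then compute on this one,
        // so the file read and the computation overlap.
        pending = std::async(std::launch::async, read_block, f, offset,
                             std::ref(bufs[cur ^ 1]));
        total += compute(bufs[cur], got);
    }

    std::printf("checksum: %f\n", total);
    std::fclose(f);
    return 0;
}

A full multilayer buffer system along the lines the abstract describes would generalize this pattern across the PFS, burst buffer, main memory, and GPU HBM tiers, with dedicated resources prefetching up and down between layers rather than a single background read thread.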
