Modern High Performance Computing (HPC) systems are made up of thousands of server-grade compute nodes linked through a high-speed network interconnect. Each node has tens or even hundreds of CPU cores each, with counts continuing to grow on newer HPC clusters. This results in a need to make use of millions of cores per cluster. Fully leveraging these resources is difficult. There is an active need to design software that scales and fully utilizes the hardware. In this dissertation, we address this gap with a dual approach, considering both intra-node (single node) and inter-node (across node) concerns. To aid in intra-node performance, we propose two novel concurrent data structures: a transactional vector and a persistent hash map. These designs have broad applicability in any multi-core environment but are particularly useful in HPC, which commonly features many cores per node. For inter-node performance, we propose a metrics-driven approach to improve scheduling quality, using predicted run times to backfill jobs more accurately and aggressively. This is augmented using application input parameters to further improve these run time predictions. Improved scheduling reduces the number of idle nodes in an HPC cluster, maximizing job throughput. We find that our data structures outperform the prior state-of-the-art while offering additional features. Our backfill technique likewise outperforms previous approaches in simulations, and our run time predictions were significantly more accurate than conventional approaches. Code for these works is freely available, and we have plans to deploy these techniques more broadly on real HPC systems in the future.
Identifer | oai:union.ndltd.org:ucf.edu/oai:stars.library.ucf.edu:etd2023-1498 |
Date | 01 January 2024 |
Creators | Lamar, Kenneth M |
Publisher | STARS |
Source Sets | University of Central Florida |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Graduate Thesis and Dissertation 2023-2024 |
Page generated in 0.0021 seconds