Machine learning-based performance analytics for high-performance computing systems

High-performance computing (HPC) systems play a pivotal role in societal and scientific advancement, executing up to quintillions of calculations every second. As we shift towards exascale computing and beyond, modern HPC systems emphasize resource sharing, where various applications share processors, memory, networks, and other components. While this sharing enhances power efficiency, it complicates performance prediction and introduces significant variations in application running times, affecting overall system efficiency and operational costs.

HPC systems use monitoring frameworks that gather numerical telemetry data on resource usage to track operational status. Given the complexity and volume of this data, manual analysis is daunting and inefficient. Machine learning (ML) techniques offer automated performance anomaly diagnosis, but the transition from successful research outcomes to production-scale deployment encounters two critical obstacles. First, the scarcity of labeled training data (i.e., labels identifying healthy and anomalous runs) in telemetry datasets makes it hard to train these ML systems effectively. Second, runtime analysis, which is required for timely detection and diagnosis of performance anomalies, demands seamless integration of ML-based methods with the monitoring frameworks.

This thesis claims that ML-based performance analytics frameworks that leverage a limited amount of labeled data and ensure runtime analysis can achieve sufficient anomaly diagnosis performance for production HPC systems. To support this claim, we undertake ML-based performance analytics on two fronts. First, we design and develop novel frameworks for anomaly diagnosis that leverage semi-supervised or unsupervised learning techniques to reduce the need for extensive labeled data. Second, we design a simple yet adaptable architecture to enable deployment and demonstrate that these frameworks are feasible for runtime analysis.

This thesis makes the following specific contributions. First, we design a semi-supervised anomaly diagnosis framework, Proctor, which operates with hundreds of labeled samples (in contrast to tens of thousands) and a vast number of unlabeled samples. We show that Proctor outperforms the fully supervised baseline by up to 11% in F1-score for diagnosing anomalies when there are approximately 30 labeled samples. We then reframe the problem and introduce ALBADRoss, which uses active learning to determine which samples experts should label to maximize model performance. On a production HPC dataset, ALBADRoss achieves a 0.95 F1-score (the same score that a fully supervised framework achieved) and a near-zero false alarm rate using 24x fewer labeled samples. Finally, with Prodigy, we address the anomaly detection problem with a focus on deployment. Prodigy is designed for detecting performance anomalies on compute nodes using unsupervised learning. Our framework achieves a 0.95 F1-score in detecting anomalies on a production HPC system telemetry dataset. We also design a simple and adaptable software architecture and deploy it on a 1488-node production HPC system, detecting real-world performance anomalies with 88% accuracy.
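For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how an F1-score and a false alarm rate can be computed for binary anomaly labels. It assumes the common definitions (F1 as the harmonic mean of precision and recall, false alarm rate as the fraction of healthy runs flagged as anomalous). The thesis may use a different variant, such as an F1-score averaged over several anomaly types, and the function names here are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels: 1 = anomalous, 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_alarm_rate(y_true, y_pred):
    """Assumed definition: FP / (FP + TN), the share of healthy runs flagged."""
    _, fp, _, tn = confusion_counts(y_true, y_pred)
    denom = fp + tn
    return fp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels, not data from the thesis.
    y_true = [1, 1, 0, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    print(f1_score(y_true, y_pred), false_alarm_rate(y_true, y_pred))
```

Under these definitions, a near-zero false alarm rate means that almost no healthy runs are reported as anomalous, which matters in production because each false alarm costs operator time.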

Identifier: oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/47936
Date: 17 January 2024
Creators: Aksar, Burak
Contributors: Coskun, Ayse K.
Source Sets: Boston University
Language: en_US
Detected Language: English
Type: Thesis/Dissertation
Rights: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of [name of university or educational entity]'s products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of the dissertation.
