Global ETD Search

Machine learning-based performance analytics for high-performance computing systems
Aksar, Burak
17 January 2024

High-performance Computing (HPC) systems play pivotal roles in societal and scientific advancements, executing up to quintillions of calculations every second. As we shift towards exascale computing and beyond, modern HPC systems emphasize resource sharing, where various applications share processors, memory, networks, and other components. While this sharing enhances power efficiency, it complicates performance prediction and introduces significant variations in application running times, affecting overall system efficiency and operational costs.

HPC systems utilize monitoring frameworks that gather numerical telemetry data on resource usage to track operational status. Given the massive complexity and volume of this data, manual analysis is often daunting and inefficient. Machine learning (ML) techniques offer automated performance anomaly diagnosis, but the transition from successful research outcomes to production-scale deployment encounters two critical obstacles. First, the scarcity of labeled training data (i.e., identifying healthy and anomalous runs) in telemetry datasets makes it hard to train these ML systems effectively. Second, runtime analysis, required for providing timely detection and diagnosis of performance anomalies, demands seamless integration of ML-based methods with the monitoring frameworks.

This thesis claims that ML-based performance analytics frameworks that leverage a limited amount of labeled data and ensure runtime analysis can achieve sufficient anomaly diagnosis performance for production HPC systems. To support this claim, we undertake ML-based performance analytics on two fronts. First, we design and develop novel frameworks for anomaly diagnosis that leverage semi-supervised or unsupervised learning techniques to reduce the need for extensive labeled data. Second, we design a simple yet adaptable architecture to enable deployment and demonstrate that these frameworks are feasible for runtime analysis.

This thesis makes the following specific contributions: First, we design a semi-supervised anomaly diagnosis framework, Proctor, which operates with hundreds of labeled samples (in contrast to tens of thousands) and a vast number of unlabeled samples. We show that Proctor outperforms the fully supervised baseline by up to 11% in F1-score for diagnosing anomalies when there are approximately 30 labeled samples. We then reframe the problem and introduce ALBADRoss to determine which samples should be labeled by experts to maximize the model performance using active learning. On a production HPC dataset, ALBADRoss achieves a 0.95 F1-score (the same score that a fully-supervised framework achieved) and near-zero false alarm rate using 24x fewer labeled samples. Finally, with Prodigy, we solve the anomaly detection problem but with a focus on deployment. Prodigy is designed for detecting performance anomalies on compute nodes using unsupervised learning. Our framework achieves a 0.95 F1-score in detecting anomalies on a production HPC system telemetry dataset. We also design a simple and adaptable software architecture and deploy it on a 1488-node production HPC system, detecting real-world performance anomalies with 88% accuracy.

Computer engineering; Anomaly detection; Artificial intelligence; High-performance computing; Large-scale computing systems; Machine learning
