
Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior

As high performance computing (HPC) systems grow in size and complexity, and with the advent of faster and more complex exascale systems, failures have become the norm rather than the exception; protection mechanisms therefore need to improve. Even the de facto mechanisms, such as checkpoint/restart and redundancy, may fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a newer protection approach that benefits HPC systems with a short mean time between failures: it extends the existing protection mechanisms by dynamically adjusting the protection level. This work provides a prototype that analyzes and predicts system behavior using statistical analysis, paving the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and improve the accuracy of failure predictions to up to 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution for protecting future extreme-scale systems against failures. (An illustrative sketch of the frequency-based anomaly detection appears after the table of contents below.)

1 Introduction
1.1 Background and Statement of the Problem
1.2 Purpose and Significance of the Study
1.3 Jam-e Jam: A System Behavior Analyzer
2 Review of the Literature
2.1 Syslog Analysis
2.2 Users and Systems Privacy
2.3 Failure Detection and Prediction
2.3.1 Failure Correlation
2.3.2 Anomaly Detection
2.3.3 Prediction Methods
2.3.4 Prediction Accuracy and Lead Time
3 Data Collection and Preparation
3.1 Taurus HPC Cluster
3.2 Monitoring Data
3.2.1 Data Collection
3.2.2 Taurus System Log Dataset
3.3 Data Preparation
3.3.1 Users and Systems Privacy
3.3.2 Storage and Size Reduction
3.3.3 Automation and Improvements
3.3.4 Data Discretization and Noise Mitigation
3.3.5 Cleansed Taurus System Log Dataset
3.4 Marking Potential Failures
4 Failure Prediction
4.1 Null Hypothesis
4.2 Failure Correlation
4.2.1 Node Vicinities
4.2.2 Impact of Vicinities
4.3 Anomaly Detection
4.3.1 Statistical Analysis (frequency)
4.3.2 Pattern Detection (order)
4.3.3 Machine Learning
4.4 Adaptive Resilience
5 Results
5.1 Taurus System Logs
5.2 System-wide Failure Patterns
5.3 Failure Correlations
5.4 Taurus Failures Statistics
5.5 Jam-e Jam Prototype
5.6 Summary and Discussion
6 Conclusion and Future Work
Bibliography
List of Figures
List of Tables
Appendix A Neural Network Models
Appendix B External Tools
Appendix C Structure of Failure Metadata Database
Appendix D Reproducibility
Appendix E Publicly Available HPC Monitoring Datasets
Appendix F Glossary
Appendix G Acronyms
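
The sketch below illustrates the frequency-based statistical analysis named in Section 4.3.1 of the table of contents: syslog events are bucketed into per-minute counts for each node, and a minute is flagged as anomalous when its count deviates strongly from a short sliding window of recent history. This is only a minimal sketch assuming such a windowed z-score test; the window length, threshold, and all identifiers are illustrative assumptions and do not come from the thesis or the Jam-e Jam prototype.

    # Minimal sketch: frequency-based anomaly detection over per-node
    # syslog event counts, using a short sliding window of history.
    # WINDOW, THRESHOLD, and all names are illustrative assumptions,
    # not code from the thesis or the Jam-e Jam prototype.
    from collections import deque
    from statistics import mean, stdev

    WINDOW = 30       # minutes of history kept (cf. the abstract's 30-minute claim)
    THRESHOLD = 3.0   # z-score above which a minute is flagged anomalous

    def detect_anomalies(counts_per_minute):
        """Yield (minute_index, count, z_score) for anomalous minutes.

        counts_per_minute: per-minute log-event counts for one node,
        e.g. obtained by bucketing syslog timestamps.
        """
        history = deque(maxlen=WINDOW)
        for i, count in enumerate(counts_per_minute):
            if len(history) >= 2:
                mu, sigma = mean(history), stdev(history)
                if sigma > 0 and (count - mu) / sigma > THRESHOLD:
                    yield i, count, (count - mu) / sigma
            history.append(count)

    if __name__ == "__main__":
        # A quiet node that suddenly floods its log in minute 7.
        trace = [4, 5, 3, 6, 4, 5, 4, 120, 5, 4]
        for minute, count, z in detect_anomalies(trace):
            print(f"minute {minute}: {count} events (z = {z:.1f})")

Because each minute is judged only against that node's own recent window rather than a global threshold, the sketch adapts to per-node baseline chatter, loosely mirroring the noise tolerance the abstract claims for the actual method.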

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:72457
Date: 16 October 2020
Creators: Ghiasvand, Siavash
Contributors: Nagel, Wolfgang E., Schulz, Martin
Publisher: Technische Universität Dresden
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
