
Towards a resilience investigation framework for high performance computing

As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, much of which is due to the rise in failures that accompanies increased node/component counts. Fault tolerance, or resilience, is a key challenge for computing and a major factor in the successful utilization of high-end scientific computing platforms. As the importance of fault tolerance increases, methods for experimentation with new mechanisms and policies become critical. The methodical investigation of failure in these systems is hampered by their scale and by a lack of tools for controlled experimentation. The focus of this research is to provide a versatile, low-overhead platform for fault tolerance/resilience experimentation in a high-performance computing (HPC) environment. The objective is to extend the HPC workflow and toolkit to provide ways of studying large-scale scientific applications at extreme scales with synthetic faults (errors) in a controlled environment. As part of this research we leverage prior work in the areas of HPC system software and performance evaluation tools to enable controlled experimentation through fault injection, while maintaining acceptable performance for scientific workloads. The research identifies two crucial characteristics that must be balanced in fault-injection experiments: (i) integration (context), and (ii) isolation (protection). The result of this research is a Resilience Investigation Framework (RIF) that provides HPC users and developers a versatile experimental framework that balances integration and isolation when exploring resilience methods and policies in large-scale systems.
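To illustrate what "synthetic faults (errors)" means in this context, the following is a minimal sketch of one common injection technique: flipping a single bit in application data to emulate a silent data corruption. It is not the thesis's framework; the function name and the RIF_INJECT environment flag are hypothetical, used only to show how injection can be gated so that normal runs are unaffected.

    /* Illustrative sketch: flip one random bit in a double to model a
     * transient memory error. Names and the RIF_INJECT flag are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    /* Toggle a randomly chosen bit of *value. */
    static void inject_bitflip(double *value)
    {
        uint64_t bits;
        memcpy(&bits, value, sizeof(bits));       /* reinterpret the bytes safely */
        bits ^= (uint64_t)1 << (rand() % 64);     /* flip one random bit */
        memcpy(value, &bits, sizeof(bits));
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        double x = 3.141592653589793;

        /* Inject only when explicitly enabled, so baseline runs are unchanged. */
        if (getenv("RIF_INJECT") != NULL)
            inject_bitflip(&x);

        printf("x = %.17g\n", x);
        return 0;
    }

Gating injection behind an explicit switch reflects the integration/isolation balance described above: the injector runs in the application's context, but its effects are confined to deliberately enabled experiments.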

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:658011
Date: January 2014
Creators: Naughton, Thomas J.
Publisher: University of Reading
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
