As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, largely because failure rates rise with increased node/component counts. Fault tolerance, or resilience, is a key challenge for computing and a major factor in the successful utilization of high-end scientific computing platforms. As the importance of fault tolerance increases, methods for experimenting with new mechanisms and policies become critical. The methodical investigation of failure in these systems is hampered by their scale and by a lack of tools for controlled experimentation. The focus of this research is to provide a versatile, low-overhead platform for fault-tolerance/resilience experimentation in a high-performance computing (HPC) environment. The objective is to extend the HPC workflow and toolkit to provide ways of studying large-scale scientific applications at extreme scales with synthetic faults (errors) in a controlled environment. As part of this research we leverage prior work in the areas of HPC system software and performance evaluation tools to enable controlled experimentation through fault injection, while maintaining acceptable performance for scientific workloads. The research identifies two crucial characteristics that must be balanced in fault-injection experiments: (i) integration (context), and (ii) isolation (protection). The result of this research is a Resilience Investigation Framework (RIF) that provides HPC users and developers a versatile experimental framework balancing integration and isolation when exploring resilience methods and policies in large-scale systems.
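The kind of synthetic fault the abstract refers to can be illustrated with a minimal, hypothetical sketch (not the framework's actual API): a single-event upset emulated by flipping one randomly chosen bit in a buffer that stands in for application state. All names here (`inject_bitflip`, `state`) are illustrative assumptions.

```python
import random

def inject_bitflip(data: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit in-place, emulating a soft error.

    Returns the index of the byte that was corrupted so an
    experiment harness can log and later analyze the injection site.
    """
    byte_idx = rng.randrange(len(data))
    bit_idx = rng.randrange(8)
    data[byte_idx] ^= 1 << bit_idx
    return byte_idx

# Example: corrupt a zeroed buffer standing in for application state.
rng = random.Random(42)          # seeded for reproducible experiments
state = bytearray(64)
hit = inject_bitflip(state, rng)

# Exactly one bit is set afterwards, so the hit byte is a power of two.
assert state[hit] != 0
assert sum(state) in {1, 2, 4, 8, 16, 32, 64, 128}
```

A real injector in a controlled HPC experiment would additionally need to target specific process memory regions and record injection metadata, but the core corruption step is this small.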
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:658011 |
Date | January 2014 |
Creators | Naughton, Thomas J. |
Publisher | University of Reading |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |