
How Failures Cascade in Software Systems

Cascading failures involve a failure in one system component that triggers failures in successive system components, potentially leading to system-wide failures. While frequently used fault-tolerance techniques can reduce the severity and frequency of such failures, they continue to occur in practice. To better understand how failures cascade, we have conducted a qualitative analysis of 55 cascading failures described in 26 publicly available incident reports. Through this analysis, we have identified 16 types of cascading mechanisms (organized into eight categories) that capture the nature of the system interactions that contribute to cascading failures. We also discuss three themes based on the observation that the cascading failures we analyzed occurred in one of three ways: when a component was unable to tolerate a failure in another component, through the actions of support or automation systems as they responded to an initial failure, or during system recovery. We believe that the 16 cascading mechanisms we present and the three themes we discuss provide important insights into some of the challenges associated with engineering a truly resilient and well-supported system.

Identifier: oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-10483
Date: 18 April 2022
Creators: Chamberlin, Barbara W.
Publisher: BYU ScholarsArchive
Source Sets: Brigham Young University
Detected Language: English
Type: text
Format: application/pdf
Source: Theses and Dissertations
Rights: https://lib.byu.edu/about/copyright/
