Return to search

Formal Configuration of Fault-Tolerant Systems

Bit flips are known to be a source of strange system behavior, failures, and crashes. They can cause dramatic financial loss, security breaches, or even harm human life. Caused by energized particles arising from, e.g., cosmic rays or heat, they are hardly avoidable. Due to transistor sizes becoming smaller and smaller, modern hardware becomes more and more prone to bit flips. This yields a high scientific interest, and many techniques to make systems more resilient against bit flips are developed. Fault-tolerance techniques are techniques that detect and react to bit flips or their effects. Before using these techniques, they typically need to be configured for the particular system they shall protect, the grade of resilience that shall be achieved, and the environment. State-of-the-art configuration approaches have a high risk of being imprecise, of being affected by undesired side effects, and of yielding questionable resilience measures.

In this thesis we encourage the usage of formal methods for resiliency configuration, point out advantages and investigate difficulties. We exemplarily investigate two systems that are equipped with fault-tolerance techniques, and we apply parametric variants of probabilistic model checking to obtain optimal configurations for pre-defined resilience criteria. Probabilistic model checking is an automated formal method that operates on Markov models, i.e., state-based models with probabilistic transitions, where costs or rewards can be assigned to states and transitions. Probabilistic model checking can be used to compute, e.g., the probability of having a failure, the conditional probability of detecting an error in case of bit-flip occurrence, or the overhead that arises due to error detection and correction. Parametric variants of probabilistic model checking allow parameters in the transition probabilities and in the costs and rewards. Instead of computing values for probabilities and overhead, parametric variants compute rational functions. These functions can then be analyzed for optimality.

The considered fault-tolerant systems are inspired by the work of project partners. The first system is an inter-process communication protocol as it is used in the Fiasco.OC microkernel. The communication structures provided by the kernel are protected against bit flips by a fault-tolerance technique. The second system is inspired by the redo-based fault-tolerance technique \haft. This technique protects an application against bit flips by partitioning the application's instruction flow into transaction, adding redundance, and redoing single transactions in case of error detection.

Driven by these examples, we study challenges when using probabilistic model checking for fault-tolerance configuration and present solutions. We show that small transition probabilities, as they arise in error models, can be a cause of previously known accuracy issues, when using numeric solver in probabilistic model checking. We argue that the use of non-iterative methods is an acceptable alternative. We debate on the usability of the rational functions for finding optimal configurations, and show that for relatively short rational functions the usage of mathematical methods is appropriate.

The redo-based fault-tolerance model suffers from the well-known state-explosion problem. We present a new technique, counter-based factorization, that tackles this problem for system models that do not scale because of a counter, as it is the case for this fault-tolerance model. This technique utilizes the chain-like structure that arises from the counter, splits the model into several parts, and computes local characteristics (in terms of rational functions) for these parts. These local characteristics can then be combined to retrieve global resiliency and overhead measures. The rational functions retrieved for the redo-based fault-tolerance model are huge - for small model instances they already have the size of more than one gigabyte. We therefor can not apply precise mathematic methods to these functions. Instead, we use the short, matrix-based representation, that arises from factorization, to point-wise evaluate the functions.
Using this approach, we systematically explore the design space of the redo-based fault-tolerance model and retrieve sweet-spot configurations.

Identiferoai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34074
Date28 May 2019
CreatorsHerrmann, Linda
ContributorsBaier, Christel, Technische Universität Dresden
Source SetsHochschulschriftenserver (HSSS) der SLUB Dresden
LanguageEnglish
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0019 seconds