Return to search

Error isolation in distributed systems

In distributed systems, if a hardware fault corrupts the state of a process, this error might propagate as a corrupt message and contaminate other processes in the system, causing severe outages. Recently, state corruptions of this nature have been observed surprisingly often in large computer populations, e.g., in large-scale data centers. Moreover, since the resilience of processors is expected to decline in the near future, the likelihood of state corruptions will increase even further.

In this work, we argue that preventing the propagation of state corruption should be a first-class requirement for large-scale fault-tolerant distributed systems. In particular, we propose developers to target error isolation, the property in which each correct process ignores any corrupt message it receives.

Typically, a process cannot decide whether a received message is corrupt or not. Therefore, we introduce hardening as a class of principled approaches to implement error isolation in distributed systems. Hardening techniques are (semi-)automatic transformations that enforce that each process appends an evidence of good behavior in the form of error codes to all messages it sends. The techniques “virtualize” state corruptions into more benign failures such as crashes and message omissions: if a faulty process fails to detect its state corruption and abort, then hardening guarantees that any corrupt message the process sends has invalid error codes. Correct processes can then inspect received messages and drop them in case they are corrupt.

With this dissertation, we contribute theoretically and practically to the state of the art in fault-tolerant distributed systems. To show that hardening is possible, we design, formalize, and prove correct different hardening techniques that enable existing crash-tolerant designs to handle state corruption with minimal developer intervention. To show that hardening is practical, we implement and evaluate these techniques, analyzing their effect on the system performance and their ability to detect state corruptions in practice.

Identiferoai:union.ndltd.org:DRESDEN/oai:qucosa.de:bsz:14-qucosa-203428
Date25 May 2016
CreatorsBehrens, Diogo
ContributorsTechnische Universität Dresden, Fakultät Informatik, Ph.D. Christof Fetzer, Ph.D. Christof Fetzer, Ph.D. Flavio Junqueira
PublisherSaechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden
Source SetsHochschulschriftenserver (HSSS) der SLUB Dresden
LanguageEnglish
Detected LanguageEnglish
Typedoc-type:doctoralThesis
Formatapplication/pdf

Page generated in 0.002 seconds