Return to search

Minimizing Overhead for Fault Tolerance in Event Stream Processing Systems

Event Stream Processing (ESP) is a well-established approach for low-latency data processing enabling users to quickly react to relevant situations in soft real-time. In order to cope with the sheer amount of data being generated each day and to cope with fluctuating workloads originating from data sources such as Twitter and Facebook, such systems must be highly scalable and elastic. Hence, ESP systems are typically long running applications deployed on several hundreds of nodes in either dedicated data-centers or cloud environments such as Amazon EC2. In such environments, nodes are likely to fail due to software aging, process or hardware errors whereas the unbounded stream of data asks for continuous processing.

In order to cope with node failures, several fault tolerance approaches have been proposed in literature. Active replication and rollback recovery-based on checkpointing and in-memory logging (upstream backup) are two commonly used approaches in order to cope with such failures in the context of ESP systems. However, these approaches suffer either from a high resource footprint, low throughput or unresponsiveness due to long recovery times. Moreover, in order to recover applications in a precise manner using exactly once semantics, the use of deterministic execution is required which adds another layer of complexity and overhead.

The goal of this thesis is to lower the overhead for fault tolerance in ESP systems. We first present StreamMine3G, our ESP system we built entirely from scratch in order to study and evaluate novel approaches for fault tolerance and elasticity. We then present an approach to reduce the overhead of deterministic execution by using a weak, epoch-based rather than strict ordering scheme for commutative and tumbling windowed operators that allows applications to recover precisely using active or passive replication. Since most applications are running in cloud environments nowadays, we furthermore propose an approach to increase the system availability by efficiently utilizing spare but paid resources for fault tolerance. Finally, in order to free users from the burden of choosing the correct fault tolerance scheme for their applications that guarantees the desired recovery time while still saving resources, we present a controller-based approach that adapts fault tolerance at runtime. We furthermore showcase the applicability of our StreamMine3G approach using real world applications and examples.

Identiferoai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:29815
Date17 December 2015
CreatorsMartin, André
ContributorsFetzer, Christof, Pietzuch, Peter, Technische Universität Dresden
Source SetsHochschulschriftenserver (HSSS) der SLUB Dresden
LanguageEnglish
Detected LanguageEnglish
Typedoc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0026 seconds