131
A Low-latency Consensus Algorithm for Geographically Distributed Systems. Arun, Balaji, 15 May 2017
This thesis presents Caesar, a novel multi-leader Generalized Consensus protocol for geographically replicated systems. Caesar achieves near-perfect availability, provides high performance (low latency and high throughput) compared to the existing state-of-the-art, and tolerates replica failures. Recently, a number of state-of-the-art consensus protocols that implement the Generalized Consensus definition have been proposed. However, the major limitation of these existing approaches is the significant performance degradation when the application workload produces conflicting requests. Caesar's main goal is to overcome this limitation by changing the way a fast decision is taken: its ordering protocol does not reject a fast decision for a client request if a quorum of nodes reply with different dependency sets for that request. It switches to a slow decision only if there is no chance to agree on the proposed order for that request. Caesar achieves this using a combination of a wait condition and logical timestamping. The effectiveness of Caesar is demonstrated through an evaluation study performed on Amazon's EC2 infrastructure using 5 geo-replicated sites. Caesar outperforms multi-leader competitors (e.g., EPaxos) by as much as 1.7x in the presence of 30% conflicting requests, and single-leader ones (e.g., Multi-Paxos) by as much as 3.5x. The protocol is also resistant to heavy client loads, unlike existing protocols.

Master of Science (general-audience abstract): Today, there exists a plethora of online services (e.g., Facebook, Google) that serve millions of users daily. Usually, each of these services has multiple subcomponents that work cohesively to deliver a rich user experience. One vital component that is prevalent in these services is the one that maintains the shared state. One example of a shared-state component is a database, which enables operations on structured data. Such shared states are replicated across multiple server nodes, and even across multiple data centers, to guarantee availability, i.e., if a node fails, other nodes can still serve requests on the shared state; low latency, i.e., placing a copy of the shared state in a data center closer to the users reduces the time required to serve them; and scalability, i.e., the bottleneck of a single server node being unable to serve millions of concurrent requests can be alleviated by having multiple nodes serve users at the same time. These replicated shared states need to be kept consistent, i.e., every copy of the shared state must be the same on all the replicating nodes, and maintaining this consistency requires that these nodes communicate with each other and reach an agreement on the order in which the operations on the shared data should be applied. In that regard, this thesis proposes Caesar, a consensus protocol with the aforementioned guarantees that eases the deployment of services that contain a shared state. It addresses the problem of performance degradation in existing approaches when the same part of the shared state is accessed by multiple users connected to different server nodes. The effectiveness of Caesar is demonstrated through an evaluation study performed by deploying the protocol on five of Amazon's data centers around the world. Caesar outperforms the existing state-of-the-art by as much as 3.5x. Caesar is also resistant to heavy client loads, unlike existing protocols.
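To make the fast-path difference concrete, here is a minimal sketch (with invented names such as `Reply` and `fast_decision_caesar`) contrasting an EPaxos-style rule, which abandons the fast path when dependency sets differ, with a Caesar-style rule that tolerates differing dependency sets as long as a quorum accepts the proposed logical timestamp. This is an illustration of the idea only, not the thesis's implementation:

```python
# Illustrative sketch (not the thesis implementation): contrasts an
# EPaxos-style fast-path rule with a Caesar-style, timestamp-based rule.
from dataclasses import dataclass

@dataclass
class Reply:
    deps: frozenset       # dependency set reported by a replica
    ts_accepted: bool     # did the replica accept the proposed timestamp?

def fast_decision_epaxos(replies, fast_quorum):
    """EPaxos-style rule: fast path only if a fast quorum reports
    identical dependency sets."""
    if len(replies) < fast_quorum:
        return None
    deps_sets = {r.deps for r in replies[:fast_quorum]}
    return deps_sets.pop() if len(deps_sets) == 1 else None

def fast_decision_caesar(replies, fast_quorum):
    """Caesar-style rule (simplified): fast path as long as a fast quorum
    accepts the proposed logical timestamp; dependency sets may differ
    and are merged by union."""
    accepting = [r for r in replies if r.ts_accepted]
    if len(accepting) < fast_quorum:
        return None  # no agreement on the proposed order: slow path
    return frozenset().union(*(r.deps for r in accepting[:fast_quorum]))

replies = [Reply(frozenset({"a"}), True), Reply(frozenset({"a", "b"}), True),
           Reply(frozenset({"a"}), True)]
print(fast_decision_epaxos(replies, 3))  # None: dependency sets differ
print(fast_decision_caesar(replies, 3))  # frozenset({'a', 'b'})
```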
132
Network Fault Tolerance System. Sullivan, John F, 01 May 2000
The world of computers experienced an explosive period of growth toward the end of the 20th century with the widespread availability of the Internet and the development of the World Wide Web. As people began using computer networks for everything from research and communication to banking and commerce, network failures became a greater concern because of their potential to interrupt critical applications. Fault tolerance systems were developed to detect and correct network failures within minutes, and eventually within seconds, of the failure, but time-critical applications such as military communications, video conferencing, and Web-based sales require better response times than any previous system could provide. The goal of this thesis was the development and implementation of a Network Fault Tolerance (NFT) system that can detect and recover from failures of network interface cards, network cables, switches, and routers in much less than one second from the time of failure. The problem was divided into two parts: fault tolerance within a single local area network (LAN), and fault tolerance across many local area networks. The first part involves the network interface cards, network cables, and switches within a LAN, while the second part involves the routers that connect LANs into larger internetworks. Both parts of the NFT solution were implemented on Windows NT 4.0 PCs connected by a switched Fast Ethernet network. The NFT system was found to correct failures within 300 milliseconds of the failure.
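The abstract does not detail the detection mechanism; the following sketch shows one plausible shape of sub-second failover, with a monitor that declares a path dead after a few missed heartbeats and switches to a backup. The 50 ms probe interval and threshold of 3 are invented for illustration and merely consistent with the reported 300 ms recovery bound:

```python
# Minimal sketch of sub-second failover logic in the spirit of the NFT
# system: declare the primary path failed after a few missed heartbeats
# and switch traffic to a backup. Intervals are assumptions.
import time

HEARTBEAT_INTERVAL = 0.05   # 50 ms between probes (invented)
MISS_THRESHOLD = 3          # declare failure after 3 missed heartbeats

class PathMonitor:
    def __init__(self, paths):
        self.paths = paths          # e.g. ["eth0", "eth1"]
        self.active = 0
        self.misses = 0

    def probe(self, alive):
        """Called every HEARTBEAT_INTERVAL with the probe result."""
        if alive:
            self.misses = 0
            return
        self.misses += 1
        if self.misses >= MISS_THRESHOLD:
            self.active = (self.active + 1) % len(self.paths)
            self.misses = 0
            print(f"failover to {self.paths[self.active]} "
                  f"(~{MISS_THRESHOLD * HEARTBEAT_INTERVAL * 1000:.0f} ms "
                  f"after failure)")

mon = PathMonitor(["eth0", "eth1"])
for alive in [True, True, False, False, False]:  # primary dies
    mon.probe(alive)
    time.sleep(HEARTBEAT_INTERVAL)
```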
133
Adaptive Fault Tolerance Strategies for Large Scale Systems. George, Cijo, January 2012
Exascale systems of the future are predicted to have a mean time between node failures (MTBF) of less than one hour. At such a low MTBF, the number of processors available for the execution of a long-running application can vary widely throughout its execution. Employing traditional fault tolerance strategies like periodic checkpointing in these highly dynamic environments may not be effective because of the high number of application failures, resulting in a large amount of work lost to rollbacks, apart from the increased recovery overheads. In this context, it is necessary to have fault tolerance strategies that can adapt to the changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that make use of node failure prediction mechanisms to provide proactive fault tolerance for long-running parallel applications on large scale systems.
The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. We first develop cost models that consider different factors, such as the accuracy of node failure predictions and application scalability, for evaluating the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions, dynamically selecting fault tolerance actions at different points of application execution to minimize application failures and maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems.
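As a hedged illustration of how such cost models can drive runtime decisions, the sketch below picks the cheapest action given a predicted failure probability. The overhead and rework numbers, and the simple expected-cost expression, are placeholders, not the thesis's actual models:

```python
# Invented cost-model sketch: choose the fault tolerance action with the
# lowest expected cost for a given predicted failure probability.
def expected_cost(action, p_fail, overhead, rework):
    """overhead: cost of taking the action now; rework: work redone if
    the node fails and the action did not protect against it."""
    return overhead[action] + p_fail * rework[action]

overhead = {"none": 0, "checkpoint": 30, "migrate": 60, "reschedule": 90}
# work redone on failure: migration/rescheduling move off the suspect node
rework = {"none": 600, "checkpoint": 50, "migrate": 0, "reschedule": 0}

def choose(p_fail):
    return min(overhead, key=lambda a: expected_cost(a, p_fail,
                                                     overhead, rework))

for p in (0.01, 0.3, 0.9):
    print(p, choose(p))   # none -> checkpoint -> migrate as risk rises
```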
In the second part of the thesis, we present a fault tolerance strategy using adaptive process replication that can provide fault tolerance for applications using partial replication of a set of application processes. This fault tolerance framework adaptively changes the set of replicated processes (the replicated set) periodically, based on node failure predictions, to avoid application failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows dynamically changing the replicated set of processes for MPI applications. Experiments with real scientific applications on real systems have shown that the overhead of PAREP-MPI is minimal. We have shown, using simulations with real and synthetic failure traces, that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. Significant observations are also made which can drive future research efforts in fault tolerance for large and very large scale systems.
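A minimal sketch of the adaptive-replication idea, assuming a per-node failure predictor and a fixed replication budget k (both placeholders; the real PAREP-MPI operates on MPI process replicas):

```python
# Illustrative sketch: every interval, replicate the k processes whose
# host nodes are most likely to fail according to the predictor.
def update_replicated_set(procs, predict_fail_prob, k):
    """procs: list of (rank, node). Returns the set of ranks to replicate."""
    ranked = sorted(procs, key=lambda p: predict_fail_prob(p[1]),
                    reverse=True)
    return {rank for rank, _ in ranked[:k]}

fail_prob = {"n0": 0.02, "n1": 0.40, "n2": 0.05, "n3": 0.30}  # invented
procs = [(0, "n0"), (1, "n1"), (2, "n2"), (3, "n3")]
print(update_replicated_set(procs, fail_prob.get, 2))  # {1, 3}
```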
134
Identification of emergent off-nominal operational requirements during conceptual architecting of the more electric aircraft. Armstrong, Michael James, 09 November 2011
With the current increased emphasis on the development of energy-optimized vehicle systems architectures during the early phases of aircraft conceptual design, accurate predictions of off-nominal operational requirements are needed to justify architecture concept selection. A process was developed for capturing architecture-specific performance degradation strategies and optimally imposing their associated requirements. This process is enabled by analog extensions to traditional safety design and assessment tools and consists of six phases: Continuous Functional Hazard Assessment, Architecture Definition, Load Shedding Optimization, Analog System Safety Assessment, Architecture Optimization, and Architecture Augmentation.
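As an illustration of what the Load Shedding Optimization phase might optimize, the sketch below selects which electrical loads to keep after a capacity loss, maximizing retained criticality within the degraded supply. The loads, weights and objective are invented for this example:

```python
# Hedged illustration of load shedding as constrained optimization:
# keep the most critical loads that fit the degraded capacity.
from itertools import combinations

loads = {  # name: (power draw in kW, criticality weight) -- invented
    "flight_controls": (20, 100),
    "avionics":        (15, 90),
    "fuel_pumps":      (10, 80),
    "galley":          (25, 5),
    "ife":             (30, 10),   # in-flight entertainment
}

def best_shed_plan(loads, capacity_kw):
    """Exhaustive search (fine for a handful of loads)."""
    names = list(loads)
    best = (None, -1)
    for r in range(len(names) + 1):
        for keep in combinations(names, r):
            kw = sum(loads[n][0] for n in keep)
            val = sum(loads[n][1] for n in keep)
            if kw <= capacity_kw and val > best[1]:
                best = (set(keep), val)
    return best[0]

print(best_shed_plan(loads, capacity_kw=50))
# {'flight_controls', 'avionics', 'fuel_pumps'}: sheds galley and IFE
```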
Systematic off-nominal analysis of requirements was performed for dissimilar architecture concepts. It was shown that the traditional discrete application of safety and reliability requirements has adverse effects on the prediction of requirements. This design bias was illustrated by cumulative unit importance metrics. Low-fidelity representations of the loss/hazard relationship place undue importance on some units and yield under- or over-predictions of system performance.
135
A framework for evolving grid computing systems. Alfawair, Mai, January 2009
Grid computing was born in the 1990s, when researchers were looking for a way to share expensive computing resources and experimental equipment. Grid computing is becoming increasingly popular because it promotes the sharing of distributed resources that may be heterogeneous in nature, and it enables scientists and engineering professionals to solve large scale computing problems. In reality, there are already huge numbers of grid computing facilities distributed around the world, each one having been created to serve a particular group of scientists, such as weather forecasters, or a group of users, such as stock markets. However, the need to extend the functionalities of current grid systems lends itself to the consideration of grid evolution. This allows the combination of many disjoint grids into a single powerful grid that can operate as one vast computational resource, and enables grid environments to be flexible, to change and to evolve. The rationale for grid evolution is the current rapid and increasing advance of both software and hardware. Evolution means adding or removing capabilities. This research defines grid evolution as adding new functions and/or equipment and removing unusable resources that affect the performance of some nodes.

This thesis produces a new technique for grid evolution, allowing it to be seamless and to operate at run time. Within grid computing, evolution is an integration of software and hardware and can be of two distinct types: internal evolution, which occurs inside the grid boundary by migrating special resources such as application software from node to node within the grid; and external evolution, which occurs between grids. This thesis develops a framework for grid evolution that insulates users from the complexities of grids. This framework has at its core a resource broker together with a grid monitor to cope with internal and external evolution, advance reservation, fault tolerance, the monitoring of the grid environment, increased resource utilisation and the high availability of grid resources.

The starting point for the present framework is when the grid receives a job whose requirements do not exist on the required node, which triggers grid evolution. If the grid has all the requirements scattered across its nodes, internal evolution ensues, enabling the grid to migrate the required resources to the required node in order to satisfy the job's requirements; if the grid does not have these resources, external evolution enables the grid either to collect them from other grids (permanent evolution) or to send the job to another grid for execution (just-in-time evolution).

Finally, a simulation tool called EVOSim has been designed, developed and tested. It is written in Oracle 10g and has been used for the creation of four grids, each of which has a different setup, including different nodes, application software, data and policies. Experiments were done by submitting jobs to the grid at run time, and then comparing the results and analysing the performance of those grids that use the approach of evolution against those that do not. The results of these experiments have demonstrated that these features significantly improve the performance of grid environments and provide excellent scheduling results, with a decreasing number of rejected jobs.
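A compact sketch of the evolution decision just described, with invented data structures (a grid as a mapping from node to the set of resources it holds); the choice between permanent and just-in-time external evolution would be policy-driven in the real framework:

```python
# Illustrative broker logic: internal evolution first, then external
# evolution (pull resources permanently, or forward the job just-in-time).
def schedule(job_reqs, target_node, grid, other_grids):
    missing = job_reqs - grid[target_node]
    if not missing:
        return "run locally"
    pool = set().union(*grid.values())
    if missing <= pool:
        grid[target_node] |= missing          # internal evolution
        return "internal evolution: migrated " + ", ".join(sorted(missing))
    for name, g in other_grids.items():
        if missing <= set().union(*g.values()):
            grid[target_node] |= missing      # permanent external evolution
            return f"external (permanent): pulled from {name}"
    for name, g in other_grids.items():
        if any(job_reqs <= node for node in g.values()):
            return f"external (just-in-time): job sent to {name}"
    return "rejected"

grid = {"n1": {"matlab"}, "n2": {"blast"}}
others = {"gridB": {"m1": {"matlab", "blast", "gromacs"}}}
print(schedule({"matlab", "blast"}, "n1", grid, others))  # internal evolution
print(schedule({"gromacs"}, "n1", grid, others))          # external (permanent)
```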
136
FAULT-TOLERANT DISTRIBUTED CHANNEL ALLOCATION ALGORITHMS FOR CELLULAR NETWORKS. Yang, Jianchang, 01 January 2006
In cellular networks, channels should be allocated efficiently to support communication between mobile hosts. In addition, in cellular networks, base stations may fail. Therefore, designing a fault-tolerant channel allocation algorithm is important; that is, the algorithm should tolerate failures of base stations. Many existing algorithms are neither fault-tolerant nor efficient in allocating channels. We propose channel allocation algorithms which are both fault-tolerant and efficient. In the proposed algorithms, to borrow a channel, a base station (or a cell) does not need to get channel usage information from all its interference neighbors. This makes the algorithms fault-tolerant, i.e., the algorithms can tolerate base station failures and perform well in the presence of these failures.

Channel pre-allocation has an effect on the performance of a channel allocation algorithm. This effect has not been studied quantitatively. We propose an adaptive channel allocation algorithm to study this effect. The algorithm allows a subset of channels to be pre-allocated to cells. Performance evaluation indicates that a channel allocation algorithm benefits from pre-allocating all channels to cells.

Channel selection strategy also influences the performance of a channel allocation algorithm. Given a set of channels to borrow, how a cell chooses a channel to borrow is called the channel selection problem. When choosing a channel to borrow, many algorithms proposed in the literature do not take into account the interference that borrowing the channel causes to the cells which have the channel allocated to them. However, such interference should be considered; reducing it increases the reuse of the same channel and hence improves channel utilization. We propose a channel selection algorithm that takes such interference into account.

Most channel allocation algorithms proposed in the literature are for traditional cellular networks with static base stations, where the neighborhood relationship among the base stations is fixed. Such algorithms are not applicable to cellular networks with mobile base stations. We propose a channel allocation algorithm for cellular networks with mobile base stations. The proposed algorithm is both fault-tolerant and reuses channels efficiently.

KEYWORDS: distributed channel allocation, resource planning, fault tolerance, cellular networks, 3-cell cluster model.
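The channel-selection idea, that is, preferring the borrowable channel whose loan interferes with the fewest neighboring holders, can be sketched as follows (topology and data structures are invented for illustration):

```python
# Hedged sketch of interference-aware channel selection: among borrowable
# channels, pick the one whose loan blocks the fewest neighboring cells
# that currently hold it.
def select_channel(borrowable, holders, borrower_neighbors):
    """borrowable: candidate channels; holders[ch]: cells allocated ch;
    borrower_neighbors: the borrower's interference neighborhood."""
    def interference(ch):
        return len(holders[ch] & borrower_neighbors)
    return min(borrowable, key=interference)

holders = {1: {"c2", "c5"}, 2: {"c7"}, 3: {"c2", "c3"}}
print(select_channel({1, 2, 3}, holders, borrower_neighbors={"c2", "c3"}))
# channel 2: no holder in the borrower's neighborhood is blocked
```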
137
Optimised configuration of sensing elements for control and fault tolerance applied to an electro-magnetic suspension system. Michail, Konstantinos, January 2009
New technological advances and the requirement to abide by increasingly stringent safety laws in engineering design projects strongly affect industrial products in areas such as the automotive, aerospace and railway industries. The necessity arises to design reduced-cost hi-tech products with minimal complexity, optimal performance, effective parameter robustness properties, and high reliability with fault tolerance. In this context the control system design plays an important role, and its impact on the cost efficiency of a product is crucial. Measuring the information required for the operation of the designed control system in any product is a vital issue, and in such cases a number of sensors can be available to select from in order to achieve the desired system properties. However, for a complex engineering system a manual procedure to select the best sensor set subject to the desired system properties can be very complicated, time consuming or even impossible, the more so with a large number of sensors and a requirement for optimum performance. The thesis describes a comprehensive study of sensor selection for control and fault tolerance with the particular application of an ElectroMagnetic Levitation system (an unstable, nonlinear, safety-critical system with non-trivial control performance requirements). The particular aim of the presented work is to identify effective sensor selection frameworks, subject to given system properties, for controlling (with a level of fault tolerance) the MagLev suspension system. A particular objective is to identify the minimum possible set of sensors that can cover multiple sensor faults while maintaining optimum performance with the remaining sensors. The tools employed combine modern control strategies and multiobjective constrained optimisation (for tuning purposes) methods. An important part of the work is the design and construction of a 25kg MagLev suspension to be used for experimental verification of the proposed sensor selection frameworks.
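The outer loop of such a sensor-selection framework can be sketched as an exhaustive search over sensor subsets, where `meets_specs` stands in for the expensive controller-tuning and assessment step (the oracle below is an invented placeholder, not the thesis's MagLev model):

```python
# Illustrative sensor-selection outer loop: find the smallest sensor set
# that meets the specs and still meets them after any single sensor fault.
from itertools import combinations

SENSORS = ["air_gap", "flux", "current", "vertical_accel", "velocity"]

def meets_specs(sensor_set):
    # Invented oracle: position is observable from the air gap sensor or
    # from flux and current together; a rate signal is also required.
    s = set(sensor_set)
    position = "air_gap" in s or {"flux", "current"} <= s
    rate = bool(s & {"vertical_accel", "velocity"})
    return position and rate

def fault_tolerant(sensor_set):
    """Every single-sensor fault must leave a set that still meets specs."""
    return all(meets_specs(set(sensor_set) - {s}) for s in sensor_set)

def smallest_fault_tolerant_set():
    for r in range(1, len(SENSORS) + 1):
        for cand in combinations(SENSORS, r):
            if meets_specs(cand) and fault_tolerant(cand):
                return set(cand)
    return None

print(smallest_fault_tolerant_set())  # all five sensors, for this oracle
```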
138
High redundancy actuator. Du, Xinli, January 2008
High Redundancy Actuation (HRA) is a novel type of fault-tolerant actuator. By comprising a relatively large number of actuation elements, faults in the elements can be inherently accommodated without resulting in a failure of the complete actuation system. By removing the need for fault detection and reconfiguration, HRA can provide high reliability and availability. The idea is motivated by the composition of human musculature: our musculature can sustain damage and still function, sometimes with reduced performance, and even the complete loss of a muscle group can be accommodated through kinematic redundancy, e.g. the use of just one leg. Electro-mechanical actuation is used as the single element inside the HRA. This thesis starts with the modelling and simulation of an individual actuation element and the two basic structures for connecting elements, in series and in parallel. A relatively simple HRA is then modelled, employing a two-by-two series-in-parallel configuration. Based on this HRA, position feedback controllers are designed using both classical and optimal algorithms under two control structures. All controllers are tested under both healthy and fault-injected conditions. Finally, a hardware demonstrator is set up based on the simulation studies. The demonstrator is controlled in real time using an xPC Target system. Experimental results show that the HRA can continue to work when one element fails, although performance degradation is to be expected.
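A toy capability model for the two-by-two series-in-parallel arrangement illustrates why faults degrade rather than fail the actuator: series elements add travel, parallel branches add force. The fault modes and numbers below are illustrative assumptions, not values from the thesis:

```python
# Toy HRA capability model: a jammed element removes travel from its
# branch; a branch with a free-running element carries no force.
ELEMENT_FORCE = 100.0   # N per element (invented)
ELEMENT_TRAVEL = 10.0   # mm per element (invented)

def hra_capability(branches):
    """branches: list of parallel branches; each branch is a list of
    series element states in {'ok', 'jammed', 'free'}."""
    force = 0.0
    travels = []
    for branch in branches:
        if any(e == "free" for e in branch):   # open branch: no force path
            continue
        force += ELEMENT_FORCE
        travels.append(sum(ELEMENT_TRAVEL for e in branch if e == "ok"))
    # all force-carrying branches must reach the demanded position
    travel = min(travels) if travels else 0.0
    return force, travel

healthy = [["ok", "ok"], ["ok", "ok"]]
jam_one = [["ok", "jammed"], ["ok", "ok"]]
print(hra_capability(healthy))   # (200.0, 20.0)
print(hra_capability(jam_one))   # (200.0, 10.0): reduced travel, full force
```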
139
Quality of service of crash-recovery failure detectors. Ma, Tiejun, January 2007
This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective. We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run, with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow the measurement of the recovery speed and the definition of the completeness property of a failure detector. The impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is then analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. According to this approximate model, we show how to estimate analytically the failure detector's parameters to achieve a required QoS, based on Chen et al.'s NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. In order to make the failure detector adaptive to the target's crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocol. One is a reliable recovery detection protocol, which can guarantee the detection of each failure and recovery by adopting persistent storage. The other is a lightweight recovery detection protocol, which does not guarantee the detection of every failure and recovery but which reduces the system overhead. Both of these recovery detection protocols improve completeness without reducing the other QoS aspects of a failure detector. In addition, we demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself. To evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm, and the NFD-S algorithm with the lightweight recovery detection protocol, for various values of MTTF and MTTR. The simulation results show that the dependability of a recoverable monitored target can have a significant impact on the QoS of such a failure detector, which conforms well to our models and analysis. We show that in the case of a reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for other QoS metrics.
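In the spirit of this evaluation, the following small simulation shows how the target's MTTR limits the completeness of a plain timeout detector: failures whose downtime is shorter than the detection window are never suspected, which is exactly what the recovery detection protocols are meant to address. The heartbeat period, timeout and detection rule are simplified assumptions, not the NFD-S algorithm:

```python
# Simplified simulation: a target alternates up/down with exponential
# MTTF and MTTR, sends heartbeats every eta seconds while up, and a
# timeout detector suspects it when no heartbeat arrives within delta.
import random

def simulate(mttf, mttr, eta=1.0, delta=3.0, horizon=10_000, seed=1):
    rng = random.Random(seed)
    t, up, detected, missed = 0.0, True, 0, 0
    while t < horizon:
        if up:
            t += rng.expovariate(1 / mttf)   # time until next crash
            up = False
        else:
            downtime = rng.expovariate(1 / mttr)
            if downtime > delta - eta:       # gap exceeds the timeout
                detected += 1
            else:                            # recovered before suspicion:
                missed += 1                  # the failure goes undetected
            t += downtime
            up = True
    total = detected + missed
    return detected / total if total else 1.0

for mttr in (0.5, 2.0, 10.0):
    print(f"MTTR={mttr:4}: detected fraction = {simulate(100.0, mttr):.2f}")
# shorter MTTR -> more failures slip past the detector (worse completeness)
```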
140
Proposition d'une architecture de contrôle adaptative pour la tolérance aux fautes (Proposition of an Adaptive Control Architecture for Fault Tolerance). Durand, Bastien, 15 June 2011
Software control architectures are the decisional center of robots. Unfortunately, robots and their architectures suffer from numerous flaws that disrupt and/or compromise the achievement of the missions they are assigned. We therefore propose a methodology for designing an adaptive control architecture for the implementation of fault tolerance. The first part of this manuscript proposes a state of the art of dependability, first in a generic way and then specified to the context of control architectures. The second part details the proposed methodology for identifying the potential faults of a robot and responding to them using fault tolerance mechanisms. The third part presents the experimental and application context in which the proposed methodology is implemented; this implementation constitutes the fourth part of the manuscript. An experiment highlighting specific aspects of the methodology is detailed in the last part.