  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
191

Detecting and tolerating faults in distributed systems

Ogale, Vinit Arun, 1979- 05 October 2012 (has links)
This dissertation presents techniques for detecting and tolerating faults in distributed systems. Detecting faults in distributed or parallel systems is often very difficult. We look at the problem of determining whether a property or assertion was true in a computation. We formally define a logic called BTL that can be used to define such properties. Our logic takes temporal properties into consideration, as these are often necessary for expressing conditions like safety violations and deadlocks. We introduce the idea of a basis of a computation with respect to a property: a compact and exact representation of the states of the computation where the property was true. We exploit the lattice structure of the computation and the structure of different types of properties to avoid brute-force approaches. We have shown that it is possible to efficiently detect all properties that can be expressed using nested negations, disjunctions, conjunctions and the temporal operators possibly and always. Our algorithm is polynomial in the number of processes and events in the system, though it is exponential in the size of the property.
After faults are detected, it is necessary to act on them and, whenever possible, continue operation with minimal impact. This dissertation also deals with designing systems that can recover from faults. We look at techniques for tolerating faults in data and in program state. In particular, we consider the problem where multiple servers hold different data and program state, all of which must be backed up to tolerate failures. Most current approaches to this problem involve some form of replication; other approaches, based on erasure coding, have high computational and communication overheads. We introduce the idea of fusible data structures to back up data. This approach relies on the inherent structure of the data to determine techniques for combining multiple such structures on different servers into a single backup data structure. We show that most commonly used data structures, such as arrays, lists, stacks and queues, are fusible, and we present algorithms for fusing them. This approach requires less space than replication without increasing the time complexity of any update. In case of failure, data from the backup and the other non-failed servers is required for recovery. To maintain program state across failures, we assume that programs can be represented by deterministic finite state machines. Though this approach may not yet be practical for large programs, it is very useful for small concurrent programs such as sensor networks or finite state machines in hardware designs. We present the theory of fusion of state machines: given a set of such machines, we give a polynomial-time algorithm to compute another set of machines that can tolerate the required number of faults in the system.
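The fusion idea for arrays can be illustrated with a minimal sketch. This is an assumption for illustration only: the dissertation develops more general fused structures, while the sketch below uses the simplest possible coding (XOR parity) and hypothetical function names `fuse` and `recover`:

```python
def fuse(arrays):
    """Combine equal-length integer arrays from several servers into one XOR backup."""
    fused = [0] * len(arrays[0])
    for arr in arrays:
        for i, v in enumerate(arr):
            fused[i] ^= v
    return fused

def recover(fused, surviving):
    """Rebuild the single failed array by XOR-ing the backup with every surviving array."""
    return fuse([fused] + surviving)

# Three servers, one fused backup: one array of backup space instead of three copies.
a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
backup = fuse([a, b, c])
restored = recover(backup, [a, c])  # the server holding b failed
```

As with the fusible structures described above, the backup costs a single structure's worth of space rather than a full replica per server, and recovery combines the backup with the surviving servers' data.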
192

Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines

Mizan, Elias, 1976- 10 October 2012 (has links)
Silicon reliability has reemerged as a very important problem in digital system design. As voltages and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single-event upsets, transistor and interconnect aging, and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have high power and area overheads, cannot guarantee detection of some errors, and cause substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area- and energy-efficient, and can easily be incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to build very aggressive bypass networks capable of achieving high instruction throughput by tolerating timing violations.
To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled dataflow model of computation, because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled dataflow architecture, we enhance the reliability of its datapath, including the computational logic, interconnection network and storage structures. Computations are performed speculatively using SITR, while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled dataflow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted at dataflow architectures that enables speculation only when it is likely to benefit performance.
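SITR itself is a microarchitectural mechanism, but the underlying temporal-redundancy idea — execute the same operation twice at different times and compare, so a transient upset in one execution is caught — can be sketched in software. The function below is a hypothetical illustration of the principle, not the dissertation's hardware design:

```python
def run_with_temporal_redundancy(fn, args, retries=3):
    """Execute fn twice; a transient error is assumed to corrupt at most one run,
    so matching results are accepted and a mismatch triggers re-execution."""
    for _ in range(retries):
        first = fn(*args)
        second = fn(*args)  # redundant execution, displaced in time
        if first == second:
            return first
    raise RuntimeError("results never agreed: suspect a permanent fault")

result = run_with_temporal_redundancy(lambda x, y: x * y, (6, 7))
```

The comparison catches transient errors but, as the abstract notes for redundant execution generally, it costs a second execution and a verification step per operation.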
193

Performance monitoring and fault-tolerant control of complex systems with variable operating conditions

Cholette, Michael Edward 11 October 2012 (has links)
Ensuring the reliable operation of engineering systems has long been a subject of great practical and academic interest. This interest is clearly demonstrated by the preponderance of literature in the areas of Fault Detection and Diagnosis (FDD) and Fault Tolerant Control (FTC) spanning the past three decades. However, increasingly stringent performance and safety requirements have led to engineering systems of progressively increasing complexity. This complexity has rendered many traditional FDD and FTC methods exceedingly cumbersome, often to the point of infeasibility. This thesis aims to enable FDD and FTC for complex engineering systems of interacting dynamic subsystems, for which generic FDD/FTC methods have remained elusive. Effects caused by nonlinearities, interactions between subsystems and varying usage patterns complicate FDD and FTC. The goal of this thesis is to develop FDD and FTC methods that decouple anomalies occurring inside the monitored system from those occurring in the systems affecting it, as well as enabling performance recovery of the monitored system. In pursuit of these goals, FDD and FTC methods are explored that can account for operating regime-dependent effects in monitoring, diagnosis, prognosis and performance recovery for two classes of machines: those that operate in modes that can change only at distinct times (as often occurs in manufacturing operations such as drilling, milling and turning) and those that operate in continuously varying regimes (such as automotive systems or electric motors). For machines that operate in modes that can change only at distinct times, a degradation model is postulated which describes how the system degrades over time in each operating regime. Using the framework of Hidden Markov Models (HMMs), modeling and identification tools are developed that enable identification of an HMM of degradation for each machine operation. Monitoring and prognosis methods that follow naturally from the HMM framework are then presented. The modeling and monitoring methodology is applied to a real-world semiconductor manufacturing process using data provided by a major manufacturer.
For machines that operate in continuously varying regimes, a behavioral model is postulated that describes the input-output dynamics of the normal system in different operating regimes. Monitoring methods are presented that can account for operating regime-dependent modeling accuracies and isolate faults that have not been anticipated and for which no fault models are available. By conducting fault detection in a regime-dependent fashion, changes in modeling errors that are due to operating regime changes can be successfully distinguished from changes that are due to truly faulty operation caused by changes in the system dynamics. This makes it possible to isolate unanticipated faults by propagating fault detection through the various subsystems of the anomalous system. The FDD methodology is applied to detect and diagnose faults in a multiple-input multiple-output Exhaust Gas Recirculation system in a diesel engine. Finally, methods to facilitate the recovery of normal system behavior are detailed. Using the same local model structure that was pursued for the behavioral models, it is envisioned that the nominal controller will be reconfigured to recover nominal behavior as far as possible. To enable this reconfiguration, methods for automated design of closed-loop controllers for the local modeling structure are presented. Using a model-predictive approach with rigorous stability considerations, it is shown that the controllers can track a reference trajectory; such a trajectory could be generated by any model that satisfies the control objectives, for normal or faulty systems. The controllers are then demonstrated on a benchmark system that is nonlinear in the control.
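The HMM-based monitoring step described above can be sketched as a standard forward-filter update over hidden degradation states. This is the generic textbook recursion with an assumed toy two-state "healthy/degraded" model; the thesis's identified, regime-dependent models are richer:

```python
def forward_step(belief, A, B, obs):
    """One HMM forward update: predict the state distribution through transition
    matrix A, then correct with the likelihood B[state][obs] of the new observation."""
    n = len(belief)
    predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    unnorm = [predicted[j] * B[j][obs] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Assumed toy model: state 0 = healthy, state 1 = degraded (absorbing).
A = [[0.9, 0.1], [0.0, 1.0]]   # degradation transitions
B = [[0.8, 0.2], [0.3, 0.7]]   # observation likelihoods per state
belief = forward_step([1.0, 0.0], A, B, obs=1)  # an alarming observation arrives
```

Iterating this update over an observation sequence yields the degradation belief that monitoring and prognosis can act on; maintaining one such model per operating regime is what makes the monitoring regime-dependent.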
194

Adaptable stateful application server replication

Wu, Huaigu, 1975- January 2008 (has links)
In recent years, multi-tier architectures have become the standard computing environment for web and enterprise applications. The application server tier is often the heart of the system, embedding the business logic. Adaptability, in particular the capability to adjust to the load submitted to the system and to handle the failure of individual components, is of utmost importance in order to provide 24/7 access and high performance. Replication is a common means to achieve these reliability and scalability requirements. With replication, the application server tier consists of several server replicas: if one replica fails, others can take over, and the load can be distributed across the available replicas. Although many replication solutions have been proposed so far, most of them have been developed either for fault-tolerance or for scalability. Furthermore, only a few have considered that the application server tier is only one tier in a multi-tier architecture, that this tier maintains state, and that execution in this environment can follow complex patterns. Thus, existing solutions often do not provide correctness beyond some basic application scenarios.
In this thesis we tackle the issue of replicating the application server tier from the ground up and develop a unified solution that provides both fault-tolerance and scalability. We first describe a set of execution patterns that capture how requests are typically executed in multi-tier architectures. They consider the flow of execution across the client tier, application server tier and database tier. In particular, the execution patterns describe how requests are associated with transactions, the fundamental execution units at the application server and database tiers. With these execution patterns in mind, we provide a formal definition of what it means to provide a correct execution across all tiers, even when failures occur and the application server tier is replicated. Informally, a replicated system is correct if it behaves exactly as a non-replicated system that never fails. From there, we propose a set of replication algorithms for fault-tolerance that provide correctness for the execution patterns we have identified. The main principle is to let a primary application server replica execute all client requests and to propagate any state changes performed by a transaction to backup replicas at transaction commit time. The challenges arise because requests can be associated with transactions in different ways. We then extend our fault-tolerance solution into a unified solution that provides both fault-tolerance and load balancing. In this extended solution, each application server replica is able to execute client requests as a primary while at the same time serving as a backup for other replicas. The framework provides a transparent, truly distributed and lightweight load distribution mechanism that takes advantage of the fault-tolerance infrastructure. Our replication tool is implemented as a plug-in for the JBoss application server, and its performance is carefully evaluated against JBoss's own replication solutions. The evaluation shows that our protocols have very good performance and compare favorably with existing solutions.
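The core fault-tolerance principle above — the primary alone executes client requests and ships a transaction's state changes to the backups only at commit time — can be sketched as follows. Class and method names here are hypothetical illustrations; the actual tool is a JBoss plug-in:

```python
class Backup:
    def __init__(self):
        self.state = {}

    def apply(self, changes):
        # Install a committed change set received from the primary.
        self.state.update(changes)


class Primary:
    def __init__(self, backups):
        self.state = {}
        self.backups = backups
        self.pending = {}  # state changes of the current transaction

    def execute(self, key, value):
        # A client request runs only on the primary; nothing reaches backups yet.
        self.pending[key] = value

    def commit(self):
        # At commit time, install the changes locally and propagate them to all backups.
        self.state.update(self.pending)
        for b in self.backups:
            b.apply(dict(self.pending))
        self.pending = {}


backup = Backup()
primary = Primary([backup])
primary.execute("account", 100)
before_commit = dict(backup.state)  # still empty: changes not yet propagated
primary.commit()
after_commit = dict(backup.state)   # backup now mirrors the committed state
```

Deferring propagation to commit means aborted transactions cost the backups nothing, at the price of the commit-time hand-off being the critical step for correctness.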
195

Negative Quasi-Probability in the Context of Quantum Computation

Veitch, Victor January 2013 (has links)
This thesis deals with the question of what resources are necessary and sufficient for quantum computational speedup. In particular, we study what resources are required to promote fault-tolerant stabilizer computation to universal quantum computation. In this context we discover a remarkable connection between the possibility of quantum computational speedup and negativity in the discrete Wigner function, a particular distinguished quasi-probability representation for quantum theory. This connection allows us to establish a number of important results related to magic state computation, an important model for fault-tolerant quantum computation using stabilizer operations supplemented by the ability to prepare noisy non-stabilizer ancilla states. In particular, we resolve in the negative the open problem of whether every non-stabilizer resource suffices to promote computation with stabilizer operations to universal quantum computation. Moreover, by casting magic state computation as a resource theory we are able to quantify how useful ancilla resource states are for quantum computation, which allows us to give bounds on the required resources. In this context we discover that the sum of the negative entries of the discrete Wigner representation of a state is a measure of its usefulness for quantum computation. This gives a precise, quantitative meaning to the negativity of a quasi-probability representation, thereby resolving the 80-year debate as to whether this quantity is a meaningful indicator of quantum behaviour. We believe that the techniques we develop here will be widely applicable in quantum theory, particularly in the context of resource theories.
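The quantity the abstract singles out — the sum of the negative entries of a state's discrete Wigner representation — reduces numerically to a sum over a quasi-probability vector. The sketch below applies it to hand-picked vectors for illustration; computing the actual discrete Wigner function of a quantum state is omitted:

```python
def sum_negativity(quasi_probs):
    """Total magnitude of the negative entries of a quasi-probability distribution.
    Zero for a true probability distribution; a positive result signals entries
    that no classical probability assignment could produce."""
    return -sum(p for p in quasi_probs if p < 0)

stabilizer_like = [0.5, 0.5, 0.0]   # non-negative: no negativity resource
magic_like = [0.6, 0.6, -0.2]       # entries sum to 1, but one is negative
```

In the resource-theoretic reading described above, a larger value of this quantity indicates a state that is more useful as an ancilla for promoting stabilizer computation.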
196

Fault-Tolerance Strategies and Probabilistic Guarantees for Real-Time Systems

Aysan, Hüseyin January 2012 (has links)
Ubiquitous deployment of embedded systems is having a substantial impact on our society, since they interact with our lives in many critical real-time applications. Typically, embedded systems used in safety- or mission-critical applications (e.g., in the aerospace, avionics, automotive or nuclear domains) work in harsh environments where they are exposed to frequent transient faults such as power supply jitter, network noise and radiation. They are also susceptible to errors originating from design and production faults. Hence, they must be designed to maintain the properties of timeliness and functional correctness even under error occurrences. Fault-tolerance plays a crucial role in achieving dependability, and the fundamental requirement for the design of effective and efficient fault-tolerance mechanisms is a realistic and applicable model of potential faults and their manifestations. An important factor to be considered in this context is the random nature of faults and errors, which, if addressed in the timing analysis by assuming a rigid worst-case occurrence scenario, may lead to inaccurate results. It is also important that the power, weight, space and cost constraints of embedded systems are addressed by efficiently using the available resources for fault-tolerance. This thesis presents a framework for designing predictably dependable embedded real-time systems by jointly addressing the timeliness and reliability properties. It proposes a spectrum of fault-tolerance strategies particularly targeting embedded real-time systems. Efficient resource usage is attained by considering the diverse criticality levels of the systems' building blocks. The fault-tolerance strategies are complemented with the proposed probabilistic schedulability analysis techniques, which are based on a comprehensive stochastic fault and error model.
197

Advances in Fault Diagnosis and Fault Tolerant Control Motivated by Large Flexible Space Structure

Kok, Yao Hong 29 November 2013 (has links)
In this thesis, two problems are studied. The first is to find a technique for generating a particular type of failure information in real time for large flexible space structures (LFSSs). This problem is solved using structured residuals, and the failure information is then incorporated into an existing fault tolerant control scheme. The second problem is a "spin-off" from the first. Although the H-infinity sliding mode observer (SMO) cannot be applied to the colocated LFSS, its ability to perform robust state and fault estimation makes it suitable for use in an integrated fault tolerant control (IFTC) scheme. We propose to combine the H-infinity SMO with a linear fault accommodation controller. Our IFTC scheme is closed-loop stable, suppresses the effects of faults and enjoys enhanced robustness to disturbances. The effectiveness of the IFTC is illustrated through the control of a permanent magnet synchronous motor under an actuator fault.
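Structured residuals isolate faults by designing each residual to respond only to a subset of faults, so the binary pattern of fired residuals forms a signature that identifies the fault. The sketch below is generic, with a hypothetical signature table rather than the thesis's LFSS design:

```python
def isolate_fault(residuals, signatures, threshold=0.05):
    """Compare the binary firing pattern of the residuals against known fault signatures."""
    fired = tuple(abs(r) > threshold for r in residuals)
    for fault, signature in signatures.items():
        if fired == signature:
            return fault
    return None  # pattern matches no known signature

# Hypothetical signature table: which residuals each fault excites.
SIGNATURES = {
    "actuator fault": (True, False),
    "sensor fault": (False, True),
}

diagnosis = isolate_fault([0.4, 0.01], SIGNATURES)
```

Because each residual is sensitive only to its designed fault subset, a single evaluation of the pattern yields the isolation decision in real time, which is what makes the approach attractive for online use.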
199

Automatically increasing fault tolerance in distributed systems

Bazzi, Rida Adnan January 1994 (has links)
No description available.
200

Support for fault-tolerant computations in distributed object systems

Chelliah, Muthusamy January 1996 (has links)
No description available.
