131
Detecting and tolerating faults in distributed systems / Ogale, Vinit Arun, 1979- 05 October 2012 (has links)
This dissertation presents techniques for detecting and tolerating faults in distributed systems. Detecting faults in distributed or parallel systems is often very difficult. We look at the problem of determining whether a property or assertion was true in a computation. We formally define a logic called BTL that can be used to specify such properties. Our logic takes temporal properties into consideration, as these are often necessary for expressing conditions like safety violations and deadlocks. We introduce the idea of a basis of a computation with respect to a property: a compact and exact representation of the states of the computation where the property was true. We exploit the lattice structure of the computation and the structure of different types of properties to avoid brute-force approaches. We show that it is possible to efficiently detect all properties that can be expressed using nested negations, disjunctions, conjunctions, and the temporal operators possibly and always. Our algorithm is polynomial in the number of processes and events in the system, though it is exponential in the size of the property.

After faults are detected, it is necessary to act on them and, whenever possible, continue operation with minimal impact. This dissertation therefore also deals with designing systems that can recover from faults. We look at techniques for tolerating faults in data and in program state. In particular, we consider the problem where multiple servers hold different data and program state, all of which must be backed up to tolerate failures. Most current approaches to this problem involve some form of replication; other approaches, based on erasure coding, have high computational and communication overheads. We introduce the idea of fusible data structures for backing up data. This approach relies on the inherent structure of the data to determine techniques for combining multiple such structures on different servers into a single backup data structure. We show that most commonly used data structures, such as arrays, lists, stacks, and queues, are fusible, and we present algorithms for fusing them. This approach requires less space than replication without increasing the time complexity of any update. In case of failures, data from the backup and the other non-failed servers is required for recovery. To maintain program state across failures, we assume that programs can be represented by deterministic finite state machines. Though this approach may not yet be practical for large programs, it is very useful for small concurrent programs like sensor networks or finite state machines in hardware designs. We present the theory of fusion of state machines: given a set of such machines, we present a polynomial-time algorithm to compute another set of machines that can tolerate the required number of faults in the system.
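To make the fusion idea concrete, the sketch below illustrates the general principle on the simplest case. The function names and the choice of element-wise addition as the fusion operator are assumptions for this example, not the dissertation's algorithms: n arrays held by n servers are fused into a single backup array, so a single crash is tolerated using far less space than full replication.

```python
# Minimal sketch of fusing n equal-length arrays, held by n different
# servers, into one backup array (element-wise sums are an assumed, simple
# fusion operator). One crashed server's array is rebuilt from the fused
# backup together with the arrays of the surviving servers.

def fuse(arrays):
    """Combine the primary arrays into a single backup structure."""
    return [sum(column) for column in zip(*arrays)]

def recover(fused, surviving):
    """Rebuild the lost array from the backup and the non-failed primaries."""
    return [f - sum(column) for f, column in zip(fused, zip(*surviving))]

primaries = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]  # one array per server
backup = fuse(primaries)                       # one backup instead of 3 copies

# Server 1 crashes: recover its array from the backup plus servers 0 and 2.
assert recover(backup, [primaries[0], primaries[2]]) == [1, 5, 9]

# An update to one array element needs only one backup-entry update, so
# update costs are unchanged, in the spirit of the "no added time
# complexity" claim above.
```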
132
Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines / Mizan, Elias, 1976- 10 October 2012 (has links)
Silicon reliability has reemerged as a very important problem in digital system design. As voltages and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single-event upsets, transistor and interconnect aging, and circuit variability. Computational functional units are particularly challenging to protect because current redundant-execution techniques have high power and area overheads, cannot guarantee detection of some errors, and cause substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal-redundancy technique suitable for pipelined computational functional units. SITR detects most temporary errors, is area- and energy-efficient, and can easily be incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to build very aggressive bypass networks capable of high instruction throughput by tolerating timing violations.

To address the performance degradation caused by redundant execution, this dissertation proposes a tiled-dataflow model of computation, because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-dataflow architecture, we enhance the reliability of its datapath, including the computational logic, interconnection network, and storage structures. Computations are performed speculatively using SITR, while traditional information-redundancy techniques protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions; upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-dataflow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted at dataflow architectures that enables speculation only when it is likely to benefit performance.
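As a software illustration of the underlying re-execute-and-compare principle (a generic model of temporal redundancy, not the SITR microarchitecture itself; all names and rates here are hypothetical):

```python
# Generic re-execute-and-compare model of temporal redundancy. Each
# operation runs twice through the same "unit"; disagreement indicates a
# transient error and triggers re-execution.

import random

def flaky_alu(a, b, upset_rate=0.05):
    """An adder that occasionally suffers a transient single-bit upset."""
    result = a + b
    if random.random() < upset_rate:
        result ^= 1 << random.randrange(32)  # flip one random bit
    return result

def redundant_execute(a, b, max_retries=4):
    """Issue the operation twice and accept only matching results."""
    for _ in range(max_retries):
        first, second = flaky_alu(a, b), flaky_alu(a, b)
        if first == second:  # identical corrupted pairs are possible but rare
            return first
    raise RuntimeError("repeated mismatch: possible permanent fault")

print(redundant_execute(41, 1))  # 42, re-executing if an upset was detected
```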
133
Adaptable stateful application server replication / Wu, Huaigu, 1975- January 2008 (has links)
In recent years, multi-tier architectures have become the standard computing environment for web and enterprise applications. The application server tier is often the heart of the system, embedding the business logic. Adaptability, in particular the capability to adjust to the load submitted to the system and to handle the failure of individual components, is of utmost importance in order to provide 24/7 access and high performance. Replication is a common means of achieving these reliability and scalability requirements. With replication, the application server tier consists of several server replicas: if one replica fails, others can take over, and the load can be distributed across the available replicas. Although many replication solutions have been proposed so far, most of them have been developed either for fault tolerance or for scalability. Furthermore, only a few have considered that the application server tier is only one tier in a multi-tier architecture, that this tier maintains state, and that execution in this environment can follow complex patterns. Thus, existing solutions often do not provide correctness beyond some basic application scenarios.

In this thesis we tackle the issue of replicating the application server tier from the ground up and develop a unified solution that provides both fault tolerance and scalability. We first describe a set of execution patterns that capture how requests are typically executed in multi-tier architectures. They consider the flow of execution across the client tier, application server tier, and database tier. In particular, the execution patterns describe how requests are associated with transactions, the fundamental execution units at the application server and database tiers. With these execution patterns in mind, we give a formal definition of what it means to provide a correct execution across all tiers, even when failures occur and the application server tier is replicated. Informally, a replicated system is correct if it behaves exactly like a non-replicated system that never fails. From there, we propose a set of replication algorithms for fault tolerance that provide correctness for the execution patterns we have identified. The main principle is to let a primary application server replica execute all client requests and to propagate any state changes performed by a transaction to the backup replicas at transaction commit time. The challenges arise because requests can be associated with transactions in different ways. We then extend our fault-tolerance solution into a unified solution that provides both fault tolerance and load balancing. In this extended solution, each application server replica is able to execute client requests as a primary while at the same time serving as a backup for other replicas. The framework provides a transparent, truly distributed, and lightweight load-distribution mechanism that takes advantage of the fault-tolerance infrastructure. Our replication tool is implemented as a plug-in for the JBoss application server, and its performance is carefully evaluated against JBoss's own replication solutions. The evaluation shows that our protocols perform very well and compare favorably with existing solutions.
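A bare-bones sketch of the primary-backup principle described above; the class names and the dictionary-based state are illustrative assumptions, not the thesis's JBoss-based framework:

```python
# Bare-bones primary-backup sketch (hypothetical names; state is a plain
# dictionary). The primary executes requests inside a transaction and ships
# the accumulated state changes to the backups only at commit time.

class Replica:
    def __init__(self):
        self.state = {}

    def apply_changes(self, changes):
        self.state.update(changes)

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups
        self.pending = {}                  # uncommitted changes of the txn

    def execute(self, key, value):
        self.pending[key] = value          # handle a client request

    def commit(self):
        self.apply_changes(self.pending)   # commit locally first
        for backup in self.backups:        # then propagate at commit time
            backup.apply_changes(self.pending)
        self.pending = {}

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.execute("cart:42", ["book"])
primary.commit()
assert all(b.state == primary.state for b in backups)  # backups can take over
```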
134
Automatically increasing fault tolerance in distributed systems / Bazzi, Rida Adnan, January 1994 (has links)
No description available.
135
Support for fault-tolerant computations in distributed object systems / Chelliah, Muthusamy, January 1996 (has links)
No description available.
136
Design and evaluation of a distributed diagnosis algorithm for arbitrary network topologies in dynamic fault environments / Subbiah, Arun, 12 1900 (has links)
No description available.
137
Optimized error coverage in built-in self-test by output data modification / Zorian, Yervant, January 1987 (has links)
The concept of Built-In Self-Test (BIST) has recently become an increasingly attractive solution to the complex problem of testing VLSI chips. However, the realization of BIST faces some challenging problems of its own. One of these problems is to increase the quality of fault coverage of a BIST implementation without incurring a large overhead. In particular, the loss of information in the output data compressor, which is typically a multi-input linear feedback shift register (MISR), is a major cause of concern.

In the recent past, several researchers have proposed different schemes to reduce this loss of information while maintaining a small area overhead.

In this dissertation, a new BIST scheme based on modifying the output data before compression is developed. This scheme, called output data modification (ODM), exploits knowledge of the functionality of the circuit under test to provide a circuit-specific BIST structure. The structure is developed so that it can conveniently be implemented for any general circuit under consideration. More importantly, a proof of effectiveness is provided to show that ODM will, on average, be orders of magnitude better than all existing schemes in its capability to reduce information loss for a given amount of area overhead.

Moreover, the constructive nature of the proof allows one to establish a simple trade-off between the tolerated information loss and the area overhead needed to effect this reduction.
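For context, the sketch below models a simple MISR-style output compactor in software; the register width and feedback taps are arbitrary illustrative choices, and the ODM scheme itself is not reproduced here. Aliasing, the information loss discussed above, occurs whenever a faulty output stream compacts to the same signature as the fault-free one.

```python
# Software model of an 8-bit MISR-style output compactor; the width and the
# feedback taps are illustrative, not taken from the dissertation.

def misr_signature(outputs, width=8, taps=(0, 2, 3, 4)):
    """Compress a stream of circuit output words into a single signature."""
    sig = 0
    mask = (1 << width) - 1
    for word in outputs:
        feedback = 0
        for t in taps:                        # XOR the tapped state bits
            feedback ^= (sig >> t) & 1
        sig = ((sig << 1) | feedback) & mask  # shift with feedback
        sig ^= word & mask                    # fold the next output word in
    return sig

good = misr_signature([0x3A, 0x5C, 0x0F, 0x91])  # fault-free responses
bad = misr_signature([0x3A, 0x5C, 0x1F, 0x91])   # one corrupted response
assert good != bad  # this fault is caught; a match would have been aliasing
```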
138
Semi-automatic fault localization / Jones, James Arthur, 17 January 2008 (has links)
One of the most expensive and time-consuming components of the debugging process is locating the errors or faults. To locate faults, developers must identify statements involved in failures and select suspicious statements that might contain faults. In practice, this localization is done by developers in a tedious and manual way, using only a single execution, targeting only one fault, and having a limited perspective into a large search space.

The thesis of this research is that fault localization can be partially automated with the use of commonly available dynamic information gathered from test-case executions in a way that is effective, efficient, tolerant of test cases that pass but also execute the fault, and scalable to large programs that potentially contain multiple faults. The overall goal of this research is to develop effective and efficient fault-localization techniques that scale to programs of large size and with multiple faults. Three principal steps are performed to reach this goal: (1) develop practical techniques for locating suspicious regions in a program; (2) develop techniques to partition test suites into smaller, specialized test suites that target specific faults; and (3) evaluate the usefulness and cost of these techniques.

In this dissertation, the difficulties and limitations of previous work in the area of fault localization are explored. A technique, called Tarantula, is presented that addresses these difficulties. Empirical evaluation shows that the Tarantula technique is efficient and effective for many faults. The evaluation also demonstrates that the technique can lose effectiveness as the number of faults increases. To address this loss of effectiveness for programs with multiple faults, supporting techniques have been developed and are presented; their empirical evaluation demonstrates that they can enable effective fault localization in the presence of multiple faults. A new mode of debugging, called parallel debugging, is developed, and empirical evidence demonstrates that it can provide savings in terms of both total expense and time to delivery. A prototype visualization is provided to display the fault-localization results and to offer a method for interacting with and exploring those results. Finally, a study on the effects of the composition of test suites on fault localization is presented.
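For reference, the Tarantula suspiciousness metric introduced by Jones and Harrold ranks each statement by how strongly failing, as opposed to passing, test executions cover it; the sketch below computes it over a toy coverage matrix (the input representation is an assumed format for illustration).

```python
# The Tarantula suspiciousness metric of Jones and Harrold, computed over a
# toy coverage matrix.

def tarantula(coverage, outcomes):
    """coverage: statement -> set of test ids executing that statement;
    outcomes: test id -> True if the test passed."""
    total_passed = sum(1 for ok in outcomes.values() if ok)
    total_failed = len(outcomes) - total_passed
    scores = {}
    for stmt, tests in coverage.items():
        passed = sum(1 for t in tests if outcomes[t])
        failed = len(tests) - passed
        p = passed / total_passed if total_passed else 0.0
        f = failed / total_failed if total_failed else 0.0
        scores[stmt] = f / (p + f) if (p + f) else 0.0  # 1.0 = most suspect
    return scores

outcomes = {"t1": True, "t2": True, "t3": False}
coverage = {"s1": {"t1", "t2", "t3"}, "s2": {"t3"}, "s3": {"t1"}}
print(tarantula(coverage, outcomes))  # s2 -> 1.0: covered only by the failure
```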
139
MPLS inter-domain protection using domain boundary local bypass tunnels / Messier, Donald, January 1900 (has links)
Thesis (M.Eng.) - Carleton University, 2002. / Includes bibliographical references (p. 90-92). Also available in electronic format on the Internet.
140
Development and study of a fault-tolerant operating system for a multiprocessor system / Gagnon, Nicolas, January 1997 (has links)
Thesis (M.Eng.)--Université du Québec à Chicoutimi, 1997. / Electronic document also available in PDF format. CaQCU