Global ETD Search

1	The analysis and synthesis of dependable digital systems Philp, Kenneth January 1996 No description available. 621.39 Fault tolerance
2	Smart distributed systems Perez, Hector Benitez January 1999 No description available. 629.8 Fault tolerance
3	Dual dissimilar processor system for high reliability Patel, D. C. January 1982 No description available. 621.39 Microprocessor fault tolerance
4	Interfaces for embedded parallel multiprocessor networks Triger, Simon January 2002 No description available. 621 Fault tolerance
5	Performance optimizations for compiler-based error detection Mitropoulou, Konstantina January 2015 The trend towards smaller transistor technologies and lower operating voltages stresses the hardware and makes transistors more susceptible to transient errors. In future systems, performance and power gains will come at the cost of unreliable areas on the chip. For this reason, there is an increased need for low-overhead, highly reliable error detection methodologies. In recent years, several techniques have been proposed. The majority of them are based on redundancy, which can be implemented at several levels (e.g., hardware, instruction, thread, process). In instruction-level error detection approaches, the compiler replicates the instructions of the program and inserts checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. This type of error detection is more flexible than the hardware alternatives. It allows the programmer to choose the protected area of the program and it can be applied without any hardware modifications. On the other hand, the replicated instructions and the checks cause a large slowdown, making software techniques less appealing. In this thesis, we propose two techniques that aim at reducing the error detection overhead of compiler-based approaches and improving system performance without sacrificing fault coverage. The first technique, DRIFT, achieves this by decoupling the execution of the code (original and replicated) from the checks. The checks are compare and jump instructions. The latter tend to make the code sequential and prohibit the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates scheduler-friendly code with more instruction-level parallelism (ILP). As a result, it reduces the performance overhead down to 1.29× (on average) and outperforms the state-of-the-art by up to 29.7% while retaining the same fault coverage. Next, CASTED focuses on reducing the impact of error detection overhead on single-chip scalable architectures that are composed of tightly-coupled cores. The proposed compiler methodology adaptively distributes the error detection overhead to the available resources across multiple cores, fully exploiting the abundant ILP of these architectures. CASTED adapts to a wide range of architecture configurations (issue-width, inter-core communication). The results show that CASTED matches the performance of, and often outperforms, sometimes by as much as 21.2%, the best fixed state-of-the-art approach while maintaining the same fault coverage. 005.75 fault tolerance ; compiler
6	Test and fault-tolerance for network-on-chip infrastructures Grecu, Cristian 05 1900 The demands of future computing, as well as the challenges of nanometer-era VLSI design, will require new design techniques and design styles that are simultaneously high performance, energy-efficient, and robust to noise and process variation. One of the emerging problems concerns the communication mechanisms between the increasing number of blocks, or cores, that can be integrated onto a single chip. The bus-based systems and point-to-point interconnection strategies in use today cannot be easily scaled to accommodate the large numbers of cores projected in the near future. Network-on-chip (NoC) interconnect infrastructures are one of the key technologies that will enable the emergence of many-core processors and systems-on-chip with increased computing power and energy efficiency. This dissertation is focused on testing, yield improvement and fault-tolerance of such NoC infrastructures. A fast, efficient test method is developed for NoCs that exploits their inherent parallelism to reduce the test time by transporting test data on multiple paths and testing multiple NoC components concurrently. The improvement of test time varies, depending on the NoC architecture and test transport protocol, from 2X to 34X, compared to current NoC test methods. This test mechanism is used subsequently to perform detection of NoC link permanent faults, which are then repaired by an on-chip mechanism that replaces the faulty signal lines with fault-free ones, thereby increasing the yield, while maintaining the same wire delay characteristics. The solution described in this dissertation significantly improves the achievable yield of NoC inter-switch channels, from a 4% improvement for an 8-bit wide channel to a 71% improvement for a 128-bit wide channel. The direct benefit is an improved fault-tolerance and increased yield and long-term reliability of NoC based multicore systems.
Network-on-chip Fault tolerance
7	Supporting fault-tolerant communication in networks Kanjani, Khushboo 15 May 2009 We address two problems dealing with fault-tolerant communication in networks. The first one is designing a distributed storage protocol tolerant to Byzantine failure of servers. The protocol implements a multi-writer multi-reader register which satisfies a weaker consistency condition called MWReg. Most of the earlier work gives multi-writer implementations by simulating m copies of a single-writer protocol, where m is the number of writers. Our solution gives a direct multi-writer implementation and thus has bounded message and time complexity independent of the number of writers. We have simulated the complete protocol to test its performance and also proved its correctness theoretically. The second problem we address is that of providing a reliable communication link between two nodes in a network. We present a capacity reservation algorithm for the case where each edge has an upper bound on its capacity and a cost per unit of capacity used. We give a flow-based approximation algorithm with cost at most four times optimal. To conclude, we design a distributed storage protocol and a capacity reservation algorithm which are tolerant to network failures. Distributed Storage fault-tolerance
8	Replicating multithreaded services Kapritsos, Emmanouil 09 February 2015 For the last 40 years, the systems community has invested a lot of effort in designing techniques for building fault-tolerant distributed systems and services. This effort has produced a massive list of results: the literature describes how to design replication protocols that tolerate a wide range of failures (from simple crashes to malicious "Byzantine" failures) in a wide range of settings (e.g. synchronous or asynchronous communication, with or without stable storage), optimizing various metrics (e.g. number of messages, latency, throughput). These techniques have their roots in ideas, such as the abstraction of State Machine Replication and the Paxos protocol, that were conceived when computing was very different than it is today: computers had a single core; all processing was done using a single thread of control, handling requests sequentially; and a collection of 20 nodes was considered a large distributed system. In the last decade, however, computing has gone through some major paradigm shifts, with the advent of multicore architectures and large cloud infrastructures. This dissertation explains how these profound changes impact the practical usefulness of traditional fault-tolerant techniques and proposes new ways to architect these solutions to fit the new paradigms. Fault tolerance Replication Multithreading
9	A reconfiguration-based defect-tolerant design paradigm for nanotechnologies He, Chen 28 August 2008 Not available. Nanotechnology Fault tolerance (Engineering)
10	Reliable mobile agents for distributed computing Wagealla, Waleed January 2003 The emergence of platform-independent, mobile code technologies has created big opportunities for Internet-based applications. Mobile agents are being utilized to perform a variety of tasks from personalized computing to business-critical transactions. Unfortunately, these advances were not matched by corresponding research into the reliability of these new technologies. This work has been undertaken to investigate the fault tolerance of this new paradigm. The mobility and autonomy of execution of agent programs have introduced a new class of failures, different from those of traditional distributed systems. Therefore, fault tolerance is one of the main problems that must be resolved to improve the adoption of the agent paradigm. The investigation of mobile agents' reliability in this thesis resulted in the development of REMA (REliable Mobile Agents), which guarantees the reliable execution, migration, and communication of mobile agents in the presence of faults that might affect the agents' hosts or their communication network. We introduced an algorithm for the transparent detection of faults that might affect agent execution, migration, and communication. A decentralized structure was used to divide the agent dynamic distributed system into network-partitioning proof spaces. Lightweight messaging was adopted as the basic error detection engine, which together with the loosely coupled detection managers provided an efficient, low-overhead detection mechanism for agent-based distributed processing. Checkpointing agent execution is hampered by the lack of access to the underlying structure of the JVM. Thus, an alternative solution has been achieved through the REMA Checkpoint and Recovery (REMA-CR) package. REMA-CR provides the developer with powerful classes and methods that allow for capturing the critical data of agents' execution. The developed recovery protocol offers a low-cost, communication-pairs independent checkpointing strategy that covers all possible faults that might invalidate reliable agent execution, migration and communication, and maintains the exactly-once execution property. The results and the performance of REMA confirmed our objectives of providing a fault-tolerant wrapper for agents and their applications with acceptable overhead cost. 006 Fault tolerance
