101

Preemptive Placement and Routing for In-Field FPGA Repair

Jensen, Joshua E. 01 March 2015
With the growing density and shrinking feature size of modern semiconductors, it is increasingly difficult to manufacture defect-free devices that maintain acceptable levels of reliability over long periods of time. These systems are increasingly susceptible to wear-out, in which a device ceases to meet its operational specification after an extended period of use. The reconfigurability of FPGAs can be used to repair post-manufacturing faults by configuring the FPGA to avoid a damaged resource. This thesis presents a method for preemptively preparing to repair FPGA devices with wear-out faults by precomputing a set of repair circuits that, collectively, can repair a fault in any logic block of the FPGA. The approach relies on logic placement and routing to create “repair” circuits that avoid specific logic blocks; a repair can then be applied when the corresponding resource fails. New placement and routing algorithms are proposed for generating such repair circuits. The number of repairs needed to create a complete repair set depends heavily on the utilization of the FPGA's resources. The algorithms are tested against several benchmarks, with multiple area constraints for each benchmark. With this approach, an average of 20 repair configurations was needed to repair 99% of permanent faults.
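The structure of the approach can be summarized in a short sketch. The greedy covering loop and the names place_and_route_avoiding, unused_blocks, and load_bitstream below are illustrative assumptions, not the thesis's actual algorithms or tooling:

```python
# Hypothetical sketch of a precomputed repair set: every logic block must be
# left unused by at least one configuration, so any single faulty block can
# be routed around by loading a matching precomputed bitstream.

def build_repair_set(design, logic_blocks, place_and_route_avoiding):
    """Greedily precompute configurations until every logic block is
    avoided by at least one configuration (a set-cover-style loop)."""
    uncovered = set(logic_blocks)
    repair_set = []
    while uncovered:
        target = next(iter(uncovered))
        config = place_and_route_avoiding(design, forbidden={target})
        repair_set.append(config)
        # One placement usually leaves many blocks unused, so each new
        # configuration covers several potential faults at once.
        uncovered -= config.unused_blocks()
    return repair_set

def repair(faulty_block, repair_set, load_bitstream):
    """At run time, load any precomputed configuration that avoids the
    failed block."""
    for config in repair_set:
        if faulty_block in config.unused_blocks():
            load_bitstream(config)
            return config
    raise RuntimeError("no precomputed repair avoids this block")
```

The set-cover flavor of the loop also suggests why the size of the repair set tracks resource utilization: a lightly utilized design leaves many blocks unused in each configuration, so few configurations suffice.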
102

Fault-tolerance in HLA-based distributed simulations

Eklöf, Martin January 2006
Successful integration of simulations within the Network-Based Defence (NBD), specifically the use of simulations within Command and Control (C2) environments, imposes a number of requirements. Simulations must be reliable and able to respond in a timely manner; otherwise the commander will have no confidence in using simulation as a tool. An important aspect of these requirements is the provision of fault-tolerant simulations, in which failures are detected and resolved in a consistent manner. Given the distributed nature of many military simulation systems, services for fault tolerance in distributed simulations are desirable. The main architecture for distributed simulations within the military domain, the High Level Architecture (HLA), does not provide support for the development of fault-tolerant simulations. A common approach to fault tolerance in distributed systems is checkpointing: states of the system are persistently stored throughout its operation, and if a failure occurs, the system is restored from a previously saved state. Given these shortcomings of the HLA standard, this thesis explores the development of fault-tolerance mechanisms in the context of the HLA. More specifically, the design, implementation, and evaluation of fault-tolerance mechanisms based on checkpointing are described and discussed.
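A minimal sketch of the checkpointing pattern may clarify the mechanism. The simulation object and its methods are generic stand-ins, not the HLA federate interfaces or the services developed in the thesis:

```python
# Periodically persist simulation state; on a detected failure, restore the
# most recent checkpoint and resume from there.
import pickle
import time

CHECKPOINT_PATH = "federate.ckpt"  # hypothetical persistent store

def run_with_checkpointing(simulation, interval_s=30.0):
    last = time.monotonic()
    while not simulation.done():
        try:
            simulation.step()
            if time.monotonic() - last >= interval_s:
                with open(CHECKPOINT_PATH, "wb") as f:
                    pickle.dump(simulation.state(), f)
                last = time.monotonic()
        except RuntimeError:
            # Failure detected: roll back to the last saved state and
            # repeat the work done since that checkpoint.
            with open(CHECKPOINT_PATH, "rb") as f:
                simulation.restore(pickle.load(f))
```

The checkpoint interval trades runtime overhead against the amount of work repeated after a rollback, which is the central tuning decision in any checkpoint-based scheme.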
103

A Competitive Reconfiguration Approach To Autonomous Fault Handling Using Genetic Algorithms

Zhang, Kening 01 January 2008
In this dissertation, a novel self-repair approach based on Consensus-Based Evaluation (CBE) for autonomous repair of SRAM-based Field Programmable Gate Arrays (FPGAs) is developed, evaluated, and refined. An initial population of functionally identical (same input-output behavior) yet physically distinct (alternative design or place-and-route realization) FPGA configurations is produced at design time. During run time, the CBE approach ranks these alternative configurations after evaluating their discrepancy relative to the consensus formed by the population. Through runtime competition, faults in the logical resources become occluded from the visibility of subsequent FPGA operations. Meanwhile, offspring formed through crossover and mutation of faulty and viable configurations are selected at a controlled re-introduction rate for evaluation and refurbishment. Refurbishments are evolved in situ, with online, real-time, input-based performance evaluation, enhancing system availability and sustainability and creating an Organic Embedded System (OES). A fault-tolerance model called N-Modular Redundancy with Standby (NMRSB) is developed, which combines the two popular fault-tolerance techniques of NMR and standby fault tolerance to facilitate the CBE approach. This dissertation develops two instances of the NMRSB system: Triple Modular Redundancy with Standby (TMRSB) and Duplex with Standby (DSB). A hypothetical Xilinx Virtex-II Pro FPGA model demonstrates their viability for various applications, including a 3-bit x 3-bit multiplier and the MCNC-91 benchmark circuits. Experiments conducted on the model evaluate the performance of three new genetic operators and demonstrate progress towards a completely self-contained single-chip implementation, so that the FPGA can refurbish itself without requiring a PC host to execute the genetic algorithm. This dissertation presents results from simulations of multiple applications with a CBE model implemented in the C++ programming language. Starting with an initial population of 20 and 30 viable configurations for TMRSB and DSB, respectively, a single stuck-at fault is introduced in the logic resources. Fault-refurbishment experiments are conducted under the supervision of CBE using a fitness-state evaluation function based on competing outputs, fitness adjustment, and differing threshold levels. The device remains online throughout the process, by which a complete repair is realized with Hamming-distance and bit-weight voting schemes. The results indicate that a Hamming-distance TMRSB approach can prevent the most pervasive fault impacts and realize complete refurbishment. Experimental results also show that the Autonomic Layer demonstrates 100% faulty-component isolation for both Functional Elements (FEs) and Autonomous Elements (AEs) with randomly injected single and multiple faults. Using logic circuits from the MCNC-91 benchmark set, availability during repair phases averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits, respectively, under the stated conditions. In addition to simulation, the proposed OES architecture, synthesized from HDL, was prototyped on a Xilinx Virtex-II Pro FPGA device supporting partial reconfiguration to demonstrate the feasibility of intrinsic regeneration of the selected circuit.
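The consensus-ranking step can be illustrated with a hedged sketch. The bit-vector outputs and the pairwise scoring below are simplified stand-ins for the dissertation's actual CBE machinery:

```python
# Rank functionally identical configurations by their discrepancy from the
# population consensus, using Hamming distance between output vectors.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def rank_by_consensus(outputs):
    """Score each configuration by its total Hamming distance to every
    other output; a low score means it agrees with the consensus."""
    scores = [(sum(hamming(o, other) for other in outputs), i)
              for i, o in enumerate(outputs)]
    return sorted(scores)  # most consensus-like configurations first

# Example: three TMR module outputs for one input vector. Module 2 is
# discrepant, so it would be demoted and queued for GA refurbishment.
outputs = [0b1011, 0b1011, 0b0011]
print(rank_by_consensus(outputs))  # [(1, 0), (1, 1), (2, 2)]
```

Discrepant configurations are not discarded; as the abstract notes, they re-enter the population as crossover and mutation material at a controlled re-introduction rate.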
104

Cross-Layer Fault-Tolerant Design and Analysis for High Manufacturing Yield and System Reliability

Guo, Jianghao 26 May 2016
No description available.
105

Implementation of Logic Fault Tolerance on a Dynamically Reconfigurable FPGA

Jayarama, Kiran January 2016
No description available.
106

A Foundation for Fault Tolerant Components

Leal, William Milo 17 December 2001
No description available.
107

Scalable design of fault-tolerance for wireless sensor networks

Demirbas, Murat 29 September 2004
No description available.
108

High Performance and Network Fault Tolerant MPI with Multi-Pathing over InfiniBand

Vishnu, Abhinav 11 December 2007
No description available.
109

Network Fault Resilient MPI for Multi-Rail InfiniBand Clusters

Pai Raikar, Siddhesh Prakash Sunita January 2011
No description available.
110

Feasibility Studies of Statistic Multiplexed Computing

Celik, Yasin January 2018
In 2012, when Professor Shi introduced me to the concept of Statistic Multiplexed Computing (SMC), I was skeptical. It contradicted everything I had learned and heard about distributed and parallel computing. However, I did believe that unhandled failures in any application would negatively impact its scalability. For that reason, I agreed to take on the feasibility study of SMC for practical applications. After six-plus years of research and experimentation, it became clear to me that the most widely believed misconception is “either performance or reliability” when scaling up a distributed application. This misconception is the result of the direct use of hop-by-hop communication protocols in distributed application construction.

Terminology: A hop-by-hop data protocol is a two-sided, reliable, lossless data communication protocol for transmitting data between a sender and a receiver. A crash of either the sender or the receiver causes data loss. Examples: MPI, RPC, RMI, OpenMP.

An end-to-end data protocol is a single-sided, reliable, lossless data communication protocol for transmitting data between application programs. All processors, networks, and storage available at runtime are automatically dispatched to the best-effort support of reliable communication, regardless of transient and permanent device failures. Examples: HDFS, Blockchain, Fabric, and SMC.

An active end-to-end data protocol is a single-sided, reliable, lossless data communication protocol for transmitting data and automatically synchronizing application programs. Example: SMC (AnkaCom and AnkaStore, developed in this dissertation).

Unlike the hop-by-hop protocols, the use of an end-to-end protocol forms an application-dependent overlay network. An overlay network for a distributed and parallel computing application, such as Blockchain, has been proven to defy the “common wisdom” on two important distributed computing challenges: a) extreme-scale computing without single-point failures is practically feasible, so all transaction or data losses can be eliminated; and b) extreme-scale synchronized transaction replication is practically feasible, so the CAP conjecture and theorem become irrelevant. Unlike passive overlay networks, such as HDFS and Blockchain, this dissertation study proves that an active overlay network can deliver higher performance, higher reliability, and security at the same time as the application scales up. Although application-level security is not part of this dissertation, it is easy to see that application-level end-to-end protocols fundamentally eliminate “man-in-the-middle” attacks, nullifying many well-known attacks. With the zero-single-point-failure and zero-impact synchronous replication features, SMC applications are naturally resistant to DDoS and ransomware attacks.

This dissertation explores practical implementations of the SMC concept for compute-intensive (CI) and data-intensive (DI) applications. It discloses the details of the CI and DI runtime implementations and the results of inductive computational experiments. The computational environments include the NSF Chameleon bare-metal HPC cloud and Temple's TCloud cluster.
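As a rough illustration of the end-to-end principle, the sketch below never treats work as delivered until the receiving side acknowledges it, so a crashed worker simply causes its unacknowledged work to be re-offered. The in-process queue is a stand-in assumption; the AnkaCom/AnkaStore runtimes are far more elaborate:

```python
# End-to-end dispatch: work is acknowledged only after the result exists,
# so a receiver crash loses no data; the item is re-offered to survivors.
import queue
import threading

tasks = queue.Queue()    # work not yet acknowledged by any receiver
results = queue.Queue()

def worker(fail=False):
    while True:
        item = tasks.get()
        try:
            if fail:
                raise RuntimeError("simulated worker crash")
            results.put(item * item)
            tasks.task_done()   # end-to-end acknowledgment
        except RuntimeError:
            tasks.put(item)     # unacknowledged work is re-offered
            tasks.task_done()
            return              # this worker is gone; survivors proceed

for n in range(10):
    tasks.put(n)
threading.Thread(target=worker, args=(True,), daemon=True).start()  # crashes
threading.Thread(target=worker, daemon=True).start()                # survives
tasks.join()
print(sorted(results.queue))  # all ten results despite the crash
```

Contrast this with a hop-by-hop protocol such as MPI, where a crash of either endpoint of a point-to-point transfer loses the data in flight unless the application adds its own recovery layer.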
