161

Cost-effective dynamic repair for FPGAs in real-time systems

Santos, Leonardo Pereira January 2016 (has links)
Field-Programmable Gate Arrays (FPGAs) are widely used in digital systems due to characteristics such as flexibility, low cost and high density. These characteristics stem from the use of SRAM cells in the configuration memory, which makes these devices susceptible to radiation-induced errors such as SEUs. TMR is the most widely used mitigation technique, but it has a high cost in both area and energy, restricting its use in low-cost and/or low-energy applications. As an alternative to TMR, we propose the use of DMR associated with a repair mechanism for the FPGA configuration memory called scrubbing. The repair of FPGAs in real-time systems presents a specific set of challenges. Besides guaranteeing the correct computation of data, this computation must be carried out entirely within the available time slot, finishing before a time limit (deadline). The difference between the computation time and the deadline is called the slack and is the time available to repair the system. This work uses dynamic shifted scrubbing, which aims to maximize the probability of repairing the FPGA configuration memory within the available slack, based on an error diagnosis. Shifted scrubbing has already been used with fine-grained diagnostic techniques (NAZAR, 2015). This work proposes the use of coarse-grained diagnostic techniques for shifted scrubbing, avoiding the performance penalties and area costs associated with fine-grained techniques. Circuits from the MCNC suite were protected with the proposed techniques and subjected to error-injection campaigns (NAZAR; CARRO, 2012a). The collected data were analyzed and the best scrubbing starting positions were calculated for each circuit. Failure-in-Time (FIT) rates were calculated to compare the different proposed diagnostic techniques. The results confirmed the initial hypothesis of this work: the reduction in the number of sensitive bits and the low degradation of the clock period reduced the FIT rate compared with fine-grained techniques. Finally, the three proposed techniques are compared in terms of the performance and area costs associated with each one.
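The shifted-scrubbing idea described above can be sketched in a few lines. This is a simplified illustration, not the thesis' implementation: the frame count, per-frame probabilities, and timing values are made-up assumptions, and the scrubber is modeled as a circular walk over configuration-memory frames.

```python
def best_scrub_start(frame_error_prob, slack, frame_scrub_time):
    """Pick the scrubbing start frame that maximizes the probability that
    the faulty frame (as estimated by the error diagnosis) is rewritten
    before the slack runs out.  The scrubber walks the configuration
    memory circularly from the chosen start frame."""
    n = len(frame_error_prob)
    # number of frames that can be rewritten within the available slack
    k = min(n, int(slack // frame_scrub_time))
    best_start, best_prob = 0, -1.0
    for start in range(n):
        covered = sum(frame_error_prob[(start + i) % n] for i in range(k))
        if covered > best_prob:
            best_start, best_prob = start, covered
    return best_start, best_prob
```

When the diagnosis concentrates the error probability in a few frames, the chosen start positions the scrubber over that hot region, which is the intuition behind computing a best starting position per circuit.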
162

The coordinated control of autonomous agents

Abel, Ryan Orlin 01 December 2010 (has links)
This thesis considers the coordinated control of autonomous agents. The agents are modeled as double integrators, one for each Cartesian dimension. The goal is to force the agents to converge to a formation specified by their desired relative positions. To this end, a pair of one-step-ahead optimization-based control laws is developed. The control algorithms produce a communication topology that mirrors the geometric formation topology, due to the careful choice of the minimized cost functions. Through this equivalence, a natural understanding of the relationship between the geometric formation topology and the communication infrastructure is gained. It is shown that the control laws are stable and guarantee convergence for all viable formation topologies. Additionally, velocity constraints can be added to allow the formation to follow fixed or arbitrary time-dependent velocities. Both control algorithms require only local information exchange. As additional agents attach to the formation, only those agents that share position constraints with the joining agents need to adjust their control laws. When redundancy is incorporated into the formation topology, it is possible for the system to survive the loss of agents or communication channels. In the event that an agent drops out of the formation, only the agents with position interdependence on the lost agent need to adjust their control laws. Finally, if a communication channel is lost, only the agents that share that communication channel must adjust their control laws. The first control law falls into the category of distributed control, since it requires either global information exchange to compute the formation size or a priori knowledge of the largest possible formation. The algorithm uses the network size to penalize the control input for each formation.
When using a priori knowledge, it is shown that additional redundancy not only adds robustness to loss of agents or communication channels, but it also decreases the settling times to the desired formation. Conversely, the overall control strategy suffers from sluggish response when the network is small with respect to the largest possible network. If global information exchange is used, scalability suffers. The second control law was developed to address the negative aspects of the first. It is a fully decentralized controller, as it does not require global information exchange or any a priori knowledge.
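The flavor of such formation control can be illustrated with a toy simulation. This is a hedged sketch: it uses a simple consensus-style feedback law rather than the thesis' one-step-ahead optimization, and the gains, topology, and spacing below are illustrative assumptions (one spatial dimension for brevity).

```python
import numpy as np

def simulate_formation(adj, desired, steps=4000, dt=0.01, kp=1.0, kv=2.0):
    """Each agent is a double integrator: x'' = u.  A consensus-style law
    (a stand-in for the thesis' optimization-based controllers) drives
    relative positions toward the desired offsets using only information
    from neighbors in the adjacency matrix `adj`."""
    n = len(desired)
    x = np.random.default_rng(0).uniform(-5, 5, n)  # random start positions
    v = np.zeros(n)
    for _ in range(steps):
        u = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if adj[i][j]:
                    # penalize deviation from the desired relative position
                    u[i] -= kp * ((x[i] - x[j]) - (desired[i] - desired[j]))
            u[i] -= kv * v[i]  # damping so the formation settles
        v += dt * u            # forward-Euler integration
        x += dt * v
    return x
```

Because each agent only reads its neighbors' states, the communication topology mirrors the formation topology, which is the equivalence the abstract highlights.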
163

Assessing Apache Spark Streaming with Scientific Data

Dahal, Janak 06 August 2018 (has links)
Processing real-world data requires the ability to analyze data in real time. Data processing engines like Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is increasingly becoming a popular choice, as it can stream and analyze a significant amount of data. To showcase and assess this ability of Spark, various metrics were designed and measured using data collected from the USGODAE data catalog. The latency of streaming in Apache Spark was measured and analyzed for varying numbers of nodes in the cluster. Scalability was monitored by adding and removing nodes in the middle of a streaming job. Fault tolerance was verified by stopping nodes in the middle of a job and making sure that the job was rescheduled and completed on other nodes. A full-stack application was designed to automate data collection, data processing, and visualization of the results. The Google Maps API was used to visualize results by color-coding the world map with values from various analytics.
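The latency behavior of micro-batch streaming (the model Spark Streaming uses) can be sketched without Spark itself: records arriving during an interval are only processed when the interval closes, so a record's latency depends on where in the interval it arrives. This toy model is a stand-in for the thesis' cluster measurements; the arrival times and interval length are illustrative assumptions.

```python
def batch_latencies(arrival_times, batch_interval):
    """Micro-batch model: records arriving during one interval are
    processed together when the interval closes, so a record's minimum
    latency is (end of its batch) - (its arrival time)."""
    latencies = []
    for t in arrival_times:
        batch_end = (int(t // batch_interval) + 1) * batch_interval
        latencies.append(batch_end - t)
    return latencies
```

A record arriving just after a batch opens waits nearly a full interval, while one arriving just before the close waits almost nothing; real measured latency adds processing and scheduling time on top of this floor.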
164

Quadded GasP: a Fault Tolerant Asynchronous Design

Scheiblauer, Kristopher S. 27 February 2017 (has links)
As device scaling continues, process variability and defect densities are becoming increasingly challenging for circuit designers to contend with. Variability reduces timing margins, making it difficult and time-consuming to meet design specifications. Defects can cause degraded performance or incorrect operation, resulting in circuit failure. Consequently, test times are lengthened and production yields are reduced. This work assesses the combination of two concepts, self-timed asynchronous design and fault tolerance, as a possible solution to both variability and defects. Asynchronous design is not as sensitive to variability as synchronous design, while fault tolerance allows continued functional operation in the presence of defects. GasP is a self-timed asynchronous design style that provides high performance in a simple circuit. Quadded Logic is a gate-level fault tolerant methodology. This study presents Quadded GasP, a fault tolerant asynchronous design. This work demonstrates that Quadded GasP circuits continue to function within performance expectations when faults are present. The increased area and reduced performance costs of Quadded GasP are also evaluated. These results show Quadded GasP circuits are a viable option for managing process variation and defects. Application of these circuits will provide decreased development and test times, as well as increased yield.
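The gate-level masking idea behind Quadded Logic can be illustrated with a toy simulation: each gate exists in four copies, and a restoring stage combines pairs of copies so that a single faulty copy is outvoted. The pairing below is a simplified stand-in for illustration, not Tryon's exact quadded interconnect.

```python
def quad_and(a_copies, b_copies, faults):
    """Four replicated AND gates; faults[i], if not None, forces copy i's
    output to that value (a stuck-at fault injected on that copy)."""
    outs = []
    for i in range(4):
        o = a_copies[i] & b_copies[i]
        if faults[i] is not None:
            o = faults[i]
        outs.append(o)
    return outs

def restore(outs):
    """Restoring stage: each next-stage input ORs one pair of copies and
    ANDs it with the complementary pair, so any single erroneous copy
    (of either polarity) is masked."""
    return [(outs[0] | outs[1]) & (outs[2] | outs[3]),
            (outs[0] | outs[2]) & (outs[1] | outs[3]),
            (outs[1] | outs[3]) & (outs[0] | outs[2]),
            (outs[2] | outs[3]) & (outs[0] | outs[1])]
```

A single stuck-at-0 among four correct 1s still produces a 1 from every restoring term, and symmetrically for a stuck-at-1 among 0s, which is the fault-masking property the thesis relies on at the gate level.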
165

Timing Predictability in Future Multi-Core Avionics Systems

Löfwenmark, Andreas January 2017 (has links)
With more functionality added to safety-critical avionics systems, new platforms are required to offer the computational capacity needed. Multi-core platforms offer a potential that is now being explored, but they pose significant challenges with respect to predictability, since shared resources (such as memory) are accessed from several cores in parallel. Multi-core processors also suffer from higher sensitivity to permanent and transient faults due to shrinking transistor sizes. This thesis addresses several of these challenges. First, we review major contributions that assess the impact of fault tolerance on the worst-case execution time of processes running on a multi-core platform, in particular works that evaluate the timing effects using fault injection methods. We conclude that few works address the intricate timing effects that appear when inter-core interference due to simultaneous accesses to shared resources is combined with fault tolerance techniques. We assess the applicability of the methods to COTS multi-core processors used in avionics, and identify dark spots on the research map of the joint problem of hardware reliability and timing predictability for multi-core avionics systems. Next, we argue that the memory requests issued by the real-time operating system (RTOS) must be considered in resource-monitoring systems to ensure proper execution on all cores. We also adapt and extend an existing method for worst-case response time analysis to fulfill the specific requirements of avionics systems, relaxing the requirement of private memory banks to also allow cores to share memory banks.
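The kind of worst-case response time analysis the thesis extends can be sketched with the classic fixed-point iteration for fixed-priority preemptive scheduling. This is the textbook formulation, without the memory-interference terms the thesis adds; the task parameters below are illustrative, and the task set is assumed schedulable so the iteration converges.

```python
import math

def response_time(C, task_set):
    """Classic worst-case response time fixed point for a task with WCET C
    preempted by higher-priority tasks given as (Cj, Tj) pairs:
    R = C + sum over hp tasks of ceil(R / Tj) * Cj."""
    R = C
    while True:
        R_next = C + sum(math.ceil(R / Tj) * Cj for Cj, Tj in task_set)
        if R_next == R:        # fixed point reached
            return R
        R = R_next
```

Analyses of this shape are extended in the thesis with terms that bound the extra delay from cores contending for shared memory banks.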
166

Extensions and refinements of stabilization

Dasgupta, Anurag 01 December 2009 (has links)
Self-stabilization is a concept of fault tolerance in distributed computing. A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legal state in a finite number of steps and to remain in the set of legal states thereafter. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its objective. Moreover, a self-stabilizing algorithm does not have to be initialized, as it eventually starts to behave correctly. In this thesis, we focus on extensions and refinements of self-stabilization by studying two non-traditional aspects of it. In traditional self-stabilizing distributed systems [13], the inherent assumption is that all processes run predefined programs mandated by an external agency which is the owner or the administrator of the entire system. The model works fine for solving problems when processes cooperate with one another toward a global goal. In modern times, it is quite common to have a distributed system spanning multiple administrative domains, in which processes have selfish motives to optimize their own payoff. Maximizing individual payoffs under the umbrella of stabilization characterizes the notion of selfish stabilization. We investigate the impact of selfishness on the existence and the complexity of stabilizing solutions to specific problems in this thesis. Our model of selfishness centers on a graph where the set of nodes is divided into subsets of distinct colors, each having their own unique perception of the edge costs. We study the problems of constructing a rooted shortest path tree and a maximum flow tree on this model, and demonstrate that when processes are selfish, there is no guarantee that a solution will exist.
We demonstrate that the complexity of determining the existence of a stabilizing solution is NP-complete, carefully characterize a fraction of such cases, and propose the construction of stabilizing solutions wherever such solutions are feasible. Fault containment and system availability are important issues in today's distributed systems. In this thesis, we show how fault-containment can be added to weakly stabilizing distributed systems. We present solutions using a randomized scheduler, and illustrate techniques to bias the random schedules so that the system recovers from all single faults in a time independent of the size of the system, and the effect of the failure is contained within constant distance from the faulty node with high probability (this probability can be controlled by a user defined tuning parameter). Using this technique, we solve two problems: one is the persistent-bit problem, and the other is the leader election problem.
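A classic concrete example of self-stabilization is Dijkstra's K-state token ring, sketched below: from any initial configuration the ring converges to a legal state with exactly one privileged machine (one "token"). This is the standard algorithm under a central daemon, included for illustration; it is not one of the thesis' selfish-stabilization constructions, and the initial states are arbitrary by design.

```python
def stabilize(states, k, steps=1000):
    """Dijkstra's K-state token ring (requires k >= number of machines).
    Machine 0 is privileged when its state equals its left neighbor's and
    then increments mod k; machine i > 0 is privileged when its state
    differs from machine i-1's and then copies it."""
    n = len(states)
    s = list(states)
    for _ in range(steps):
        # central daemon: fire one privileged machine per step
        if s[0] == s[n - 1]:
            s[0] = (s[0] + 1) % k
        else:
            for i in range(1, n):
                if s[i] != s[i - 1]:
                    s[i] = s[i - 1]
                    break
    return s

def privileges(s):
    """Count privileged machines; exactly one in a legal state."""
    n = len(s)
    count = 1 if s[0] == s[n - 1] else 0
    return count + sum(1 for i in range(1, n) if s[i] != s[i - 1])
```

Starting from a corrupted configuration, the extra privileges are absorbed at machine 0 and the ring settles into circulating a single token, which is exactly the "recover from any transient fault" guarantee described above.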
167

Safety and hazard analysis in concurrent systems

Rao, Shrisha 01 January 2005 (has links)
Safety is a well-known and important class of property of software programs, and of systems in general. The basic notion that informs this work is that the time to think about safety is when it still exists but could be lost. The aim is not just to analyse safety as existing or not in a given system state, but also in the sense that a system is presently safe but becoming less so. Safety as considered here is not restricted to one type of property, and indeed for much of the discussion it does not matter what types of measures are used to assess safety. The work done here lays a theoretical and mathematical foundation for static analyses of systems that further safety. This is done using tools from lattice theory applied to the poset of system states partially ordered by reachability. Such analyses are common with respect to other kinds of properties (e.g., with abstract interpretations of software functioning), but there does not seem to exist a formalism that permits them specifically for safety. Using the basic analytical tools developed, a study is made of the problem of composing systems from components. Three types of composition are noted: direct sum, direct product, and exponentiation; the first two are treated in some depth. It is shown that the set of all systems formed with the direct sum and direct product operators can be specified by a BNF grammar. A three-valued "safety logic" is specified, with which the safety and fault tolerance of composed systems can be computed from the system composition. It is also shown that the set of all systems forms separate monoids (in the sense familiar to mathematicians), and that other monoids can be derived based on equivalence classes of systems. The example of a train approaching a railroad crossing, where a gate must be closed prior to the train's arrival and opened after its exit, is considered and analysed.
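One plausible reading of a three-valued safety logic for composition can be sketched as follows. The value names and the choice of min/max for the two composition operators are assumptions made for illustration, not the thesis' actual definitions.

```python
# Three truth values ordered by "how safe" (illustrative names)
UNSAFE, DEGRADING, SAFE = 0, 1, 2

def direct_product(a, b):
    """Both components must operate together, so the composite is only
    as safe as its weakest part (meet in the safety ordering)."""
    return min(a, b)

def direct_sum(a, b):
    """Redundant alternatives: the safer component determines the
    composite (join in the safety ordering)."""
    return max(a, b)
```

Under this reading, adding a redundant component via direct sum can only improve the composite's safety value, while extending a product composition can only degrade it, which matches the intuition that fault tolerance comes from redundancy.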
168

Analysis of How Mobile Robots Fail in the Field

Carlson, Jennifer 03 March 2004 (has links)
The considerable risk to human life associated with modern military operations in urban terrain (MOUT) and urban search and rescue (USAR) has led professionals in these domains to explore the use of robots to improve safety. Recent studies on mobile robot use in the field have shown a noticeable lack of reliability in real field conditions. Improving mobile robot reliability for applications such as USAR and MOUT requires an understanding of how mobile robots fail in field environments. This paper provides a detailed investigation of how ground-based mobile robots fail in the field. Forty-four representative examples of failures from 13 studies of mobile robot reliability in the USAR and MOUT domains are gathered, examined, and classified. A novel taxonomy sufficient to cover any failure a ground-based mobile robot may encounter in the field is presented. This classification scheme draws from established standards in the dependability computing [30] and human-computer interaction [40] communities, as well as recent work [6] in the robotics domain. Both physical failures (failures within the robotic system) and human failures are considered. Overall robot reliability in field environments is low, with a mean time between failures (MTBF) of between 6 and 20 hours, depending on the criteria used to determine whether a failure has occurred. Common issues with existing platforms appear to be the following: unstable control systems, chassis and effectors designed and tested for a narrow range of environmental conditions, limited wireless communication range in urban environments, and insufficient wireless bandwidth. Effectors and the control system are the most common sources of physical failures. Of the human failures examined, slips are more common than mistakes. Two-thirds of the failures examined in [6] and [7] could be repaired in the field. Failures that suspended the robot's task until the repair was completed were also common, accounting for 94% of the failures reported in [13].
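The reliability bookkeeping behind such a study is straightforward to sketch: given total field hours and a classified failure log, the MTBF and the breakdowns follow directly. The record fields below are illustrative assumptions, not the paper's actual data schema.

```python
from collections import Counter

def summarize_failures(total_hours, records):
    """Aggregate a field-failure log into the kinds of figures reported
    in such studies: MTBF, failures by source, and the fraction that
    could be repaired in the field."""
    n = len(records)
    by_source = Counter(r["source"] for r in records)
    repairable = sum(1 for r in records if r["field_repairable"]) / n
    return {"mtbf": total_hours / n,
            "by_source": by_source,
            "field_repairable_fraction": repairable}
```

Note that the MTBF figure is sensitive to the failure criterion used, which is why the study reports a 6-to-20-hour range rather than a single number.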
169

Fault-Tolerant Average Execution Time Optimization for General-Purpose Multi-Processor System-On-Chips

Väyrynen, Mikael January 2009 (has links)
Due to semiconductor technology development, fault tolerance is important not only for safety-critical systems but also for general-purpose (non-safety-critical) systems. However, instead of guaranteeing that deadlines are always met, for general-purpose systems it is important to minimize the average execution time (AET) while ensuring fault tolerance. For a given job and a soft (transient) no-error probability, we define mathematical formulas for the AET using voting (active replication), rollback-recovery with checkpointing (RRC), and a combination of these (CRV), where bus communication overhead is included. And, for a given multi-processor system-on-chip (MPSoC), we define integer linear programming (ILP) models that minimize the AET, including bus communication overhead, when: (1) selecting the number of checkpoints when using RRC or a combination that includes RRC, (2) finding the number of processors and the job-to-processor assignment when using voting or a combination that includes voting, and (3) selecting the fault tolerance scheme (voting, RRC or CRV) for each job. Experiments demonstrate significant savings in AET.
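A simplified AET model for RRC can be sketched as follows, assuming the job is split into equal segments, each checkpoint costs a fixed overhead, transient errors arrive as a Poisson process, and a detected error rolls back only the current segment. This is an illustrative model, not the thesis' exact formulas, and it omits the bus communication overhead the thesis includes.

```python
import math

def expected_execution_time(T, n, c, lam):
    """AET of a job of fault-free length T split into n segments, each
    followed by a checkpoint of cost c, with transient errors arriving
    at Poisson rate lam; each segment is retried until error-free."""
    seg = T / n + c
    p_ok = math.exp(-lam * seg)   # probability a segment attempt is clean
    return n * seg / p_ok         # geometric number of attempts per segment

def best_checkpoint_count(T, c, lam, n_max=50):
    """Exhaustively pick the checkpoint count minimizing the AET, the
    decision the thesis makes with ILP models instead."""
    return min(range(1, n_max + 1),
               key=lambda n: expected_execution_time(T, n, c, lam))
```

The trade-off is visible in the formula: more checkpoints shrink the work lost per rollback but add overhead, so the AET has an interior minimum in n.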
170

A tool for automatic formal analysis of fault tolerance

Nilsson, Markus January 2005 (has links)
The use of computer-based systems is rapidly increasing, and such systems can now be found in a wide range of applications, including safety-critical ones such as cars and aircraft. To make the development of such systems more efficient, there is a need for tools for automatic safety analysis, such as analysis of fault tolerance.

In this thesis, a tool for automatic formal analysis of fault tolerance was developed. The tool is built on top of the existing development environment for the synchronous language Esterel, and provides an output that can be visualised in the Item toolkit for fault tree analysis (FTA). The development of the tool demonstrates how fault tolerance analysis based on formal verification can be automated. The generated output from the fault tolerance analysis can be represented as a fault tree that is familiar to engineers from traditional FTA. The work also demonstrates that interesting attributes of the relationship between a critical fault combination and the input signals can be generated automatically.

Two case studies were used to test and demonstrate the functionality of the developed tool. A fault tolerance analysis was performed on a hydraulic leakage detection system, which is a real industrial system, and also on a synthetic system modeled for this purpose.
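The fault-tree representation mentioned above can be evaluated bottom-up once basic-event probabilities are known. The sketch below assumes independent basic events and plain AND/OR gates; it illustrates FTA evaluation in general, not the tool's actual output format.

```python
def ft_prob(node, p):
    """Evaluate a fault tree bottom-up assuming independent basic events.
    A node is either a basic-event name (looked up in p) or a tuple
    ("AND" | "OR", [children])."""
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probs = [ft_prob(c, p) for c in children]
    out = 1.0
    if gate == "AND":
        for q in probs:           # all children must fail
            out *= q
        return out
    for q in probs:               # OR: 1 minus "no child fails"
        out *= (1.0 - q)
    return 1.0 - out
```

Critical fault combinations of the kind the tool extracts correspond to the minimal cut sets of such a tree, i.e. the smallest sets of basic events that make the top event occur.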
