Global ETD Search

281	Résilience des systèmes informatiques adaptatifs : modélisation, analyse et quantification / Resilience of adaptive computer systems : modelling, analysis and quantification Excoffon, William 08 June 2018 (has links) A system that remains dependable when facing changes (new threats, updates) is called resilient. The fast evolution of systems, including embedded systems, implies modifications of applications and system configurations, in particular at the software level. Such changes may have an impact on the dependability of the system, in particular on the assumptions of the fault tolerance mechanisms. 
A system is resilient when such changes do not invalidate its dependability mechanisms; in other words, when the mechanisms already in place remain consistent despite the changes, or when any inconsistencies can be resolved rapidly. We propose in this thesis a model for resilient computing systems. Using this model, we propose a way to evaluate whether a set of fault tolerance mechanisms is able to ensure the dependability properties derived from non-functional specifications. The proposed model is then used to quantify the resilience of a system using a set of specific measures. In the last chapter we discuss the possibility of including resilience as a goal of the development process. Fault Tolerance Resilience Evaluation Metrics
282	Design and modeling of adaptive cruise control system using petri nets with fault tolerance capabilities Chandramohan, Nivethitha Amudha January 2018 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / In the automotive industry, driver assistance and active safety features are main areas of research. This thesis concentrates on designing one of the best-known ADAS features, adaptive cruise control (ACC). Feature development and analysis of the various functionalities involved in the system control are done using Petri nets. A background on past and current ACC research is noted and taken as motivation. The idea is to implement the adaptive cruise control system in a Petri net and analyze how to provide fault tolerance to the system. The system can be evaluated for various cases. The ACC technologies implemented in different cars are compared and discussed. The interaction of the ACC module with other modules in the car is explained. The cruise system's algorithm in Petri net form is used as the basis for developing the adaptive cruise control system's algorithm. The ACC system model is designed using Petri nets, and Petri net properties of the model such as place invariants, transition invariants and the reachability tree are analyzed. The results are verified using Matlab. Controllers are introduced for ideal cases and are implemented in Petri nets. Then the error cases are considered and fault tolerance techniques are carried out on the model to identify the fault places. Petri Nets ACC Cruise Control Fault tolerance ACC interface Active safety ADAS
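The Petri net notions named in the abstract above (firing, place invariants, reachability) can be illustrated with a minimal sketch. The two-place net, its place and transition names, and all function names below are hypothetical and chosen only for illustration; this is not the ACC model from the thesis.

```python
from collections import deque

# Hypothetical toy net (NOT the thesis's ACC model): a controller that
# toggles between two places, "cruise" (p0) and "follow" (p1).
# PRE[t][p]  = tokens that transition t consumes from place p
# POST[t][p] = tokens that transition t produces in place p
PRE = [[1, 0],   # t0: cruise -> follow (lead vehicle detected)
       [0, 1]]   # t1: follow -> cruise (road clear)
POST = [[0, 1],
        [1, 0]]


def enabled(marking, t):
    """A transition is enabled when every input place holds enough tokens."""
    return all(m >= need for m, need in zip(marking, PRE[t]))


def fire(marking, t):
    """Return the marking reached by firing transition t."""
    return tuple(m - pre + post
                 for m, pre, post in zip(marking, PRE[t], POST[t]))


def reachable(initial):
    """Breadth-first enumeration of the reachability set (bounded nets only)."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        for t in range(len(PRE)):
            if enabled(marking, t):
                nxt = fire(marking, t)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def is_place_invariant(weights):
    """True if no transition changes the weighted token sum."""
    return all(
        sum(w * (post - pre)
            for w, pre, post in zip(weights, PRE[t], POST[t])) == 0
        for t in range(len(PRE))
    )
```

With the initial marking `(1, 0)`, `reachable` returns the two markings `(1, 0)` and `(0, 1)`, and the weight vector `[1, 1]` is a place invariant because the total number of tokens never changes.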
283	On Fault Resilient Network-on-Chip for Many Core Systems Moriam, Sadia 24 May 2019 (has links) Rapid scaling of transistor gate sizes has increased the density of on-chip integration and paved the way for heterogeneous many-core systems-on-chip, significantly improving the speed of on-chip processing. The design of the interconnection network of these complex systems is a challenging one and the network-on-chip (NoC) is now the accepted scalable and bandwidth efficient interconnect for multi-processor systems on-chip (MPSoCs). However, the performance enhancements of technology scaling come at the cost of reliability as on-chip components particularly the network-on-chip become increasingly prone to faults. In this thesis, we focus on approaches to deal with the errors caused by such faults. The results of these approaches are obtained not only via time-consuming cycle-accurate simulations but also by analytical approaches, allowing for faster and accurate evaluations, especially for larger networks. Redundancy is the general approach to deal with faults, the mode of which varies according to the type of fault. For the NoC, there exists a classification of faults into transient, intermittent and permanent faults. Transient faults appear randomly for a few cycles and may be caused by the radiation of particles. Intermittent faults are similar to transient faults, however, differing in the fact that they occur repeatedly at the same location, eventually leading to a permanent fault. Permanent faults by definition are caused by wires and transistors being permanently short or open. Generally, spatial redundancy or the use of redundant components is used for dealing with permanent faults. Temporal redundancy deals with failures by re-execution or by retransmission of data while information redundancy adds redundant information to the data packets allowing for error detection and correction. 
Temporal and information redundancy methods are useful when dealing with transient and intermittent faults. In this dissertation, we begin with permanent faults in NoC in the form of faulty links and routers. Our approach for spatial redundancy adds redundant links in the diagonal direction to the standard rectangular mesh topology resulting in the hexagonal and octagonal NoCs. In addition to redundant links, adaptive routing must be used to bypass faulty components. We develop novel fault-tolerant deadlock-free adaptive routing algorithms for these topologies based on the turn model without the use of virtual channels. Our results show that the hexagonal and octagonal NoCs can tolerate all 2-router and 3-router faults, respectively, while the mesh has been shown to tolerate all 1-router faults. To simplify the restricted-turn selection process for achieving deadlock freedom, we devised an approach based on the channel dependency matrix instead of the state-of-the-art Duato's method of observing the channel dependency graph for cycles. The approach is general and can be used for the turn selection process for any regular topology. We further use algebraic manipulations of the channel dependency matrix to analytically assess the fault resilience of the adaptive routing algorithms when affected by permanent faults. We present and validate this method for the 2D mesh and hexagonal NoC topologies achieving very high accuracy with a maximum error of 1%. The approach is very general and allows for faster evaluations as compared to the generally used cycle-accurate simulations. In comparison, existing works usually assume a limited number of faults to be able to analytically assess the network reliability. We apply the approach to evaluate the fault resilience of larger NoCs demonstrating the usefulness of the approach especially compared to cycle-accurate simulations. 
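As a rough illustration of the matrix view of channel dependencies mentioned above, the sketch below stores the dependency relation as a Boolean matrix and checks it for cycles; an acyclic channel dependency graph is a well-known sufficient condition for deadlock freedom. This is a generic transitive-closure check written for this listing, not the algebraic method developed in the thesis, and the four-channel ring example is hypothetical.

```python
def has_dependency_cycle(dep):
    """Return True if the channel dependency relation contains a cycle.

    dep[i][j] is truthy when a packet holding channel i may next
    request channel j.
    """
    n = len(dep)
    # Warshall's algorithm: closure[i][j] becomes True once channel j
    # is reachable from channel i through one or more dependencies.
    closure = [[bool(x) for x in row] for row in dep]
    for k in range(n):
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = True
    # A channel that can reach itself lies on a dependency cycle.
    return any(closure[i][i] for i in range(n))


# Hypothetical example: four channels forming a ring 0 -> 1 -> 2 -> 3 -> 0.
ALL_TURNS = [[0, 1, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1],
             [1, 0, 0, 0]]

# Prohibiting the single turn 3 -> 0 breaks the cycle.
RESTRICTED = [[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]]
```

Here `has_dependency_cycle(ALL_TURNS)` is true and `has_dependency_cycle(RESTRICTED)` is false, which mirrors the idea behind turn-model routing: forbid just enough turns to make the dependency relation acyclic.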
Finally, we concentrate on temporal and information redundancy techniques to deal with transient and intermittent faults in the router that result in the dropping, and hence loss, of packets. Temporal redundancy is applied in the form of ARQ and retransmission of lost packets. Information redundancy is applied by the generation and transmission of redundant linear combinations of packets, known as random linear network coding. We develop an analytic model for flexible evaluation of these approaches to determine network performance parameters such as residual error rates and increased network load. The analytic model makes it possible to evaluate larger NoCs and different topologies and to investigate the advantage of network coding compared to uncoded transmissions. We further extend the work with a brief look at the problem of secure communication over the NoC. Assuming large heterogeneous MPSoCs with components from third parties, the communication is subject to active attacks in the form of packet modification and drops in the NoC routers. Devising approaches to resolve these issues, we again formulate analytic models for their flexible and accurate evaluation, with a maximum estimation error of 7%. info:eu-repo/classification/ddc/621.3 ddc:621.3
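To give a feel for the kind of quantities such an analytic model reports (residual error rate and added network load), the sketch below evaluates a textbook truncated-ARQ scheme. It assumes each transmission is dropped independently with probability `p` and that a packet is retransmitted at most `r` times; both assumptions and the function names are illustrative and are not taken from the thesis.

```python
def residual_loss(p, r):
    """Probability that a packet is still lost after the first attempt
    and r retransmissions, with independent drop probability p."""
    return p ** (r + 1)


def expected_transmissions(p, r):
    """Expected number of transmissions per packet.

    Attempt k + 1 takes place only if the first k attempts were all
    dropped, which happens with probability p ** k.
    """
    return sum(p ** k for k in range(r + 1))
```

For example, with `p = 0.5` and a single allowed retransmission, the residual loss is 0.25 and the expected load is 1.5 transmissions per packet.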
284	Securing multi-robot systems with inter-robot observations and accusations Wardega, Kacper Tomasz 24 May 2023 (has links) In various industries, such as manufacturing, logistics, agriculture, defense, search and rescue, and transportation, Multi-robot systems (MRSs) are increasingly gaining popularity. These systems involve multiple robots working together towards a shared objective, either autonomously or under human supervision. However, as MRSs operate in uncertain or even adversarial environments, and the sensors and actuators of each robot may be error-prone, they are susceptible to faults and security threats unique to MRSs. Classical techniques from distributed systems cannot detect or mitigate these threats. In this dissertation, novel techniques are proposed to enhance the security and fault-tolerance of MRSs through inter-robot observations and accusations. A fundamental security property is proposed for MRSs, which ensures that forbidden deviations from a desired multi-robot motion plan by the system supervisor are detected. Relying solely on self-reported motion information from the robots for monitoring deviations can leave the system vulnerable to attacks from a single compromised robot. The concept of co-observations is introduced, which are additional data reported to the supervisor to supplement the self-reported motion information. Co-observation-based detection is formalized as a method of identifying deviations from the expected motion plan based on discrepancies in the sequence of co-observations reported. An optimal deviation-detecting motion planning problem is formulated that achieves all the original application objectives while ensuring that all forbidden plan-deviation attacks trigger co-observation-based detection by the supervisor. A secure motion planner based on constraint solving is proposed as a proof-of-concept to implement the deviation-detecting security property. 
The security and resilience of MRSs against plan deviation attacks are further improved by limiting the information available to attackers. An efficient algorithm is proposed that verifies the inability of an attacker to stealthily perform forbidden plan deviation attacks with a given motion plan and announcement scheme. Such announcement schemes are referred to as horizon-limiting. An optimal horizon-limiting planning problem is formulated that maximizes planning lookahead while maintaining the announcement scheme as horizon-limiting. Co-observations and horizon-limiting announcements are shown to be efficient and scalable in protecting MRSs, including systems with hundreds of robots, as evidenced by a case study in a warehouse setting. Lastly, the Decentralized Blocklist Protocol (DBP), a method for designing Byzantine-resilient decentralized MRSs, is introduced. DBP is based on inter-robot accusations and allows cooperative robots to identify misbehavior through co-observations and share this information through the network. The method is adaptive to the number of faulty robots and is widely applicable to various decentralized MRS applications. It also permits fast information propagation, requires fewer cooperative observers of application-specific variables, and reduces the worst-case connectivity requirement, making it more scalable than existing methods. Empirical results demonstrate the scalability and effectiveness of DBP in cooperative target tracking, time synchronization, and localization case studies with hundreds of robots. The techniques proposed in this dissertation enhance the security and fault-tolerance of MRSs operating in uncertain and adversarial environments, aiding in the development of secure MRSs for emerging applications. Computer engineering Byzantine fault tolerance Multi-agent systems Multi-robot systems
285	Analyzing and Evaluating the Resilience of Scheduling Scientific Applications on High Performance Computing Systems using a Simulation-based Methodology Sukhija, Nitin 09 May 2015 (has links) Large scale systems provide a powerful computing platform for solving large and complex scientific applications. However, the inherent complexity, heterogeneity, wide distribution, and dynamism of the computing environments can lead to performance degradation of the scientific applications executing on these computing systems. Load imbalance arising from a variety of sources such as application, algorithmic, and systemic variations is one of the major contributors to their performance degradation. In general, load balancing is achieved via scheduling. Moreover, frequently occurring resource failures drastically affect the execution of applications running on high performance computing systems. Therefore, the study of deploying support for integrated scheduling and fault-tolerance mechanisms for guaranteeing that applications deployed on computing systems are resilient to failures becomes of paramount importance. Recently, several research initiatives have started to address the issue of resilience. However, the major focus of these efforts was geared more toward achieving system level resilience with less emphasis on achieving resilience at the application level. Therefore, it is increasingly important to extend the concept of resilience to the scheduling techniques at the application level for establishing a holistic approach that addresses the performability of these applications on high performance computing systems. This can be achieved by developing a comprehensive modeling framework that can be used to evaluate the resiliency of such techniques on heterogeneous computing systems for assessing the impact of failures as well as workloads in an integrated way. 
This dissertation presents an experimental methodology based on discrete event simulation for the analysis and the evaluation of the resilience of scheduling scientific applications on high performance computing systems. With the aid of the methodology a wide class of dependencies existing between application and computing system are captured within a deterministic model for quantifying the performance impact expected from changes in application and system characteristics. Ideally, the results obtained by employing the proposed simulation-based performance prediction framework enabled an introspective design and investigation of scheduling heuristics to reason about how to best fully optimize various often antagonistic objectives, such as minimizing application makespan and maximizing reliability. framework discrete event simulation performance modeling fault-tolerance reliability makespan heterogeneous computing systems scientific applications resilience
286	Towards Model-Based Fault Management for Computing Systems Jia, Rui 07 May 2016 (has links) Large scale distributed computing systems have been extensively utilized to host critical applications in the fields of national defense, finance, scientific research, commerce, etc. However, applications in distributed systems face the risk of service outages due to inevitable faults. Without proper fault management methods, faults can lead to significant revenue loss and degradation of Quality of Service (QoS). An ideal fault management solution should guarantee fast and accurate fault diagnosis, scalability in distributed systems, portability for a variety of systems, and the versatility of recovering different types of faults. This dissertation presents a model-based fault management structure which automatically recovers computing systems from faults. This structure can recover a system from common faults while minimizing the impact on the system’s QoS. It covers all stages of fault management including fault detection, identification and recovery. It also has the flexibility to incorporate various fault diagnosis methods. When faults occur, the approach identifies fault types and intensity, and it accordingly computes the optimal recovery plan with minimum performance degradation, based on a cost function that defines performance objectives and a predictive control algorithm. The fault management approach has been verified on a centralized Web application testbed and a distributed big data processing testbed with four types of simulated faults: memory leak, network congestion, CPU hog and disk failure. The feasibility of the fault recovery control algorithm is also verified. Simulation results show that our approach enabled effective automatic recovery from faults. Performance evaluation reveals that CPU and memory overhead of the fault management process is negligible. 
To let domain engineers conveniently apply the proposed fault management structure to their specific systems, a component-based modeling environment is developed. The meta-model of the fault management structure is developed with the Unified Modeling Language as an abstraction of a general fault recovery solution for computing systems. It defines the fundamental reusable components that comprise such a system, including the connections among them, the attributes of each component, and constraints. The meta-model can be interpreted into a user-friendly graphical modeling environment for creating application models of practical domain-specific systems and generating executable code for them. Component-based Modeling Quality of Service Fault Diagnosis Autonomic Computing Self-healing Systems Fault Tolerance
287	Using Duplication with Compare for On-line Error Detection in FPGA-based Designs McMurtrey, Daniel L. 06 December 2006 (has links) (PDF) Space destined FPGA-based systems must employ redundancy techniques to account for the effects of upsets caused by radiated environments. Error detection techniques can be used to alert external systems to the presence of these upsets. Readback with compare is an error detection technique commonly employed in FPGA-based designs. This work introduces duplication with compare (DWC) as an automated on-line error detection technique that can be used as an alternative to readback with compare. This work also introduces a set of metrics that is used to quantify the effectiveness and coverage of this error detection technique. A tool is presented that automatically inserts duplication with compare into a user's design. Duplication with compare is shown to correctly detect over 99.9% of errors caused by configuration upsets at a hardware cost of approximately 2X. System designers can apply duplication with compare to designs using this tool to increase the reliability and availability of their systems while minimizing resource usage and power. error detection FPGA SEU reliability FPGA reliability soft errors fault tolerance Electrical and Computer Engineering
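The principle behind duplication with compare can be shown with a small software sketch: two copies of the same logic receive the same inputs, and a comparator raises an error flag when their outputs differ. The 8-bit adder and the stuck-at-1 output bit below are hypothetical stand-ins for FPGA logic and a configuration upset; this is not the insertion tool described in the abstract.

```python
def duplicate_with_compare(module_a, module_b):
    """Wrap two copies of a module and compare their outputs.

    Returns a function yielding (output of copy A, error flag).
    """
    def checked(*inputs):
        out_a = module_a(*inputs)
        out_b = module_b(*inputs)
        return out_a, out_a != out_b
    return checked


def adder(a, b):
    """Fault-free 8-bit adder."""
    return (a + b) & 0xFF


def upset_adder(a, b):
    """Hypothetical upset: bit 0 of the output is stuck at 1."""
    return adder(a, b) | 0x01


healthy = duplicate_with_compare(adder, adder)
faulty = duplicate_with_compare(adder, upset_adder)
```

`faulty(2, 2)` flags an error because the copies produce 4 and 5, whereas `faulty(2, 3)` does not, since the stuck bit happens to match the correct output. A comparator can only detect an upset on inputs that activate it.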
288	Sustainable Fault-handling Of Reconfigurable Logic Using Throughput-driven Assessment Sharma, Carthik 01 January 2008 (has links) A sustainable Evolvable Hardware (EH) system is developed for SRAM-based reconfigurable Field Programmable Gate Arrays (FPGAs) using outlier detection and group testing-based assessment principles. The fault diagnosis methods presented herein leverage throughput-driven, relative fitness assessment to maintain resource viability autonomously. Group testing-based techniques are developed for adaptive input-driven fault isolation in FPGAs, without the need for exhaustive testing or coding-based evaluation. The techniques maintain the device operational, and when possible generate validated outputs throughout the repair process. Adaptive fault isolation methods based on discrepancy-enabled pair-wise comparisons are developed. By observing the discrepancy characteristics of multiple Concurrent Error Detection (CED) configurations, a method for robust detection of faults is developed based on pairwise parallel evaluation using Discrepancy Mirror logic. The results from the analytical FPGA model are demonstrated via a self-healing, self-organizing evolvable hardware system. Reconfigurability of the SRAM-based FPGA is leveraged to identify logic resource faults which are successively excluded by group testing using alternate device configurations. This simplifies the system architect's role to definition of functionality using a high-level Hardware Description Language (HDL) and system-level performance versus availability operating point. System availability, throughput, and mean time to isolate faults are monitored and maintained using an Observer-Controller model. Results are demonstrated using a Data Encryption Standard (DES) core that occupies approximately 305 FPGA slices on a Xilinx Virtex-II Pro FPGA. 
With a single simulated stuck-at-fault, the system identifies a completely validated replacement configuration within three to five positive tests. The approach demonstrates a readily-implemented yet robust organic hardware application framework featuring a high degree of autonomous self-control. evolvable hardware fault tolerance group testing organic systems Computer Engineering Engineering
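The group-testing idea in the abstract above, excluding resources in groups rather than one at a time, can be sketched with classic adaptive halving. The sketch assumes exactly one faulty resource and an oracle `group_fails` that reports whether a configuration using a given group of resources shows a discrepancy; both are illustrative assumptions, and this is not the thesis's algorithm.

```python
def isolate_single_fault(resources, group_fails):
    """Locate the single faulty resource by repeated halving.

    group_fails(group) must return True when the faulty resource is in
    `group`. Returns (faulty resource, number of group tests performed).
    """
    candidates = list(resources)
    tests = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        tests += 1
        if group_fails(half):
            candidates = half                     # fault is in the tested half
        else:
            candidates = candidates[len(half):]   # fault is in the other half
    return candidates[0], tests


# Hypothetical usage: 16 logic resources, resource 11 is faulty.
FAULTY = 11
result = isolate_single_fault(range(16), lambda group: FAULTY in group)
```

With 16 resources the fault is isolated in 4 group tests, compared with up to 15 tests when resources are checked one by one.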
289	Performance and availability trade-offs in fault-tolerant middleware Szentiványi, Diana January 2002 (has links) Distributing the functionality of an application is common practice. Systems that are built with this feature in mind also have to provide high levels of dependability. One way of assuring the availability of services is to tolerate faults in the system, thereby avoiding failures. Building distributed applications is not an easy task, and providing fault tolerance is even harder. Using middleware as a mediator between hardware and operating systems on the one hand and high-level applications on the other is a solution to these difficult problems. It can help application writers by automatically generating code that supports, for example, fault tolerance mechanisms, and by offering interoperability and language independence. For over twenty years, the research community has been producing results in this area. However, experimental studies of different platforms are mostly performed using simple made-up applications. Also, especially in the case of CORBA, there is no fault-tolerant middleware that both fully conforms to the standard and is well studied in terms of trade-offs. This thesis presents a fault-tolerant CORBA middleware built and evaluated using a realistic application running on top of it. It also contains results obtained from experiments with an alternative infrastructure implementing a robust fault-tolerant algorithm using basic CORBA. In the first infrastructure, a problem is the existence of single points of failure. On the other hand, overheads and recovery times fall within acceptable ranges. When using the robust algorithm, the problem of single points of failure disappears. The problem here is the memory usage, as well as overhead values and recovery times that can become quite long. / Report code: LiU-TEK-LIC-2002:55. Programming Distributed systems fault tolerance middleware CORBA code supporting Computer Sciences Datavetenskap (datalogi)
290	Pipelined Byzantine Fault Tolerance and Applications Adithya Bhat (17583018) 07 December 2023 (has links) Practically, Byzantine faults are not assumed in cloud applications. Byzantine fault-tolerance adds significant cryptographic, communication, throughput, and latency overheads to applications, contributing to the resistance towards its widespread adoption. Existing Byzantine-fault tolerant protocols focus on optimal latency or optimal communication while ignoring the throughput and cryptographic overheads. In this thesis, we explore pipelining for Byzantine fault-tolerant applications. Pipelining tasks is a common optimization in distributed systems that involves executing tasks in stages. The idea is that instead of executing a task in an iteration as an atomic unit, we split the execution into stages and execute all stages of different tasks per iteration. We observe significant performance benefits if executing later stages of a task helps other tasks in earlier stages, saving effort in each stage. The length of the pipeline, i.e., the number of stages, determines the latency of an individual task. However, if the pipeline improves the execution of every stage enough, then the latency improves. We primarily explore three Byzantine Fault Tolerant (BFT) applications with pipelining: (i) unique chain-based State Machine Replication protocols: Apollo, Artemis, Leto, and Zeus; (ii) energy-efficient State Machine Replication: EESMR; and (iii) random beacon protocols: GRandPiper, BRandPiper, and OptRand. We design them with a pipeline-first approach to improve the throughput, cryptographic, and communication costs at every stage of the pipeline. 
With respect to latency, we show (i) pipelined SMR protocols where our pipeline stages have constant cryptographic and linear communication costs, allowing our protocols to outperform state-of-the-art BFT-SMR protocols in throughput; (ii) pipelined SMR protocols with techniques to make each stage of the pipeline independent, thus achieving demonstrable energy efficiency while allowing an unbounded number of non-interactive parallel proposals; and (iii) reduced latencies for reconfiguration-friendly random beacons by using two pipelines, an SMR pipeline to commit and a beacon pipeline to produce random numbers, and decoupling the two pipelines, thereby removing the impact of the high-latency SMR pipeline on the latency of the randomness output by the system. Cryptography Distributed systems and algorithms Performance evaluation State Machine Replication Random Beacon Byzantine fault tolerance (BFT)
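The pipelining idea described in the abstract above, where each iteration executes one stage of several different tasks, can be sketched as a simple schedule generator. The function name and the `(task, stage)` representation are hypothetical; the sketch shows only the generic scheduling pattern, not any of the BFT protocols named in the thesis.

```python
def pipeline_schedule(num_tasks, num_stages):
    """Return, per iteration, the list of (task, stage) pairs executed.

    Task i enters stage 0 in iteration i and advances one stage per
    iteration, so iteration `it` runs stage s of task it - s.
    """
    schedule = []
    for it in range(num_tasks + num_stages - 1):
        active = [(it - s, s)
                  for s in range(num_stages)
                  if 0 <= it - s < num_tasks]
        schedule.append(active)
    return schedule
```

For three tasks and two stages, the schedule has four iterations instead of the six needed when every task runs both stages before the next one starts. Each task still takes `num_stages` iterations to finish, which is the latency cost of a longer pipeline that the abstract mentions.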
