111

Decentralized Crash-Resilient Runtime Verification

Kazemlou, Shokoufeh January 2017 (has links)
This is the final revision of my M.Sc. Thesis. / Runtime verification is a technique to extract information from a running system in order to detect executions violating a given correctness specification. In this thesis, we study distributed synchronous/asynchronous runtime verification of systems. In our setting, a set of distributed monitors have only partial views of a large system and are subject to failures. In this context, it is unavoidable that monitors may have different views of the underlying system, and therefore different valuations of the correctness property. We propose an automata-based synchronous monitoring algorithm that copes with f crash failures in a distributed setting. The algorithm solves the synchronous monitoring problem in f + 1 rounds of communication and significantly reduces the message-size overhead. We also propose an algorithm for distributed crash-resilient asynchronous monitoring that consistently monitors the system under inspection without any communication between monitors. Each local monitor emits a verdict set based solely on its own partial observation, and the intersection of the verdict sets is the same as the verdict computed by a centralized monitor with a full view of the system. / Thesis / Master of Science (MSc)
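The verdict-set idea lends itself to a small illustration. The following Python sketch is hypothetical (it is not the thesis's automata-based algorithm): each monitor emits every verdict consistent with the variables it has not observed, and intersecting the sets recovers the verdict a centralized monitor would compute for a simple conjunctive property.

```python
# Hypothetical sketch: each local monitor sees only part of the global
# state and emits a *set* of candidate verdicts; intersecting the sets
# recovers the verdict a centralized monitor would compute.
from typing import Set

TRUE, FALSE = "true", "false"

def local_verdicts(partial_view: dict, prop_vars: set) -> Set[str]:
    """Emit every verdict consistent with the unobserved variables."""
    unseen = prop_vars - partial_view.keys()
    if not unseen:
        # Full knowledge of the property's variables: a single verdict.
        return {TRUE if all(partial_view[v] for v in prop_vars) else FALSE}
    # Otherwise, every completion of the unseen variables is possible.
    verdicts = set()
    for bits in range(2 ** len(unseen)):
        completed = dict(partial_view)
        for i, v in enumerate(sorted(unseen)):
            completed[v] = bool((bits >> i) & 1)
        verdicts.add(TRUE if all(completed[v] for v in prop_vars) else FALSE)
    return verdicts

# Property: a AND b, monitored by two crash-prone monitors with partial views.
prop = {"a", "b"}
m1 = local_verdicts({"a": True}, prop)             # {'true', 'false'}
m2 = local_verdicts({"a": True, "b": True}, prop)  # {'true'}
print(m1 & m2)  # intersection -> {'true'}, matching a central monitor
```

A monitor that crashes simply contributes no set; as long as the surviving monitors jointly cover the property's variables, the intersection in this toy model stays consistent.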
112

Challenges with Providing Reliability Assurance for Self-Adaptive Cyber-Physical Systems

Riaz, Sana, Kabir, Sohag, Campean, Felician, Mokryani, Geev, Dao, Cuong D., Angarita-Marquez, Jorge L., Al-Ja'afreh, Mohammad A.A. 03 February 2023 (has links)
No / Self-adaptive systems are evolving systems that can adjust their behaviour to accommodate dynamic requirements or to better serve their goals. These systems can vary in architecture, operation, or adaptation strategy depending on the application, and their evaluation can likewise take different forms depending on the system architecture and its requirements. Because of their dynamism and complexity, self-adaptive systems are prone to adaptation faults, inconsistencies in context, or low task performance. Reliability assurance is therefore important for monitoring situations that can compromise system functionality. In this paper, we provide a brief background on different types of self-adaptive systems and the various ways a system can evolve. We discuss the mechanisms that have been applied over the last two decades for reliability evaluation of such systems, and identify challenges and limitations as research opportunities related to the reliability evaluation of self-adaptive systems. / This research was undertaken as a part of the “Model-based Reliability Evaluation for Autonomous Systems with Evolving Architectures” project funded by the University of Bradford under the SURE Grant scheme.
113

A Low-latency Consensus Algorithm for Geographically Distributed Systems

Arun, Balaji 15 May 2017 (has links)
This thesis presents Caesar, a novel multi-leader Generalized Consensus protocol for geographically replicated systems. Caesar achieves near-perfect availability, provides high performance (low latency and high throughput) compared to the existing state of the art, and tolerates replica failures. A number of state-of-the-art consensus protocols implementing the Generalized Consensus definition have recently been proposed, but their major limitation is significant performance degradation when the application workload produces conflicting requests. Caesar's main goal is to overcome this limitation by changing the way a fast decision is taken: its ordering protocol does not reject a fast decision for a client request if a quorum of nodes reply with different dependency sets for that request. It switches to a slow decision only if there is no chance to agree on the proposed order for that request. Caesar achieves this using a combination of wait conditions and logical timestamping. The effectiveness of Caesar is demonstrated through an evaluation study performed on Amazon's EC2 infrastructure using 5 geo-replicated sites. Caesar outperforms multi-leader competitors (e.g., EPaxos) by as much as 1.7x in the presence of 30% conflicting requests, and single-leader ones (e.g., Multi-Paxos) by as much as 3.5x. The protocol is also resistant to heavy client loads, unlike existing protocols. / Master of Science
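The key difference from earlier fast-path protocols can be sketched in a few lines. The Python below is an illustrative toy, not Caesar's actual protocol code: a fast decision survives as long as a fast quorum confirms the proposed logical timestamp, even when the replicas report different dependency sets.

```python
# Illustrative sketch (not Caesar's implementation): a fast decision is
# kept whenever a fast quorum confirms the proposed logical timestamp,
# even if replicas report different dependency sets for the request.
from dataclasses import dataclass

@dataclass
class Reply:
    timestamp_ok: bool   # replica accepted the proposed order
    deps: frozenset      # conflicting requests the replica has seen

def decide(replies: list, fast_quorum: int):
    confirmed = [r for r in replies if r.timestamp_ok]
    if len(confirmed) >= fast_quorum:
        # Fast path: the union of reported dependencies becomes the final
        # dependency set; differing deps do NOT force a slow round.
        deps = frozenset().union(*(r.deps for r in confirmed))
        return ("fast", deps)
    # Slow path: there is no chance to agree on the proposed order.
    return ("slow", None)

replies = [Reply(True, frozenset({"q1"})), Reply(True, frozenset({"q2"})),
           Reply(True, frozenset())]
print(decide(replies, fast_quorum=3))  # ('fast', frozenset({'q1', 'q2'}))
```

In a dependency-set protocol like EPaxos, the three differing `deps` above would invalidate the fast path; tolerating that disagreement is what recovers fast decisions under conflicting workloads.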
114

Modeling of Power Consumption and Fault Tolerance for Electronic Textiles

Sheikh, Tanwir Abdulwahid 22 October 2003 (has links)
Developments in textile technology now enable the weaving of conductive wires into fabrics. This allows the introduction of electronic components such as sensors, actuators and computational devices onto the fabric, creating electronic textiles (e-textiles). E-textiles can be either wearable or non-wearable; regardless of their form, however, they occupy a tightly constrained design space requiring high computational performance, limited power consumption, and fault tolerance. The purpose of this research is to create simulation models for the power consumption and fault behavior of e-textile applications. For the power consumption model, the power profile of the computational elements must be tracked dynamically based on the power states of the e-textile components. For the fault behavior model, the physical nature of the e-textile and the faults it develops can adversely affect the accuracy of results from the e-textile. Open-circuit and short-circuit faults can disconnect or drain the battery, respectively, affecting both battery life and the performance of the e-textile. This thesis describes the development of both models and their interfaces. It then presents simulation results for the performance of an acoustic beamforming e-textile in the presence and absence of faults, using those results to explore the battery life and fault tolerance of several battery configurations. / Master of Science
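A minimal sketch of this kind of state-based power accounting follows. The power figures and component names are invented for illustration; they are not drawn from the thesis's models.

```python
# Minimal sketch of state-based power accounting: each component's draw
# follows its power state; an open fault disconnects its branch, while
# a short-circuit fault drains the battery. All numbers are assumptions.
POWER_MW = {"off": 0.0, "sleep": 0.5, "idle": 5.0, "active": 40.0}

class Component:
    def __init__(self, name):
        self.name, self.state, self.fault = name, "idle", None

def step_energy(components, battery_mwh, dt_h):
    """Return remaining battery energy after dt_h hours of operation."""
    draw = 0.0
    for c in components:
        if c.fault == "open":
            continue              # branch disconnected: draws nothing
        if c.fault == "short":
            return 0.0            # short circuit: battery drained
        draw += POWER_MW[c.state]
    return max(0.0, battery_mwh - draw * dt_h)

sensors = [Component(f"mic{i}") for i in range(4)]
sensors[0].state = "active"
print(step_energy(sensors, battery_mwh=500.0, dt_h=1.0))
# 500 - (40 + 3*5) = 445.0 mWh remaining
```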
115

Network Fault Tolerance System

Sullivan, John F 01 May 2000 (has links)
The world of computers experienced an explosive period of growth toward the end of the 20th century with the widespread availability of the Internet and the development of the World Wide Web. As people began using computer networks for everything from research and communication to banking and commerce, network failures became a greater concern because of their potential to interrupt critical applications. Fault tolerance systems were developed to detect and correct network failures within minutes, and eventually within seconds, of the failure, but time-critical applications such as military communications, video conferencing, and Web-based sales require better response times than any previous system could provide. The goal of this thesis was the development and implementation of a Network Fault Tolerance (NFT) system that can detect and recover from failures of network interface cards, network cables, switches, and routers in much less than one second from the time of failure. The problem was divided into two parts: fault tolerance within a single local area network (LAN), and fault tolerance across many local area networks. The first part involves the network interface cards, network cables, and switches within a LAN, while the second part involves the routers that connect LANs into larger internetworks. Both parts of the NFT solution were implemented on Windows NT 4.0 PCs connected by a switched Fast Ethernet network. The NFT system was found to correct system failures within 300 milliseconds of the failure.
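Sub-second recovery of this kind typically hinges on tight heartbeat deadlines. The sketch below is a generic illustration of that principle, not the NFT system's actual mechanism; the intervals and interface names are assumptions.

```python
# Generic sketch of heartbeat-based failover (not the NFT system's own
# code): a path is declared failed after three missed probes, and the
# most recently confirmed standby path takes over.
HEARTBEAT_MS = 50    # assumed probe interval
TIMEOUT_MS = 150     # three missed probes => declare failure

class FailoverDetector:
    def __init__(self, paths):
        self.paths = list(paths)
        self.active = self.paths[0]
        self.last_seen = {p: 0.0 for p in self.paths}

    def heartbeat(self, path, t_ms):
        """Record that a probe reply arrived on `path` at time t_ms."""
        self.last_seen[path] = t_ms

    def check(self, t_ms):
        """Fail over if the active path has missed its deadline."""
        if t_ms - self.last_seen[self.active] > TIMEOUT_MS:
            self.active = max(self.paths, key=self.last_seen.get)
        return self.active

d = FailoverDetector(["nic0", "nic1"])
d.heartbeat("nic0", 0); d.heartbeat("nic1", 0)
d.heartbeat("nic1", 100)    # nic0 goes silent after t=0
print(d.check(200))         # -> 'nic1', well under one second
```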
116

Adaptive Fault Tolerance Strategies for Large Scale Systems

George, Cijo January 2012 (has links) (PDF)
Exascale systems of the future are predicted to have a mean time between node failures (MTBF) of less than one hour. At such a low MTBF, the number of processors available for the execution of a long-running application can vary widely throughout its execution. Employing traditional fault tolerance strategies like periodic checkpointing in these highly dynamic environments may not be effective because of the high number of application failures, resulting in a large amount of work lost to rollbacks, apart from the increased recovery overheads. In this context, it is essential to have fault tolerance strategies that can adapt to changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that make use of node failure prediction mechanisms to provide proactive fault tolerance for long-running parallel applications on large-scale systems. The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. We first develop cost models that consider different factors, such as the accuracy of node failure predictions and application scalability, for evaluating the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions, dynamically selecting fault tolerance actions at different points of application execution to minimize application failures and maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems. In the second part of the thesis, we present a fault tolerance strategy using adaptive process replication that can provide fault tolerance for applications using partial replication of a set of application processes. This framework adaptively changes the set of replicated processes (the replicated set) periodically, based on node failure predictions, to avoid application failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows dynamically changing the replicated set of processes for MPI applications. Experiments with real scientific applications on real systems have shown that the overhead of PAREP-MPI is minimal. We have shown, using simulations with real and synthetic failure traces, that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. Significant observations are also made which can drive future research efforts in fault tolerance for large and very large scale systems.
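The cost-model-driven choice of action can be illustrated with a deliberately simplified comparison. The functions and constants below are hypothetical stand-ins, not ADFT's actual cost models: they show only how prediction accuracy can flip the cheaper action from checkpointing to migration.

```python
# Hedged sketch of a cost-model-driven decision (not ADFT's models):
# given a node-failure prediction with some precision, pick the
# fault-tolerance action with the lower expected time overhead.
def expected_cost(action, precision, t_checkpoint, t_migrate, t_rollback):
    """Expected overhead (seconds) of acting on one failure prediction."""
    if action == "checkpoint":
        # Always pay the checkpoint; a correct prediction still incurs
        # a rollback when the node actually fails.
        return t_checkpoint + precision * t_rollback
    if action == "migrate":
        # Pay migration up front; a correct prediction then costs nothing.
        return t_migrate
    raise ValueError(action)

def choose_action(precision, t_checkpoint=30, t_migrate=120, t_rollback=600):
    costs = {a: expected_cost(a, precision, t_checkpoint, t_migrate,
                              t_rollback)
             for a in ("checkpoint", "migrate")}
    return min(costs, key=costs.get), costs

print(choose_action(precision=0.1))  # low accuracy: checkpointing wins
print(choose_action(precision=0.5))  # high accuracy: migration wins
```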
117

Identification of emergent off-nominal operational requirements during conceptual architecting of the more electric aircraft

Armstrong, Michael James 09 November 2011 (has links)
With the current increased emphasis on developing energy-optimized vehicle systems architectures during the early phases of aircraft conceptual design, accurate predictions of off-nominal operational requirements are needed to justify architecture concept selection. A process was developed for capturing architecture-specific performance degradation strategies and optimally imposing their associated requirements. This process is enabled by analog extensions to traditional safety design and assessment tools and consists of six phases: Continuous Functional Hazard Assessment, Architecture Definition, Load Shedding Optimization, Analog System Safety Assessment, Architecture Optimization, and Architecture Augmentation. Systematic off-nominal analysis of requirements was performed for dissimilar architecture concepts. It was shown that the traditional discrete application of safety and reliability requirements has adverse effects on the prediction of requirements. This design bias was illustrated by cumulative unit importance metrics. Low-fidelity representations of the loss/hazard relationship place undue importance on some units and yield under- or over-predictions of system performance.
118

A framework for evolving grid computing systems

Alfawair, Mai January 2009 (has links)
Grid computing was born in the 1990s, when researchers were looking for a way to share expensive computing resources and experimental equipment. Grid computing is becoming increasingly popular because it promotes the sharing of distributed resources that may be heterogeneous in nature, and it enables scientists and engineering professionals to solve large-scale computing problems. In reality, there are already huge numbers of grid computing facilities distributed around the world, each one created to serve a particular group of scientists, such as weather forecasters, or a group of users, such as stock markets. However, the need to extend the functionality of current grid systems lends itself to the consideration of grid evolution. This allows the combination of many disjoint grids into a single powerful grid that can operate as one vast computational resource, and it also allows grid environments to be flexible, to change and to evolve. The rationale for grid evolution is the current rapid and increasing advance of both software and hardware. Evolution means adding or removing capabilities; this research defines grid evolution as adding new functions and/or equipment and removing unusable resources that affect the performance of some nodes. This thesis produces a new technique for grid evolution, allowing it to be seamless and to operate at run time. Within grid computing, evolution is an integration of software and hardware and can be of two distinct types: internal evolution, which occurs inside the grid boundary by migrating special resources such as application software from node to node within the grid, and external evolution, which occurs between grids. This thesis develops a framework for grid evolution that insulates users from the complexities of grids. The framework has at its core a resource broker together with a grid monitor, to cope with internal and external evolution, advance reservation, fault tolerance, monitoring of the grid environment, increased resource utilisation and high availability of grid resources. The starting point for the framework is when the grid receives a job whose requirements do not exist on the required node, which triggers grid evolution. If the grid has all the requirements scattered across its nodes, internal evolution ensues, migrating the required resources to the required node to satisfy the job's requirements; if the grid does not have these resources, external evolution enables the grid either to collect them from other grids (permanent evolution) or to send the job to other grids for execution (just-in-time evolution), as sketched below. Finally, a simulation tool called EVOSim has been designed, developed and tested. It is written in Oracle 10g and has been used to create four grids, each with a different setup including different nodes, application software, data and policies. Experiments were done by submitting jobs to the grid at run time, and then comparing the results and analysing the performance of the grids that use the evolution approach against those that do not. The results of these experiments demonstrate that these features significantly improve the performance of grid environments and provide excellent scheduling results, with a decreasing number of rejected jobs.
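The trigger logic described above reduces to a short dispatch decision. The Python sketch below uses hypothetical names and data structures (it is not EVOSim's API, which is implemented in Oracle 10g): a job missing requirements on its target node triggers internal migration, permanent external evolution, or just-in-time forwarding, in that order.

```python
# Hypothetical sketch of the evolution trigger (not EVOSim's code):
# requirements and resources are modelled as plain sets.
def schedule(job_reqs, node_res, grid_res, other_grids):
    missing = job_reqs - node_res
    if not missing:
        return "run locally"
    if missing <= grid_res:
        # Internal evolution: the grid already holds the resources,
        # scattered across its nodes; migrate them to the target node.
        return "internal evolution: migrate resources to the node"
    for grid in other_grids:
        if missing <= grid["resources"]:
            if grid["shares_permanently"]:
                return f"external (permanent): collect from {grid['name']}"
            return f"external (just-in-time): forward job to {grid['name']}"
    return "reject job"

print(schedule({"matlab", "dataset-A"},
               node_res={"dataset-A"},
               grid_res={"matlab", "dataset-A", "gcc"},
               other_grids=[]))
# -> internal evolution: migrate resources to the node
```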
119

FAULT-TOLERANT DISTRIBUTED CHANNEL ALLOCATION ALGORITHMS FOR CELLULAR NETWORKS

Yang, Jianchang 01 January 2006 (has links)
In cellular networks, channels should be allocated efficiently to support communication between mobile hosts. In addition, base stations may fail. Therefore, designing a fault-tolerant channel allocation algorithm is important: the algorithm should tolerate failures of base stations. Many existing algorithms are neither fault-tolerant nor efficient in allocating channels. We propose channel allocation algorithms which are both fault-tolerant and efficient. In the proposed algorithms, to borrow a channel, a base station (or a cell) does not need to get channel usage information from all its interference neighbors. This makes the algorithms fault-tolerant, i.e., they can tolerate base station failures and perform well in the presence of these failures. Channel pre-allocation affects the performance of a channel allocation algorithm, but this effect has not been studied quantitatively. We propose an adaptive channel allocation algorithm to study this effect. The algorithm allows a subset of channels to be pre-allocated to cells. Performance evaluation indicates that a channel allocation algorithm benefits from pre-allocating all channels to cells. The channel selection strategy also influences the performance of a channel allocation algorithm. Given a set of channels to borrow, how a cell chooses a channel to borrow is called the channel selection problem. When choosing a channel to borrow, many algorithms proposed in the literature do not take into account the interference that borrowing the channel causes to the cells which have the channel allocated to them. However, such interference should be considered; reducing it helps increase the reuse of the same channel, and hence improves channel utilization. We propose a channel selection algorithm that takes such interference into account. Most channel allocation algorithms proposed in the literature are for traditional cellular networks with static base stations, where the neighborhood relationship among the base stations is fixed. Such algorithms are not applicable to cellular networks with mobile base stations. We propose a channel allocation algorithm for cellular networks with mobile base stations. The proposed algorithm is both fault-tolerant and reuses channels efficiently.
KEYWORDS: distributed channel allocation, resource planning, fault-tolerance, cellular networks, 3-cell cluster model.
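The interference-aware selection idea can be shown in miniature. The data model below is a hypothetical simplification (cells and channels as plain sets, not the thesis's algorithm): among the borrowable channels, pick the one held by the fewest neighboring cells, which favors reuse of the same channel elsewhere.

```python
# Hypothetical sketch of interference-aware channel selection: choose
# the borrowable channel that interferes with the fewest neighbors
# currently holding it, improving channel reuse.
def pick_channel(borrowable, neighbors, allocation):
    """allocation maps cell -> set of channels allocated to it."""
    def interference(ch):
        return sum(1 for cell in neighbors
                   if ch in allocation.get(cell, set()))
    return min(borrowable, key=interference)

allocation = {"B": {1, 2}, "C": {2}, "D": {3}}
print(pick_channel({1, 2, 3}, neighbors=["B", "C", "D"],
                   allocation=allocation))
# -> 1 (ties broken arbitrarily; channel 2 would interfere with two cells)
```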
120

Optimised configuration of sensing elements for control and fault tolerance applied to an electro-magnetic suspension system

Michail, Konstantinos January 2009 (has links)
New technological advances, and the requirement to abide by increasingly strict safety laws in engineering design projects, strongly affect industrial products in areas such as the automotive, aerospace and railway industries. The necessity arises to design reduced-cost, high-tech products with minimal complexity, optimal performance, effective parameter-robustness properties, and high reliability with fault tolerance. In this context the control system design plays an important role, and its impact on the cost efficiency of a product is crucial. Measuring the information required for the operation of a product's control system is a vital issue, and a number of sensors may be available to select from in order to achieve the desired system properties. However, for a complex engineering system, a manual procedure to select the best sensor set subject to the desired system properties can be very complicated, time consuming or even impossible, particularly when the number of sensors is large and optimum performance is required. The thesis describes a comprehensive study of sensor selection for control and fault tolerance, with the particular application of an ElectroMagnetic Levitation (MagLev) suspension system (an unstable, nonlinear, safety-critical system with non-trivial control performance requirements). The aim of the presented work is to identify effective sensor selection frameworks, subject to given system properties, for controlling the MagLev suspension system with a level of fault tolerance. A particular objective is to identify the minimum possible set of sensors that can cover multiple sensor faults while maintaining optimum performance with the remaining sensors. The tools employed combine modern control strategies with multiobjective constrained optimisation methods (for tuning purposes). An important part of the work is the design and construction of a 25 kg MagLev suspension rig used for experimental verification of the proposed sensor selection frameworks.
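The combinatorial core of the sensor-selection problem can be sketched briefly. The spec function and sensor names below are invented placeholders (the thesis evaluates closed-loop MagLev performance via multiobjective tuning, which this toy check stands in for): enumerate sensor subsets and keep the smallest one that still meets the requirement after any single sensor fault.

```python
# Hypothetical sketch of fault-tolerant sensor-set search; meets_spec
# is a stand-in for the thesis's closed-loop performance evaluation.
from itertools import combinations

def meets_spec(sensors) -> bool:
    # Invented rule: need at least two sensors, one of which measures
    # the air gap directly ('gap') or indirectly ('flux').
    return bool(sensors & {"gap", "flux"}) and len(sensors) >= 2

def best_fault_tolerant_set(all_sensors, max_faults=1):
    for k in range(1, len(all_sensors) + 1):
        for subset in combinations(all_sensors, k):
            s = frozenset(subset)
            # Require the spec to hold after losing any max_faults sensors.
            survivors = (s - set(dead)
                         for dead in combinations(s, max_faults))
            if meets_spec(s) and all(map(meets_spec, survivors)):
                return s        # smallest set found first
    return None

sensors = ["gap", "accel", "current", "flux"]
print(best_fault_tolerant_set(sensors))
# -> e.g. frozenset({'gap', 'accel', 'flux'})
```

Exhaustive search like this scales poorly; the appeal of an optimisation-based framework such as the one proposed here is precisely that it avoids this manual, combinatorial procedure.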
