1 |
Reliability of space systems: a multi-element integrated approach. Baker, Gena Humphrey, 01 October 2000.
No description available.
2 |
Scheduling for composite event detection in wireless sensor networks. Ambrose, Arny I. (Florida Atlantic University, 2008).
Wireless sensor networks are used in areas that are inaccessible or inhospitable, and for continuous monitoring. The main use of such networks is event detection: monitoring a particular environment for an event such as fire or flooding. Composite event detection breaks the detection of an event down into the specific conditions that need to be present for the event to occur. Using this method, no single sensor node needs to carry every sensing component necessary to detect the event. Since energy efficiency is important, the sensor nodes need to be scheduled so that they consume as little energy as possible, extending the network lifetime. In this thesis, a solution to the sensor Scheduling for Composite Event Detection (SCED) problem is presented as a way to improve the network lifetime when using composite event detection. / by Arny I. Ambrose. / Thesis (M.S.C.S.)--Florida Atlantic University, 2008. / Includes bibliography.
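The abstract does not spell out the SCED algorithm itself, but the underlying idea, covering every condition of a composite event while keeping as few nodes awake as possible, can be illustrated with a greedy set-cover heuristic. The sketch below is a hypothetical illustration under assumed node names, an assumed unit-energy-per-round model, and an assumed greedy rule; it is not the thesis's solution:

```python
from typing import Dict, List, Optional, Set

def greedy_cover(nodes: Dict[str, Set[str]], conditions: Set[str],
                 energy: Dict[str, int]) -> Optional[List[str]]:
    """Pick nodes whose sensing components together cover all event conditions,
    preferring nodes that cover the most uncovered conditions (greedy set cover)."""
    uncovered, chosen = set(conditions), []
    while uncovered:
        best = max((n for n in nodes if energy[n] > 0 and n not in chosen),
                   key=lambda n: (len(nodes[n] & uncovered), energy[n]),
                   default=None)
        if best is None or not nodes[best] & uncovered:
            return None  # remaining nodes cannot detect the composite event
        chosen.append(best)
        uncovered -= nodes[best]
    return chosen

def schedule_rounds(nodes, conditions, energy, rounds):
    """Activate one covering set per round; active nodes pay one energy unit."""
    schedule = []
    for _ in range(rounds):
        cover = greedy_cover(nodes, conditions, energy)
        if cover is None:
            break  # network lifetime for this event has ended
        for n in cover:
            energy[n] -= 1
        schedule.append(cover)
    return schedule

# Example: detecting "fire" requires temperature, smoke and light readings.
nodes = {"n1": {"temp"}, "n2": {"smoke", "light"},
         "n3": {"temp", "smoke"}, "n4": {"light"}}
energy = {"n1": 2, "n2": 2, "n3": 2, "n4": 2}
print(schedule_rounds(nodes, {"temp", "smoke", "light"}, energy, rounds=4))
```

Rotating which covering set is active in each round spreads the energy drain across the network, which is the intuition behind scheduling for a longer lifetime.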
3 |
Improving System Reliability for Cyber-Physical Systems. Wu, Leon L., January 2015.
Cyber-physical systems (CPS) are systems featuring a tight combination of, and coordination between, the system’s computational and physical elements. Cyber-physical systems range from critical infrastructure such as power grids and transportation systems to health and biomedical devices. System reliability, i.e., the ability of a system to perform its intended function under a given set of environmental and operational conditions for a given period of time, is a fundamental requirement of cyber-physical systems. An unreliable system often leads to disruption of service, financial cost and even loss of human life. An important and prevalent type of cyber-physical system meets the following criteria: processing large amounts of data; employing software as a system component; running online continuously; and having an operator-in-the-loop, because human judgment and accountability are required for safety-critical systems. This thesis aims to improve system reliability for this type of cyber-physical system.
To improve system reliability for this type of cyber-physical system, I present a system evaluation approach entitled automated online evaluation (AOE): a data-centric runtime monitoring and reliability evaluation approach that works in parallel with the cyber-physical system, continuously conducting automated evaluation along the workflow of the system using computational intelligence and self-tuning techniques, and providing operator-in-the-loop feedback on reliability improvement. For example, abnormal input and output data at or between the multiple stages of the system can be detected and flagged through data quality analysis, and alerts can be sent to the operator-in-the-loop. The operator can then take action and make changes to the system based on the alerts in order to achieve minimal system downtime and increased system reliability. One technique used by the approach is data quality analysis using computational intelligence, which evaluates data quality in an automated and efficient way in order to make sure the running system performs reliably as expected. Another technique is self-tuning, which automatically self-manages and self-configures the evaluation system so that it adapts to changes in the system and to feedback from the operator. To implement the proposed approach, I further present a system architecture called the autonomic reliability improvement system (ARIS).
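As a rough illustration of the data-quality-analysis and self-tuning ideas, the following sketch flags abnormal readings at one pipeline stage using a rolling statistical window and loosens or tightens its alert threshold based on operator feedback. The class name, the threshold rule, and the feedback rule are assumptions made only for illustration; they do not describe the ARIS implementation:

```python
import statistics
from collections import deque

class StageMonitor:
    """Flags abnormal readings at one stage of a CPS pipeline using a rolling
    mean/stddev window; the threshold self-tunes as operator feedback arrives."""

    def __init__(self, window: int = 100, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k  # how many standard deviations count as "abnormal"

    def observe(self, value: float) -> bool:
        """Return True (raise an alert) if the value looks abnormal."""
        abnormal = False
        if len(self.history) >= 10:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history)
            abnormal = std > 0 and abs(value - mean) > self.k * std
        if not abnormal:
            self.history.append(value)  # only learn from values judged normal
        return abnormal

    def operator_feedback(self, was_false_alarm: bool):
        """Self-tuning: widen the threshold after false alarms, tighten otherwise."""
        self.k = self.k * 1.1 if was_false_alarm else max(1.5, self.k * 0.95)

monitor = StageMonitor()
for reading in [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.7, 10.4, 55.0]:
    if monitor.observe(reading):
        print(f"alert: reading {reading} deviates from recent history")
```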
This thesis investigates three hypotheses. First, I claim that automated online evaluation empowered by data quality analysis using computational intelligence can effectively improve system reliability for cyber-physical systems in the domain of interest indicated above. To prove this hypothesis, a prototype system needs to be developed and deployed in various cyber-physical systems, and reliability metrics are required to measure the reliability improvement quantitatively. Second, I claim that self-tuning can effectively self-manage and self-configure the evaluation system, based on changes in the system and feedback from the operator-in-the-loop, to improve system reliability. Third, I claim that the approach is efficient: it should not have a large impact on overall system performance and should introduce only minimal extra overhead to the cyber-physical system. Performance metrics should be used to measure the efficiency and added overhead quantitatively.
Additionally, in order to conduct efficient and cost-effective automated online evaluation for data-intensive CPS, which requires large volumes of data and devotes much of its processing time to I/O and data manipulation, this thesis presents COBRA, a cloud-based reliability assurance framework. COBRA provides automated multi-stage runtime reliability evaluation along the CPS workflow using data relocation services, a cloud data store, data quality analysis and process scheduling with self-tuning to achieve scalability, elasticity and efficiency.
Finally, in order to provide a generic way to compare and benchmark system reliability for CPS and to extend the approach described above, this thesis presents FARE, a reliability benchmark framework that employs a CPS reliability model, a set of methods and metrics on evaluation environment selection, failure analysis, and reliability estimation.
The main contributions of this thesis include validation of the above hypotheses and empirical studies of the ARIS automated online evaluation system, the COBRA cloud-based reliability assurance framework for data-intensive CPS, and the FARE framework for benchmarking reliability of cyber-physical systems. This work has advanced the state of the art in CPS reliability research, expanded the body of knowledge in this field, and provided some useful studies for further research.
4 |
Detecting and tolerating faults in distributed systems. Ogale, Vinit Arun, 1979-, 05 October 2012.
This dissertation presents techniques for detecting and tolerating faults in distributed systems. Detecting faults in distributed or parallel systems is often very difficult. We look at the problem of determining whether a property or assertion was true in a computation. We formally define a logic called BTL that can be used to define such properties. Our logic takes temporal properties into consideration, as these are often necessary for expressing conditions like safety violations and deadlocks. We introduce the idea of a basis of a computation with respect to a property: a compact and exact representation of the states of the computation where the property was true. We exploit the lattice structure of the computation and the structure of different types of properties, and avoid brute-force approaches. We show that it is possible to efficiently detect all properties that can be expressed using nested negations, disjunctions, conjunctions, and the temporal operators possibly and always. Our algorithm is polynomial in the number of processes and events in the system, though it is exponential in the size of the property.

After faults are detected, it is necessary to act on them and, whenever possible, continue operation with minimal impact. This dissertation also deals with designing systems that can recover from faults. We look at techniques for tolerating faults in data and in program state. In particular, we look at the problem where multiple servers hold different data and program state, all of which need to be backed up to tolerate failures. Most current approaches to this problem involve some form of replication; other approaches, based on erasure coding, have high computational and communication overheads. We introduce the idea of fusible data structures to back up data. This approach relies on the inherent structure of the data to determine techniques for combining multiple such structures on different servers into a single backup data structure. We show that most commonly used data structures, such as arrays, lists, stacks, and queues, are fusible, and we present algorithms for fusing them. This approach requires less space than replication without increasing the time complexity of any update. In case of failures, data from the backup and the other non-failed servers is required to recover. To maintain program state across failures, we assume that programs can be represented by deterministic finite state machines. Though this approach may not yet be practical for large programs, it is very useful for small concurrent programs like sensor networks or finite state machines in hardware designs. We present the theory of fusion of state machines: given a set of such machines, we present a polynomial-time algorithm to compute another set of machines which can tolerate the required number of faults in the system.
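As a toy example of the fusion idea, assuming here that fusion is a simple element-wise sum over equal-length integer arrays (far simpler than the general constructions in the dissertation), a single fused array can back up the arrays of several servers in less space than full replication while still allowing recovery of any one failed server:

```python
from typing import List

def fuse(arrays: List[List[int]]) -> List[int]:
    """Fused backup of equal-length arrays from several servers:
    the element-wise sum (one extra array instead of n full replicas)."""
    return [sum(column) for column in zip(*arrays)]

def update_fused(fused: List[int], index: int, old: int, new: int) -> None:
    """When one server updates arr[index] from old to new, the fused copy is
    patched in O(1) without touching the other servers' data."""
    fused[index] += new - old

def recover(fused: List[int], surviving: List[List[int]]) -> List[int]:
    """Rebuild the single failed server's array from the fused backup and
    the arrays of all surviving servers."""
    return [f - sum(column) for f, column in zip(fused, zip(*surviving))]

# Three servers, one fused backup.
a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
backup = fuse([a, b, c])          # [12, 15, 18]
b[1] = 50                         # server B updates locally ...
update_fused(backup, 1, 5, 50)    # ... and forwards only the delta to the backup
print(recover(backup, [a, c]))    # server B failed: recovers [4, 50, 6]
```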
5 |
Adaptable stateful application server replication. Wu, Huaigu, 1975-, January 2008.
In recent years, multi-tier architectures have become the standard computing environment for web and enterprise applications. The application server tier is often the heart of the system, embedding the business logic. Adaptability, in particular the capability to adjust to the load submitted to the system and to handle the failure of individual components, is of utmost importance in order to provide 24/7 access and high performance. Replication is a common means to achieve these reliability and scalability requirements. With replication, the application server tier consists of several server replicas. Thus, if one replica fails, others can take over. Furthermore, the load can be distributed across the available replicas. Although many replication solutions have been proposed so far, most of them have been developed either for fault-tolerance or for scalability. Furthermore, only a few have considered that the application server tier is only one tier in a multi-tier architecture, that this tier maintains state, and that execution in this environment can follow complex patterns. Thus, existing solutions often do not provide correctness beyond some basic application scenarios. / In this thesis we tackle the issue of replication of the application server tier from the ground up and develop a unified solution that provides both fault-tolerance and scalability. We first describe a set of execution patterns that describe how requests are typically executed in multi-tier architectures. They consider the flow of execution across the client tier, application server tier, and database tier. In particular, the execution patterns describe how requests are associated with transactions, the fundamental execution units at the application server and database tiers. With these execution patterns in mind, we provide a formal definition of what it means to provide a correct execution across all tiers, even when failures occur and the application server tier is replicated. Informally, a replicated system is correct if it behaves exactly as a non-replicated system that never fails. From there, we propose a set of replication algorithms for fault-tolerance that provide correctness for the execution patterns we have identified. The main principle is to let a primary application server (AS) replica execute all client requests, and to propagate any state changes performed by a transaction to backup replicas at transaction commit time. The challenges occur because requests can be associated in different ways with transactions. We then extend our fault-tolerance solution into a unified solution that provides both fault-tolerance and load-balancing. In this extended solution, each application server replica is able to execute client requests as a primary and at the same time serves as a backup for other replicas. The framework provides a transparent, truly distributed and lightweight load distribution mechanism which takes advantage of the fault-tolerance infrastructure. Our replication tool is implemented as a plug-in for the JBoss application server and its performance is carefully evaluated, comparing it with JBoss's own replication solutions. The evaluation shows that our protocols have very good performance and compare favorably with existing solutions.
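A minimal sketch of the primary-backup principle described above, with application server state modeled as a key-value dictionary and the state delta of a "transaction" propagated to backups at commit time. Only the primary-executes, propagate-at-commit idea comes from the abstract; the class, the dictionary-delta mechanism, and the request names are assumptions for illustration and do not reflect the actual JBoss plug-in:

```python
import copy
from typing import Any, Callable, Dict, List

class Replica:
    """One application server replica: holds session state and can act as
    primary (executes requests) or backup (applies propagated changes)."""

    def __init__(self, name: str):
        self.name = name
        self.state: Dict[str, Any] = {}

    def execute(self, request: Callable[[Dict[str, Any]], Any],
                backups: List["Replica"]):
        """Primary path: run the request inside a 'transaction' on a scratch
        copy, then propagate the changed entries to backups at commit time."""
        scratch = copy.deepcopy(self.state)
        result = request(scratch)                  # business logic runs here
        changes = {k: v for k, v in scratch.items()
                   if self.state.get(k) != v}      # state delta of the transaction
        self.state.update(changes)                 # commit locally
        for b in backups:
            b.apply(changes)                       # propagate at commit
        return result

    def apply(self, changes: Dict[str, Any]):
        """Backup path: install the primary's committed state changes."""
        self.state.update(copy.deepcopy(changes))

primary, backup = Replica("AS1"), Replica("AS2")

def add_to_cart(state):
    cart = state.setdefault("cart:alice", [])
    cart.append("book")
    return len(cart)

primary.execute(add_to_cart, backups=[backup])
print(backup.state)   # the backup can take over with the same session state
```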
6 |
Low-cost and efficient architectural support for correctness and performance debugging. Venkataramani, Guru Prasadh V., January 2009.
Thesis (Ph.D.)--Computing, Georgia Institute of Technology, 2010. / Committee Chair: Prvulovic, Milos; Committee Members: Hughes, Christopher J.; Kim, Hyesoon; Lee, Hsien-Hsin S.; Loh, Gabriel H. Part of the SMARTech Electronic Thesis and Dissertation Collection.
7 |
Adaptable stateful application server replication. Wu, Huaigu, 1975-, January 2008.
No description available.
8 |
Low-cost and efficient architectural support for correctness and performance debugging. Venkataramani, Guru Prasadh V., 15 July 2009.
With rapid growth in computer hardware technologies and architectures, software programs have become increasingly complex and error-prone. This software complexity has resulted in program crashes and even security threats. Correctness debugging means making sure that the program does not exhibit any unintended behavior at runtime. A fully correct program without good performance, however, does not bring a software product commercial success; performance debugging ensures good performance on the target hardware platform.
A number of prior debugging solutions either suffer from huge performance overheads or incur high implementation costs. We propose low-cost and efficient hardware solutions that target three specific correctness and performance problems, namely, memory debugging, taint propagation and comprehensive cache miss classification. Experiments show that our mechanisms incur low performance overheads and can be designed with minimal changes to existing processor hardware. While architects invest time and resources into designing high-end architectures, we show that it is equally important to incorporate useful debugging features into these processors in order to enhance the ease of use for programmers.
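Of the three targeted problems, taint propagation is the easiest to sketch in a few lines. The software model below mimics the kind of metadata such hardware support would track in shadow state: taint bits follow data through copies and ALU operations, and a policy check fires when tainted data reaches a sensitive use. The names, the propagation rules, and the policy are illustrative assumptions, not the hardware design from the thesis:

```python
class TaintTracker:
    """Word-granularity shadow metadata: each address maps to a taint bit.
    Hardware proposals keep this metadata alongside data so checks add little
    overhead; this software model only demonstrates the propagation rules."""

    def __init__(self):
        self.memory = {}    # address -> value
        self.taint = set()  # addresses currently holding tainted data

    def store(self, addr: int, value: int, tainted: bool = False):
        self.memory[addr] = value
        (self.taint.add if tainted else self.taint.discard)(addr)

    def copy(self, dst: int, src: int):
        """MOV-style instruction: data and taint both flow from src to dst."""
        self.store(dst, self.memory[src], src in self.taint)

    def binop(self, dst: int, src1: int, src2: int):
        """ALU-style instruction: the result is tainted if any operand is."""
        self.store(dst, self.memory[src1] + self.memory[src2],
                   src1 in self.taint or src2 in self.taint)

    def check_jump_target(self, addr: int):
        """Policy check: using tainted data as a control-flow target is flagged."""
        if addr in self.taint:
            raise RuntimeError(f"tainted value at {addr:#x} used as jump target")

t = TaintTracker()
t.store(0x1000, 0xDEADBEEF, tainted=True)  # value derived from untrusted input
t.copy(0x2000, 0x1000)                     # taint follows the copy
try:
    t.check_jump_target(0x2000)            # flagged: potential control-flow hijack
except RuntimeError as e:
    print("alert:", e)
```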
9 |
Hardware assisted memory checkpointing and applications in debugging and reliability. Doudalis, Ioannis, 25 July 2011.
The problems of software debugging and system reliability/availability are among the most challenging problems the computing industry faces today, with direct impact on the development and operating costs of computing systems. A promising debugging technique that helps programmers identify and fix the causes of software bugs much more efficiently is bidirectional debugging, which enables the user to execute the program in "reverse"; a typical method used to recover a system after a fault is backwards error recovery, which restores the system to the last error-free state. Both reverse execution and backwards error recovery are enabled by creating memory checkpoints, which are used to restore the program or system to a prior point in time and re-execute until the point of interest. The checkpointing frequency is the primary factor that affects both the latency of reverse execution and the recovery time of the system: more frequent checkpoints reduce the necessary re-execution time.
Frequent creation of checkpoints poses performance challenges, because of the increased number of memory reads and writes necessary for copying the modified system/program memory, and also because of software interventions, additional synchronization and I/O, etc., needed for creating a checkpoint. In this thesis I examine a number of different hardware accelerators, whose role is to create frequent memory checkpoints in the background, at minimal performance overheads. For the purpose of reverse execution, I propose the HARE and Euripus hardware checkpoint accelerators. HARE and Euripus create different types of checkpoints, and employ different methods for keeping track of the modified memory. As a result, HARE and Euripus have different hardware costs and provide different functionality which directly affects the latency of reverse execution. For improving the availability of the system, I propose the Kyma hardware accelerator. Kyma enables simultaneous creation of checkpoints at different frequencies, which allows the system to recover from multiple types of errors and tolerate variable error-detection latencies. The Kyma and Euripus hardware engines have similar architectures, but the functionality of the Kyma engine is optimized for further reducing the performance overheads and improving the reliability of the system. The functionality of the Kyma and Euripus engines can be combined into a unified accelerator that can serve the needs of both bidirectional debugging and system recovery.
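A software analogue of incremental memory checkpointing helps make the trade-off concrete: the first write to a page after a checkpoint saves that page's old contents in an undo log, so rollback touches only the pages actually modified. This is a simplified model under an assumed page-granularity, undo-log scheme; it is not the HARE, Euripus, or Kyma design:

```python
PAGE = 4096

class CheckpointedMemory:
    """Toy model of incremental memory checkpointing: writes are trapped, the
    first write to a page after a checkpoint saves that page's old contents,
    and rollback restores only the saved pages."""

    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.checkpoints = []   # closed undo logs: list of {page_number: saved_bytes}
        self.current = {}       # undo log since the last checkpoint

    def checkpoint(self):
        """Close the current undo log and start a new, empty one."""
        self.checkpoints.append(self.current)
        self.current = {}

    def write(self, addr: int, data: bytes):
        page = addr // PAGE
        if page not in self.current:                 # first touch since last checkpoint
            start = page * PAGE
            self.current[page] = bytes(self.mem[start:start + PAGE])
        self.mem[addr:addr + len(data)] = data

    def rollback(self, n: int = 1):
        """Undo the last n checkpoint intervals (backwards error recovery, or one
        step of reverse execution at checkpoint granularity)."""
        logs = [self.current] + self.checkpoints[::-1][:max(n - 1, 0)]
        for log in logs:                             # most recent interval first
            for page, saved in log.items():
                self.mem[page * PAGE:(page + 1) * PAGE] = saved
        self.checkpoints = self.checkpoints[:len(self.checkpoints) - max(n - 1, 0)]
        self.current = {}

m = CheckpointedMemory(4 * PAGE)
m.write(0, b"AAAA")
m.checkpoint()
m.write(0, b"BBBB")
m.rollback()           # undo everything after the last checkpoint
print(m.mem[:4])       # bytearray(b'AAAA')
```

More frequent checkpoints mean shorter undo logs and less re-execution after rollback, but each checkpoint adds copying and bookkeeping work, which is exactly the overhead the proposed hardware accelerators aim to hide.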
10 |
Towards Manifesting Reliability Issues In Modern Computer Systems. Zheng, Mai, 02 September 2015.
No description available.