31

Robust, High-Speed Network Design for Large-Scale Multiprocessing

DeHon, Andre 01 September 1993
As multiprocessor system size scales upward, two important aspects of multiprocessor systems will generally get worse rather than better: (1) interprocessor communication latency will increase and (2) the probability of some component in the system failing will increase. These problems can prevent us from realizing the potential benefits of large-scale multiprocessing. In this report we consider the problem of designing networks that simultaneously minimize communication latency and maximize fault tolerance. Using a synergy of techniques, including connection topologies, routing protocols, signalling techniques, and packaging technologies, we assemble integrated, system-level solutions to this network design problem.
32

Otherworld - Giving Applications a Chance to Survive OS Kernel Crashes

Depoutovitch, Alexandre 06 January 2012
The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications, with an attendant loss of any non-persistent "work in progress". Our thesis is that an operating system kernel is simply a component of a larger software system, one that is logically well isolated from other components such as applications, and that it should therefore be possible to reboot the kernel without terminating everything else running on the same system. To prove this thesis, we designed and implemented a new mechanism, called Otherworld, that microreboots the operating system kernel when a critical error is encountered in the kernel, and does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure by restoring their memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can register user-level recovery procedures with the kernel, in which case Otherworld passes control to these procedures after restoring their process state. Recovery procedures might check the integrity of application data and restore resources that Otherworld was not able to restore. We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of cases. In the default case, Otherworld adds negligible overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with an overhead of between 4% and 12%.
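The user-level recovery-procedure hook described in this abstract is easy to picture in code. The sketch below is purely illustrative: the registration function, the hook signature, and the snapshot file name are hypothetical stand-ins, since Otherworld's real interface is a kernel mechanism not detailed here.

```python
# Illustrative only: register_recovery and "ledger.snapshot" are
# hypothetical stand-ins, not Otherworld's actual kernel interface.
import json

RECOVERY_HOOKS = []

def register_recovery(hook):
    """Stand-in for registering a user-level recovery procedure."""
    RECOVERY_HOOKS.append(hook)

def ledger_recovery(state):
    # After a kernel microreboot restores the process image, the hook
    # validates application data and repairs what it can.
    if not (isinstance(state, dict) and "balance" in state):
        with open("ledger.snapshot") as f:   # fall back to a snapshot
            state = json.load(f)
    return state

register_recovery(ledger_recovery)
```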
34

RADIC: a powerful fault-tolerant architecture

Amancio Duarte, Angelo 28 June 2007
Fault tolerance has become a major issue for computer engineers and software developers because the occurrence of faults increases the cost of using a parallel computer. On the other hand, the activities performed by the fault tolerance mechanism reduce the performance of the system from the user's point of view. This thesis presents RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers), a fault-tolerant architecture for parallel computers that is simultaneously transparent, decentralized, flexible and scalable. RADIC implements a fully distributed controller to manage faults. This controller rests on dedicated processes, which share the user's resources in the parallel computer. In order to validate the operation of RADIC, we created RADICMPI, a message-passing implementation that includes the elements of the RADIC architecture and complies with the MPI-1 standard. RADICMPI served to verify the functionality of RADIC in scenarios with and without failures in the parallel computer. For the tests, we implemented a fault injector in RADICMPI in order to create the scenarios required to validate the operation of the RADIC distributed controller. We also used RADICMPI to study the practical aspects of using RADIC in a real environment. This allowed us to evaluate the operation of our architecture in practical situations, and to study the influence of the RADIC parameters on system performance. The results proved that the RADIC architecture operates correctly and that it is flexible, scalable, transparent and decentralized. Furthermore, RADIC establishes a powerful fault-tolerant architecture model for message-passing systems.
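As a rough illustration of one duty a decentralized controller process performs, the sketch below implements heartbeat-based fault detection in which every node monitors its peers locally. The class name, timeouts, and protocol details are invented for illustration, not RADIC's actual design.

```python
# Minimal sketch of decentralized heartbeat-based fault detection.
# HEARTBEAT_INTERVAL and FAILURE_TIMEOUT are assumed values.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats
FAILURE_TIMEOUT = 3.0      # silence threshold before suspecting a fault

class NodeController:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers                 # neighbouring controllers
        self.last_seen = {p: time.time() for p in peers}

    def on_heartbeat(self, peer):
        self.last_seen[peer] = time.time()

    def suspected_failures(self):
        # Each controller decides locally, with no central coordinator;
        # that locality is what makes the scheme decentralized and scalable.
        now = time.time()
        return [p for p, t in self.last_seen.items()
                if now - t > FAILURE_TIMEOUT]

c = NodeController("n1", peers=["n2", "n3"])
c.on_heartbeat("n2")
print(c.suspected_failures())   # [] until FAILURE_TIMEOUT elapses
```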
35

Multipath Fault-tolerant Routing Policies to deal with Dynamic Link Failures in High Speed Interconnection Networks

Zarza, Gonzalo Alberto 08 July 2011
Interconnection networks communicate and link together the processing units of modern high-performance computing systems. In this context, network faults have an extremely high impact, since most routing algorithms were not designed to tolerate them. Because of this, as few as one single link failure may stall messages in the network, leading to deadlock configurations or, even worse, preventing applications running on the system from finishing correctly. In this thesis we present fault-tolerant routing policies based on the concepts of adaptability and deadlock freedom, capable of serving interconnection networks affected by a large number of link failures. Two contributions are presented: a multipath fault-tolerant routing method, and a novel and scalable deadlock avoidance technique. The first contribution is the adaptive multipath routing method Fault-tolerant Distributed Routing Balancing (FT-DRB). This method exploits the communication-path redundancy available in many network topologies, allowing interconnection networks to keep operating in the presence of a large number of faults. The second contribution is the scalable deadlock avoidance technique Non-blocking Adaptive Cycles (NAC), specifically designed for interconnection networks suffering a large number of failures. This technique was designed and implemented to ensure freedom from deadlock in the proposed fault-tolerant routing method, FT-DRB.
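The core idea behind multipath fault tolerance, keeping several node-disjoint routes on hand so traffic can shift when a link dies, can be sketched with a greedy disjoint-path search. The toy code below is an illustration of that idea under simplified assumptions, not FT-DRB itself.

```python
# Greedy search for node-disjoint paths: find a shortest path, then ban
# its intermediate nodes and search again. Simple, not optimal.
from collections import deque

def bfs_path(adj, src, dst, banned):
    """Shortest src->dst path avoiding 'banned' intermediate nodes."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev and (v == dst or v not in banned):
                prev[v] = u
                queue.append(v)
    return None

def disjoint_paths(adj, src, dst, k):
    """Up to k node-disjoint src->dst paths."""
    banned, paths = set(), []
    for _ in range(k):
        p = bfs_path(adj, src, dst, banned)
        if p is None:
            break
        paths.append(p)
        banned.update(p[1:-1])   # intermediate nodes now off-limits
    return paths

ring = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
print(disjoint_paths(ring, 0, 3, 2))   # two disjoint routes around the ring
```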
36

Analyzing IP/MPLS as Fault Tolerant Network Architecture

Kebria, Muhammad Roohan January 2012
MPLS is a widely used technology in service provider and enterprise networks across the globe. An MPLS-enabled infrastructure can transport any type of payload (ATM, Frame Relay and Ethernet) over it, providing a multipurpose architecture. An incoming packet is classified only once, as it enters the MPLS domain, and is assigned label information; thereafter, all forwarding decisions along the specified path are based on the attached label rather than on destination IP addresses. As network applications become mission critical, the requirements for fault-tolerant networks increase, as a basic prerequisite for carrying sensitive traffic. The fault tolerance mechanisms provided by an IP/MPLS network help deliver end-to-end Quality of Service within a domain by better handling blackouts and brownouts. This thesis shows how MPLS increases the capability of a deployed IP infrastructure to transport traffic between end devices in the presence of unexpected failures. It also examines how MPLS gives a packet-switched network circuit-switched behavior while retaining the characteristics of packet-switched technology. A new mechanism for MPLS fault tolerance is proposed.
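A toy model makes the "classify once, then switch on labels" behavior concrete. Everything in the table below is invented for illustration; real label-switching routers program their label forwarding information base (LFIB) via signaling protocols such as LDP or RSVP-TE.

```python
# Toy MPLS label switching with a protection entry per label: a packet
# is classified once at ingress, then forwarded by label swap, and a
# failed next hop triggers local repair onto a preprogrammed backup.

# label -> (out_label, primary_next_hop, backup_next_hop)
LFIB = {
    16: (17, "LSR-B", "LSR-C"),
    17: (18, "LSR-D", "LSR-E"),
}
FAILED_LINKS = {"LSR-B"}       # next hops currently unreachable

def ingress_classify(dst_ip):
    """One-time classification of a packet into a label (FEC lookup)."""
    return 16 if dst_ip.startswith("10.") else 17

def forward(label):
    out_label, nh, backup = LFIB[label]
    # Local repair: switch to the backup path without any IP lookup.
    chosen = backup if nh in FAILED_LINKS else nh
    return out_label, chosen

print(forward(ingress_classify("10.1.2.3")))   # -> (17, 'LSR-C')
```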
37

A Skeleton Supporting Group Collaboration, Load Distribution, and Fault Tolerance for Internet-based Computing

Chiang, Chuanwen 13 August 2001
This dissertation explores the design of a dual connection skeleton (DCS), which facilitates effective and efficient exploitation of Internet-centric collaborative workgroup and high-performance metacomputing applications. The predominant difference between DCS and conventional frameworks is that DCS administers a network of brokers that are grouped into a logical ring. New mechanisms for group collaboration, load distribution, and fault tolerance, three crucial issues in Internet-based computing, are proposed and integrated into the dual connection skeleton. Group collaboration becomes a significant issue when developing wide-area applications that support computer-supported cooperative work (CSCW). For group collaboration, DCS therefore offers a concurrency-control strategy that ensures the consistency of shared resources. Using this strategy, multiple users in a collaborative group are able to access shared data simultaneously without violating its consistency. With respect to load distribution, DCS applies an adaptive highest response ratio next (AHRRN) algorithm to job scheduling. Performance evaluations against competing algorithms, such as shortest job first (SJF), highest response ratio next (HRRN), and first come, first served (FCFS), are conducted. Simulation results demonstrate that AHRRN is not only an efficient algorithm but is also able to prevent the well-known job starvation problem. In a parallel computational application, one can further decompose a composite job into constituent tasks so that these tasks can be assigned to different processing elements (PEs) for concurrent execution. The dual connection skeleton thus makes use of a proposed dynamic grouping scheduling (DGS) algorithm to undertake task scheduling for performance improvement. The DGS algorithm employs a task-grouping strategy to determine the computational costs of tasks. It re-prioritizes unscheduled tasks at each scheduling step to explore an appropriate task allocation decision. In terms of schedule length, the performance of DGS has been evaluated by comparison with existing algorithms such as Heavy Node First (HNF), Critical Path Method (CPM), Weight Length (WL), Dynamic Level Scheduling (DLS), and Dynamic Priority Scheduling (DPS). Simulation results show that DGS outperforms these competing algorithms. Moreover, for fault tolerance, DCS utilizes a dual connection mechanism to enhance computational reliability. For constructing the dual connection, we examine five approaches: RANDOM, NEXT, ROTARY, MINNUM, and WEIGHT. Each of these approaches can be incorporated into DCS-based wide-area metacomputing systems. Performance simulation shows that WEIGHT benefits the dual connection the most. A DCS-based scientific computational application, motion correction, is used to demonstrate the fault-tolerance ability of DCS. Putting the group collaboration, load distribution, and fault tolerance issues together, the dual connection skeleton forms a seamless, integrated framework for Internet-centric computing.
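For reference, classic HRRN, the base that AHRRN adapts, selects the ready job with the highest response ratio: (waiting time + service time) / service time. The sketch below shows plain HRRN only; the abstract does not specify how AHRRN adapts the ratio, so that part is not guessed at here.

```python
# Classic HRRN scheduling. The ratio grows with waiting time, so a
# long-waiting short job eventually wins; this prevents starvation.

def response_ratio(wait_time, service_time):
    return (wait_time + service_time) / service_time

def pick_next(jobs, now):
    """jobs: list of (job_id, arrival_time, service_time) tuples."""
    return max(jobs, key=lambda j: response_ratio(now - j[1], j[2]))

ready = [("j1", 0.0, 10.0), ("j2", 2.0, 3.0), ("j3", 4.0, 1.0)]
print(pick_next(ready, now=8.0))   # -> ('j3', 4.0, 1.0), ratio 5.0
```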
38

On strong fault tolerance (or strong Menger-connectivity) of multicomputer networks

Oh, Eunseuk 15 November 2004
As the size of networks increases continuously, dealing with networks with faulty nodes becomes unavoidable. In this dissertation, we introduce a new measure for network fault tolerance, the strong fault tolerance (or strong Menger-connectivity) in multicomputer networks, and study the strong fault tolerance of popular multicomputer network structures. Let G be a network in which all nodes have degree d. We say that G is strongly fault tolerant if it has the following property: let G_f be a copy of G with at most d - 2 faulty nodes; then for any pair of non-faulty nodes u and v in G_f, there are min{deg_f(u), deg_f(v)} node-disjoint paths in G_f from u to v, where deg_f(u) and deg_f(v) are the degrees of the nodes u and v in G_f, respectively. First we study the strong fault tolerance of popular network structures such as star networks and hypercube networks. We show that star networks and hypercube networks are strongly fault tolerant, and we develop efficient algorithms that construct the maximum number of node-disjoint paths of nearly optimal or optimal length in these networks when they contain faulty nodes. Our algorithms are optimal in terms of their time complexity. In addition to studying strong fault tolerance, we also investigate a more realistic concept to describe the ability of networks to tolerate faults. The traditional definition of fault tolerance, sustaining at most d - 1 faulty nodes for a regular graph G of degree d, reflects a very rare situation. In many cases there is a chance that a routing path between two given nodes can be constructed even though the network has more faulty nodes than its degree. In this dissertation, we study the fault tolerance of hypercube networks under a probability model. When each node of the n-dimensional hypercube network has an independent failure probability p, we develop algorithms that, with very high probability, can construct a fault-free path when the hypercube network sustains up to 2^n p faulty nodes.
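The definition can be checked mechanically on small networks. The sketch below, which assumes the networkx library is available, deletes a faulty node from a 3-cube and verifies that surviving pairs still have min(deg_f(u), deg_f(v)) node-disjoint paths. It is a brute-force check of the property, not the dissertation's efficient construction algorithm, and it skips adjacent pairs because flow-based disjoint-path routines are defined for non-adjacent endpoints.

```python
# Brute-force check of strong Menger-connectivity on a small hypercube.
# Requires networkx (pip install networkx).
import itertools
import networkx as nx

def strongly_menger(G, faulty):
    Gf = G.copy()
    Gf.remove_nodes_from(faulty)
    for u, v in itertools.combinations(Gf.nodes, 2):
        if Gf.has_edge(u, v):
            continue   # restrict to non-adjacent pairs (see lead-in)
        bound = min(Gf.degree[u], Gf.degree[v])
        if len(list(nx.node_disjoint_paths(Gf, u, v))) < bound:
            return False
    return True

Q3 = nx.hypercube_graph(3)           # the 3-cube: regular of degree d = 3
# The dissertation's condition allows up to d - 2 = 1 faulty node here.
print(strongly_menger(Q3, faulty=[(0, 0, 0)]))   # expected: True
```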
39

Failure recovery in redundant serial manipulators

Cocca, Christopher David, January 2000
Thesis (Ph.D.), University of Texas at Austin, 2000. Includes vita and bibliographical references (leaves 214-223). Also available in a digital version from Dissertation Abstracts.
40

Fusion-based Hadoop MapReduce job for fault tolerance in distributed systems

Ho, Iat-Kei 09 December 2013
The standard recovery solution for a failed task in Hadoop systems is to execute the task again; after retrying a configured number of times, the task is marked as failed. With significant amounts of data and complicated Map and Reduce functions, recovering corrupted or unfinished data from a failed job can be more efficient than re-executing the same job. This paper extends [1] by applying the fusion-based technique [7][8] to Hadoop MapReduce task execution to enhance its fault tolerance. Multiple data sets are executed through Hadoop MapReduce with and without fusion in various pre-defined failure scenarios for comparison. As the complexity of the Map and Reduce functions relative to the Recover function increases, it becomes more efficient to utilize fusion, and users can tolerate faults while incurring less than ten percent of extra execution time.
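The flavor of fusion-based backup can be shown with a bytewise XOR: one fused block stands in for per-block replicas, and any single lost block is rebuilt from the fused copy plus the survivors. This is a generic illustration of the fused-backup idea from [7][8], not the paper's actual Recover function; the block contents are invented.

```python
# XOR fusion: fuse(blocks) is a single backup block; any one lost
# block equals the XOR of the backup with all surviving blocks.

def fuse(blocks):
    fused = bytearray(len(blocks[0]))   # assumes equal-length blocks
    for block in blocks:
        for i, b in enumerate(block):
            fused[i] ^= b
    return bytes(fused)

def recover(fused, surviving):
    # backup ^ (all survivors) cancels them out, leaving the lost block.
    return fuse([fused] + surviving)

blocks = [b"mapout-0", b"mapout-1", b"mapout-2"]
backup = fuse(blocks)
print(recover(backup, [blocks[0], blocks[2]]) == blocks[1])   # True
```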
