Spelling suggestions: "subject:"multiprocessor"" "subject:"coprocessor""
1 |
Checkpointing Algorithms for Parallel ComputersKalaiselvi, S 02 1900 (has links)
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state, in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging and process migration.
Checkpointing in uniprocessor systems is easy due to the fact that there is a single clock and events occur with respect to this clock. There is a clear demarcation of events that happens before a checkpoint and events that happens after a checkpoint. In parallel computers a large number of computers coordinate to solve a single problem. Since there might be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. Absence of a global clock necessitates explicit coordination to obtain a consistent global state.
Events occurring in a distributed system, can be ordered partially using Lamport's happens before relation. Lamport's happens before relation ->is a partial ordering relation to identify dependent and concurrent events occurring in a distributed system.
It is defined as follows:
·If two events a and b happen in the same process, and if a happens before b, then a->b
·If a is the sending event of a message and b is the receiving event of the same message then a -> b
·If neither a à b nor b -> a, then a and b are said to be concurrent.
A consistent global state may have concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding ordering of events in a parallel computer, need for coordination among checkpoints and other aspects related to checkpointing. Checkpointing locations can either be identified statically or dynamically. The static approach assumes that a representation of a program to be checkpointed is available with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we have proposed algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows:
1. Parallel computers that are being built now have faster communication and hence more efficient clock synchronisation compared to those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be maintained within a few microseconds. We have proposed a dynamic checkpointing algorithm for parallel computers assuming bounded clock drifts.
2. The shared memory paradigm is convenient for programming while message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantage of both paradigms and can be visualized easily on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI), to con6gure computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm has been proposed in the thesis for a DSM system which uses the SCI standard.
3. When information about a parallel program is available one can make use of this knowledge to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task graph based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform.
We now give a gist of various chapters of the thesis. Chapter 2 of the thesis gives a classification of existing checkpointing algorithms. The chapter surveys algorithm that have been reported in literature for checkpointing parallel/distributed systems. A point to be noted is that most of the algorithms published for checkpointing message passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in the above mentioned article and by extending the features to minimise the overheads of coordination and context saving.
Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory
in stable storage. Chapter 2 concludes with brief comments on the desirable features of a
checkpointing algorithm.
In Chapter 3, we develop a dynamic checkpointing algorithm for message passing systems assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers owing to the fact that communication between processors is very fast. Based on efficient clock synchronisation protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. The algorithm is a periodic, distributed algorithm. We prove correctness of the algorithm and compare it with similar clock based algorithms.
Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard, facilitates the construction of scalable coherent systems. In Chapter 4 we discuss a checkpointing algorithm for an SCI based DSM system. SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system. SCI recommends a two phase transaction protocol for communication. Our algorithm is a two phase centralised coordinated algorithm. Phase one initiates checkpoints and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with the discussion of the features of SCI exploited by the checkpointing algorithm proposed in the thesis.
In Chapter 5, a static checkpointing algorithm is developed assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph. We assume that the estimates of the time to execute each task in the task graph is given. Given the timing at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time. It significantly reduces the context saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors. The algorithm complexity is O(km) where m is the number of edges in the graph and k the maximum number of global checkpoints to be taken.
The static algorithm is implemented on a parallel computer with a PVM environment as it is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run time routines. The preprocessing routines operate on the task graph information to generate a set of edges to be checkpointed for each global checkpoint and write the information on disk. The run time routines save the context along the marked edges. In case of recovery, the recovery algorithms read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis.
We conclude the thesis with a summary of proposed algorithms and possible directions to continue research in the area of checkpointing.
|
2 |
Analyse probabiliste des systèmes temps réel / Probabilistic analysis of real-time systemsMaxim, Dorin 10 December 2013 (has links)
Les systèmes embarqués temps réel critiques intègrent des architectures complexes qui évoluent constamment afin d'intégrer des nouvelles fonctionnalités requises par les utilisateurs finaux des systèmes (automobile, avionique, ferroviaire, etc.). Ces nouvelles architectures ont un impact direct sur la variabilité du comportement temporel des systèmes temps réel. Cette variabilité entraîne un sur-approvisionnement important si la conception du système est uniquement basée sur le raisonnement pire cas. Approches probabilistes proposent des solutions basées sur la probabilité d'occurrence des valeurs les plus défavorables afin d'éviter le sur-approvisionnement, tout en satisfaisant les contraintes temps réel. Les principaux objectifs de ce travail sont de proposer des nouvelles techniques d'analyse des systèmes temps réel probabilistes et des moyens de diminuer la complexité de ces analyses, ainsi que de proposer des algorithmes optimaux d'ordonnancement à priorité fixe pour les systèmes avec des temps d'exécution décrits par des variables aléatoires. Les résultats que nous présentons dans ce travail ont été prouvés surs et à utiliser pour les systèmes temps réel durs, qui sont l'objet principal de notre travail. Notre analyse des systèmes avec plusieurs paramètres probabilistes a été démontrée considérablement moins pessimiste que d'autres types d'analyses. Cet analyse combinée avec des algorithmes d'ordonnancement optimaux appropriées pour les systèmes temps réel probabilistes peut aider les concepteurs de systèmes à mieux apprécier la faisabilité d'un système, en particulier de ceux qui sont jugé irréalisable par des analyses/algorithmes d'ordonnancement déterministes / Critical real-time embedded systems integrate complex architectures that evolve constantly in order to provide new functionality required by the end users of the systems (automotive, avionics, railway, etc). These new architectures have a direct impact on the variability of the timing behavior of the real-time system. This variability leads to important over-provisioning if the design of the system is based only on worst case reasoning. Probabilistic approaches propose solutions are based on the probability of occurrence of the worst case values in order to avoid over provisioning while satisfying real-time constraints. The main objectives of this work are new analysis techniques for probabilistic real-time systems and ways of decreasing the complexity of these analyses, as well as to propose optimal fixed priority scheduling algorithms for systems that have variability at the level of execution times. The results that we provide in this work have been proved applicable to hard real-time systems, which are the main focus of our work. Our proposed analysis for systems with multiple probabilistic parameters has been shown to greatly decrease the pessimism introduced by other types of analyses. This type of analysis combined with the proper optimal scheduling algorithms for probabilistic real-time system help the system designers to better appreciate the feasibility of a system, especially of those that are deemed unfeasible by deterministic analyses/scheduling algorithms
|
Page generated in 0.066 seconds