Global ETD Search

1	The Design of Fault Tolerance of Cluster Computing Platform Liao, Yu-tien 29 August 2012 (has links) If nodes got failed in a distributed application service, it will not only pay more cost to handle with these results missing, but also make scheduler cause additional loadings. For whole results don¡¦t recalculated cause by fault occurs, it will be recalculated data of fault nodes in backup machines. Therefore, this paper uses three methods: N + N nodes, N + 1 nodes, and N + 1 nodes with probability to experiment and analyze their pros and cons, the third way gives jobs weight before assigning them, and converts weight into probability and nice value(defined by SLURM[1]) to influence scheduler¡¦s decision of jobs¡¦ order. When fault occurs, calculating in normal nodes¡¦ results will back to control node, and then the fault node¡¦s jobs are going to be reassigned or not be reassigned to backup machine for getting complete results. Finally, we will analyze these three ways good and bad. SLURM cluster system Job duplication fault-tolerance distributed computing