The Design of Fault Tolerance of Cluster Computing Platform
Liao, Yu-tien, 29 August 2012
When nodes fail in a distributed application service, handling the missing results adds cost and places additional load on the scheduler. To avoid recalculating the entire result set when a fault occurs, only the data of the failed nodes is recalculated, on backup machines. This paper therefore experiments with three methods (N + N nodes, N + 1 nodes, and N + 1 nodes with probability) and analyzes their pros and cons. The third method assigns each job a weight before dispatch and converts that weight into a probability and a nice value (as defined by SLURM [1]) to influence the order in which the scheduler runs jobs. When a fault occurs, the results computed on the normal nodes are returned to the control node, and the failed node's jobs may or may not be reassigned to a backup machine so that complete results can be obtained. Finally, we analyze the strengths and weaknesses of the three methods.
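The weight-to-probability-to-nice-value conversion used by the third method can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption: the function names, the normalization of weights into probabilities, the linear mapping, and the `max_nice` bound are all hypothetical and chosen only to show the idea that a heavier job ends up with a lower nice value and is therefore scheduled earlier.

```python
def weights_to_probabilities(weights):
    """Normalize job weights so they sum to 1 (assumed conversion)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {job: w / total for job, w in weights.items()}


def probability_to_nice(p, max_nice=100):
    """Map a probability to a non-negative nice value (assumed linear mapping).

    A higher probability gives a lower nice value, i.e. a higher priority.
    max_nice is a hypothetical upper bound, not a SLURM constant.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return round((1.0 - p) * max_nice)


# Hypothetical job weights, for illustration only.
weights = {"job_a": 5, "job_b": 3, "job_c": 2}
probs = weights_to_probabilities(weights)
nice = {job: probability_to_nice(p) for job, p in probs.items()}

# Jobs ordered from highest to lowest priority (lowest nice value first).
order = sorted(nice, key=nice.get)
```

In a SLURM setting, a value computed this way could be passed at submission time through the `--nice` option of `sbatch`. Whether the thesis does exactly that is not stated in the abstract.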