
The Design of Fault Tolerance of Cluster Computing Platform

When nodes fail in a distributed application service, handling the missing results becomes more costly and the scheduler takes on additional load. To avoid recalculating every result after a fault, only the data of the failed nodes is recalculated, on backup machines. This paper therefore experiments with three methods, N + N nodes, N + 1 nodes, and N + 1 nodes with probability, and analyzes their pros and cons. The third method gives each job a weight before assigning it, then converts the weight into a probability and a nice value (as defined by SLURM [1]) to influence the order in which the scheduler runs jobs. When a fault occurs, the results computed on the normal nodes are returned to the control node, and the failed node's jobs may or may not be reassigned to a backup machine to obtain complete results. Finally, we analyze the strengths and weaknesses of the three methods.
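The abstract does not say how a weight is converted into a probability or a nice value, so the following is only a minimal sketch under stated assumptions: weights are normalized into probabilities, and the most probable job gets nice 0 while less probable jobs get larger nice values (in SLURM, a larger nice value lowers a job's scheduling priority). The function names, the job names, and the `NICE_MAX` bound are hypothetical and are not taken from the thesis.

```python
from typing import Dict

# Hypothetical upper bound for illustration; not taken from the thesis or from SLURM.
NICE_MAX = 100


def weights_to_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize job weights so that they sum to 1 (assumed mapping)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {job: w / total for job, w in weights.items()}


def probabilities_to_nice(probs: Dict[str, float],
                          nice_max: int = NICE_MAX) -> Dict[str, int]:
    """Give the most probable job nice 0 and scale the rest toward nice_max."""
    top = max(probs.values())
    return {job: round((1 - p / top) * nice_max) for job, p in probs.items()}


# Hypothetical job weights: job_a is twice as important as the others.
weights = {"job_a": 4.0, "job_b": 2.0, "job_c": 2.0}
probs = weights_to_probabilities(weights)
nice = probabilities_to_nice(probs)
```

Under these assumptions, a job with a larger weight ends up with a smaller nice value and would therefore tend to be ordered earlier by the scheduler; the actual conversion used in the thesis may differ.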

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0829112-205635
Date: 29 August 2012
Creators: Liao, Yu-tien
Contributors: Cheng-Fu Chou, Chun-Hung Lin, Hsiao-Guang Wu, Shi-Huang Chen, Ying-Chih Lin
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: Cholon
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0829112-205635
Rights: user_define, Copyright information available at source archive
