Global ETD Search


The Design of Fault Tolerance of Cluster Computing Platform

When nodes fail in a distributed application service, handling the missing results is costly and places additional load on the scheduler. To avoid recalculating the entire result set when a fault occurs, only the data of the failed nodes is recalculated on backup machines. This paper therefore experiments with three methods and analyzes their pros and cons: N + N nodes, N + 1 nodes, and N + 1 nodes with probability. The third method assigns each job a weight before scheduling, then converts the weight into a probability and a nice value (as defined by SLURM [1]) to influence the order in which the scheduler runs jobs. When a fault occurs, the results computed on the healthy nodes are returned to the control node, and the failed node's jobs may or may not be reassigned to a backup machine to obtain complete results. Finally, we compare the strengths and weaknesses of the three methods.
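The abstract does not say how weights become probabilities or nice values, so the following is only a minimal sketch of one possible reading. It assumes that probabilities are the normalized weights, that a higher probability maps linearly to a lower nice value, and that a single backup node takes over a failed node's jobs. The function names, the `max_nice` parameter, and the node labels are all hypothetical and are not taken from the thesis or from SLURM.

```python
def weights_to_probabilities(weights):
    """Normalize job weights so they sum to 1 (assumed conversion)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {job: w / total for job, w in weights.items()}


def probability_to_nice(p, max_nice=100):
    """Map a probability to a nice value.

    Assumption: a linear mapping in which a higher probability gives a
    lower nice value, i.e. a higher scheduling priority. max_nice is a
    hypothetical upper bound chosen for illustration.
    """
    return round((1.0 - p) * max_nice)


def reassign_failed_jobs(assignment, failed_node, backup_node):
    """Move every job of the failed node to the backup node (N + 1 style)."""
    return {
        job: (backup_node if node == failed_node else node)
        for job, node in assignment.items()
    }


# Hypothetical jobs and nodes, for illustration only.
weights = {"job_a": 1, "job_b": 3}
probs = weights_to_probabilities(weights)
nice_values = {job: probability_to_nice(p) for job, p in probs.items()}
assignment = {"job_a": "node1", "job_b": "node2"}
recovered = reassign_failed_jobs(assignment, "node2", "backup1")
```

In this sketch the heavier job receives the smaller nice value, so a scheduler that honors nice values would tend to run it earlier. Jobs on healthy nodes keep their placement, and only the failed node's jobs move to the backup.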

http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0829112-205635

distributed computing

Identifier	oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0829112-205635
Date	29 August 2012
Creators	Liao, Yu-tien
Contributors	Cheng-Fu Chou, Chun-Hung Lin, Hsiao-Guang Wu, Shi-Huang Chen, Ying-Chih Lin
Publisher	NSYSU
Source Sets	NSYSU Electronic Thesis and Dissertation Archive
Language	Cholon
Detected Language	English
Type	text
Format	application/pdf
Source	http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0829112-205635
Rights	user_define, Copyright information available at source archive
