GEMS: A Fault Tolerant Grid Job Management System

Grid environments are inherently unstable: resources join and leave without prior notification. Detecting application faults, checkpointing, and restarting applications are therefore of foremost importance. The need for fault tolerance is especially acute for large parallel applications, since the failure rate grows with the number of processors and the duration of the computation.
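
As a rough illustration of why scale and run time matter, consider a simple model that is an assumption here and is not taken from the thesis: each of N processors fails independently at a constant rate λ, and the job runs for time T. The symbols N, λ, and T are introduced only for this sketch.

```latex
% Assumed model: independent, exponentially distributed processor failures.
% N = number of processors, \lambda = per-processor failure rate, T = run time.
\[
  P(\text{at least one failure during the job}) = 1 - e^{-\lambda N T},
  \qquad
  \text{mean time between failures} = \frac{1}{N\lambda}.
\]
```

Under these assumptions, the probability of an interrupted run approaches 1 as either N or T grows, which is consistent with the observation above about large parallel applications.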

A Grid job management system hides the heterogeneity of the Grid and the complexity of the Grid protocols from the user. The user submits a job, and the system finds an appropriate resource, submits the job to it, and transfers the output files back to the user when the job completes. However, current Grid job management systems do not detect application failures.

The goal of this research is to develop a Grid job management system that can efficiently detect application failures. A failed job is either restarted on the same resource or migrated to another resource and restarted there. The research also aims to identify the role of local resource managers in the fault detection and migration of Grid applications. / Master of Science
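
The restart-or-migrate choice described in the abstract can be sketched as a small decision function. This is a minimal illustration and not the GEMS implementation: the `Job` class, `handle_failure`, the `max_local_restarts` threshold, and the resource names are all hypothetical, and the policy (retry locally a bounded number of times, then migrate) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; not a GEMS data structure."""
    name: str
    resource: str          # resource the job is currently assigned to
    restarts: int = 0      # restarts attempted on the current resource


def handle_failure(job, resource_alive, candidates, max_local_restarts=1):
    """Pick the resource on which to restart a failed job.

    Assumed policy: restart on the same resource while it is still alive
    and the local restart budget is not exhausted; otherwise migrate to
    the first other candidate resource.
    """
    if resource_alive and job.restarts < max_local_restarts:
        job.restarts += 1
        return job.resource

    others = [r for r in candidates if r != job.resource]
    if not others:
        raise RuntimeError("no alternative resource available")

    # Migrate: reassign the job and reset its local restart counter.
    job.resource = others[0]
    job.restarts = 0
    return job.resource


if __name__ == "__main__":
    job = Job("simulation", "clusterA")
    resources = ["clusterA", "clusterB"]
    print(handle_failure(job, True, resources))   # first failure: local restart
    print(handle_failure(job, True, resources))   # second failure: migration
```

A real system would also need to detect the failure in the first place and restore state from a checkpoint before restarting; both steps are outside this sketch.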

fault tolerance

grid computing

grid job management systems

local resource manager

job migration

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/9661
Date	08 January 2004
Creators	Tadepalli, Sriram Satish
Contributors	Computer Science, Ribbens, Calvin J., Kafura, Dennis G., Varadarajan, Srinidhi
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Thesis
Format	ETD, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/
Relation	thesis.pdf
