The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems

Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction, so understanding why nodes and jobs fail in HPC clusters is essential. This paper analyzes node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed execution data for approximately 3.0M jobs on System A and 2.2M jobs on System B, drawing on accounting logs, resource usage for all primary local and remote resources (memory, IO, network), and node failure data. We observe different kinds of correlations between failures and resource usage, and we propose a job failure prediction model that triggers event-driven checkpointing to avoid wasted work. We also provide generalizable insights for cluster management to improve reliability; for example, in some execution environments local contention dominates, while in others system-wide contention dominates.

10.25394/pgs.9044138.v1

Computer Engineering

Distributed Computing

Distributed and Grid Systems

High Performance Computing

Performance Evaluation

Failure Analysis

Failure Prediction

Identifier	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/9044138
Date	14 August 2019
Creators	Rakesh Kumar (7039253)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/The_Mystery_of_the_Failing_Jobs_Insights_from_Operational_Data_from_Two_University-Wide_Computing_Systems/9044138
