
Resource management techniques aware of interference among high-performance computing applications

Network interference from nearby jobs has recently been identified as the dominant cause of the high performance variability of parallel applications running on High Performance Computing (HPC) systems. Typically, HPC systems are dynamic, with multiple jobs arriving and leaving unpredictably while simultaneously sharing the system interconnection network. In such an environment, contention for network resources causes random stalls in application execution, degrading application performance.
Eliminating interactions between jobs is key to guaranteeing both high performance and performance predictability for applications. These interactions are determined by the job's location in the system. When a job arrives, resource managers allocate it computing and network resources. Based on the job's size requirements, the job scheduler finds a set of available computing nodes. In addition, the subnet manager determines the allocation of network resources such as paths between nodes, virtual lanes, and link bandwidth. Typically, resource managers focus mainly on increasing resource utilization while neglecting job interactions.
In this thesis, we propose techniques for both the job scheduler and the subnet manager that mitigate job interactions: 1) a job scheduling policy that reduces node fragmentation in the system, and 2) a quality-of-service (QoS) policy based on a characterization of each job's network load; this policy relies on the virtual lanes mechanism provided by modern interconnection networks (e.g., InfiniBand).
To evaluate our job scheduling policy, we use a simulator developed for this thesis that takes as input the job scheduler log from a production HPC system. This simulator performs the node allocation for the jobs in the log. The proposed QoS policy is evaluated using a flit-level network simulator that can replay multiple traces from real executions of MPI applications. Experimental results show that the proposed job scheduling policy leads to fewer jobs sharing network resources, and thus fewer job interactions, while the QoS policy effectively reduces the degradation from the remaining job interactions. These two software techniques are complementary and could be used together without additional hardware.
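
The abstract does not spell out the scheduling policy itself, so the following is only a minimal sketch of the general idea that node placement determines which jobs share network resources. It assumes a hypothetical machine whose nodes are grouped under leaf switches of `NODES_PER_SWITCH` nodes each, and compares a plain first-fit allocator with an illustrative switch-aware allocator. The function names, the switch size, and the "shared switches" metric are all assumptions made for illustration, not the thesis's method.

```python
from collections import defaultdict

NODES_PER_SWITCH = 4  # assumed leaf-switch size, for illustration only


def first_fit(free, size):
    """Baseline: take the lowest-numbered free nodes, wherever they are."""
    chosen = sorted(free)[:size]
    return chosen if len(chosen) == size else None


def switch_aware(free, size):
    """Hypothetical policy: keep a job on as few leaf switches as possible."""
    by_switch = defaultdict(list)
    for node in sorted(free):
        by_switch[node // NODES_PER_SWITCH].append(node)
    # Best fit: prefer the single switch with the fewest free nodes that fits.
    fitting = [nodes for nodes in by_switch.values() if len(nodes) >= size]
    if fitting:
        return min(fitting, key=len)[:size]
    # Otherwise span switches, taking those with the most free nodes first.
    chosen = []
    for nodes in sorted(by_switch.values(), key=len, reverse=True):
        chosen.extend(nodes[: size - len(chosen)])
        if len(chosen) == size:
            return chosen
    return None  # not enough free nodes


def shared_switches(allocations):
    """Count leaf switches that host nodes of more than one job."""
    jobs_on_switch = defaultdict(set)
    for job, nodes in allocations.items():
        for node in nodes:
            jobs_on_switch[node // NODES_PER_SWITCH].add(job)
    return sum(1 for jobs in jobs_on_switch.values() if len(jobs) > 1)


def run(policy, total_nodes, job_sizes):
    """Allocate jobs in arrival order; jobs that do not fit are skipped."""
    free = set(range(total_nodes))
    allocations = {}
    for job, size in enumerate(job_sizes):
        nodes = policy(free, size)
        if nodes is None:
            continue
        free -= set(nodes)
        allocations[job] = nodes
    return allocations


if __name__ == "__main__":
    sizes = [2, 4, 4, 4]
    for policy in (first_fit, switch_aware):
        allocs = run(policy, 16, sizes)
        print(policy.__name__, shared_switches(allocs))
```

In this toy setup, with job sizes 2, 4, 4, 4 on 16 nodes, first-fit leaves three leaf switches shared between jobs while the switch-aware variant leaves none. A real evaluation would replay a scheduler log with job arrivals and departures, as the simulator described above does.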

Identifier: oai:union.ndltd.org:TDX_UPC/oai:www.tdx.cat:10803/284934
Date: 19 December 2014
Creators: Jokanović, Ana
Contributors: Sancho Pitarch, José Carlos; Rodríguez Herrera, German; Labarta Mancho, Jesús; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Publisher: Universitat Politècnica de Catalunya
Source Sets: Universitat Politècnica de Catalunya
Language: English
Detected Language: English
Type: info:eu-repo/semantics/doctoralThesis, info:eu-repo/semantics/publishedVersion
Format: 144 p., application/pdf
Source: TDX (Tesis Doctorals en Xarxa)
Rights: Access to the contents of this thesis is subject to acceptance of the terms of use established by the following Creative Commons license: http://creativecommons.org/licenses/by-nc-sa/3.0/es/, info:eu-repo/semantics/openAccess
