Comparing a gang-like scheduler with the default Kubernetes scheduler in a multi-tenant serverless distributed deep learning training environment

Systems for running distributed deep learning training in the cloud have recently been developed. An important component of a distributed deep learning job handler is its resource allocation scheduler, which allocates computing resources to the parts of a distributed training architecture. In this thesis, a serverless distributed deep learning job handler was built on Kubernetes to compare job completion time under two different Kubernetes schedulers: the default Kubernetes scheduler and a gang-like custom scheduler. The schedulers were compared in experiments with different configurations of deep learning models, resource count selections, and numbers of concurrent jobs. No significant difference in job completion time between the schedulers could be found. However, the gang scheduler showed two benefits over the default scheduler. First, it prevents resource deadlocks, in which one or more jobs hold resources but are unable to start. Second, it reduces the risk of epoch straggling, in which a job is allocated too few workers to complete epochs in a reasonable time and thereby prevents other jobs from using the resources it holds.
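The deadlock benefit described in the abstract can be illustrated with a toy model. The sketch below is not the thesis's implementation and not the actual algorithm of the Kubernetes scheduler; the names `Job`, `piecemeal_allocate`, `gang_allocate`, and `runnable` are hypothetical, and the round-robin hand-out is an assumed simplification of allocating workers one at a time without an all-or-nothing check.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers_needed: int  # workers required before training can start
    workers_held: int = 0


def piecemeal_allocate(jobs, free_slots):
    """Hand out slots one at a time, round-robin, with no all-or-nothing check."""
    progressed = True
    while free_slots > 0 and progressed:
        progressed = False
        for job in jobs:
            if free_slots > 0 and job.workers_held < job.workers_needed:
                job.workers_held += 1
                free_slots -= 1
                progressed = True
    return free_slots


def gang_allocate(jobs, free_slots):
    """Admit a job only if all of its workers fit at once; otherwise it holds nothing."""
    for job in jobs:
        if job.workers_needed <= free_slots:
            job.workers_held = job.workers_needed
            free_slots -= job.workers_needed
    return free_slots


def runnable(jobs):
    """Names of jobs that hold every worker they need."""
    return [j.name for j in jobs if j.workers_held == j.workers_needed]


# Two jobs that each need 3 workers compete for 4 slots.
piecemeal_jobs = [Job("a", 3), Job("b", 3)]
piecemeal_free = piecemeal_allocate(piecemeal_jobs, 4)
# Each job ends up holding 2 of 3 workers: all slots are taken, nothing can start.

gang_jobs = [Job("a", 3), Job("b", 3)]
gang_free = gang_allocate(gang_jobs, 4)
# Job "a" is admitted whole; job "b" holds nothing and waits for free slots.
```

In this model the piecemeal allocator leaves both jobs partially allocated with no free slots, which is the deadlock the abstract describes, while the gang allocator lets one job run and keeps the remaining slot free.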

http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-189688

Computer Sciences

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-189688
Date	January 2021
Creators	Lövenvald, Frans-Lukas
Publisher	Umeå universitet, Institutionen för datavetenskap
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	UMNAD ; 1298
