Systems for running distributed deep learning training on the cloud have recently been developed. An important component of a distributed deep learning job handler is its resource allocation scheduler, which allocates computing resources to the parts of a distributed training architecture. In this thesis, a serverless distributed deep learning job handler was built on Kubernetes to compare job completion time under two different Kubernetes schedulers: the default Kubernetes scheduler and a gang-like custom scheduler. The schedulers were compared in experiments that varied the deep learning model, the number of resources requested, and the number of concurrent jobs. No significant difference in job completion time between the schedulers could be found. However, the gang scheduler showed two benefits over the default scheduler. First, it prevents resource deadlocks, where one or more jobs lock resources but are unable to start. Second, it reduces the risk of epoch straggling, where a job is allocated too few workers to complete its epochs in a reasonable time and thereby prevents other jobs from using the resources locked by the straggler job.
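The deadlock described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the thesis's implementation and not the actual logic of the Kubernetes scheduler: the `Job` class and the functions `schedule_per_pod` and `schedule_gang` are hypothetical names, the per-pod placement is assumed to interleave workers from different jobs, and a job is assumed to start only once it holds all of its requested workers.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical training job that needs a fixed number of workers."""
    name: str
    workers_needed: int
    workers_held: int = 0

    @property
    def can_start(self) -> bool:
        # Assumption: a job starts only when every requested worker is placed.
        return self.workers_held == self.workers_needed


def schedule_per_pod(jobs, free_slots):
    """Place one worker at a time, round-robin across jobs.

    Assumption: this stands in for a scheduler that treats each worker
    independently and has no notion of which workers belong together.
    Returns the number of slots left unallocated.
    """
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for job in jobs:
            if free_slots > 0 and job.workers_held < job.workers_needed:
                job.workers_held += 1
                free_slots -= 1
                progress = True
    return free_slots


def schedule_gang(jobs, free_slots):
    """Place a job's workers all at once, or not at all.

    Returns the number of slots left unallocated.
    """
    for job in jobs:
        if job.workers_needed <= free_slots:
            job.workers_held = job.workers_needed
            free_slots -= job.workers_needed
    return free_slots


if __name__ == "__main__":
    for scheduler in (schedule_per_pod, schedule_gang):
        jobs = [Job("job-a", 3), Job("job-b", 3)]
        left = scheduler(jobs, free_slots=4)
        started = [job.name for job in jobs if job.can_start]
        print(scheduler.__name__, "started:", started, "free slots:", left)
```

With two jobs that each request three workers and only four free slots, the per-pod sketch leaves each job holding two workers, so all slots are locked and neither job can start. The all-or-nothing sketch starts the first job and leaves the second waiting without holding any resources.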
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-189688 |
Date | January 2021 |
Creators | Lövenvald, Frans-Lukas |
Publisher | Umeå universitet, Institutionen för datavetenskap |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UMNAD ; 1298 |