
Efficient serverless resource scheduling for distributed deep learning.

The growing scale and complexity of computer vision, natural language processing, and speech recognition algorithms have raised the need for scalable, fault-tolerant machine learning systems. To meet these demands, many have turned to implementing machine learning on distributed systems. When running time-demanding and resource-intensive tasks such as machine learning training on a cluster, resource efficiency is essential to keeping training time low, and a cluster scheduler is used to achieve efficient resource allocation. Standard scheduling frameworks, however, are not designed for deep learning because of their static resource allocation. Most frameworks also do not use a serverless architecture, despite its ease of management and rapid scalability making it a fitting choice for deep learning tasks. We therefore present Coach, a serverless job scheduler specialized for parameter server based deep learning models. Coach makes scheduling decisions that maximize resource efficiency and minimize training time by using regression techniques to fit functions to data from previous training epochs.

With Coach we attempt to answer three questions concerning the training speed (epochs/second) of deep learning models on a distributed system with a serverless architecture. One: does adding more workers and parameter servers improve training speed when a varying number of concurrent training jobs are running? Two: does distributed training on a cluster with limited resources improve training speed compared to training on a single node? Three: how accurately do predictions from functions fitted to previous training data estimate the number of workers and parameter servers that maximizes training speed?

Due to limitations of the cluster used for testing, we find that a minimal setup of a single worker and a single parameter server is almost always optimal. The results indicate that an additional server can have a slight positive effect in some situations, while an additional worker only appears beneficial in high-variance situations where many jobs are running at the same time; we theorize that this is caused by choices made by the Kubernetes scheduler.
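
The regression step described above can be illustrated with a minimal sketch. Assuming a simple saturating model of training speed as a function of worker and parameter-server counts (the model form, the `speed_model` name, and the numbers in `history` are all hypothetical illustrations, not Coach's actual implementation), one could fit the model to measurements from previous epochs and select the configuration predicted to train fastest:

```python
import numpy as np
from scipy.optimize import curve_fit

def speed_model(X, a, b, c, d):
    # Assumed diminishing-returns form: speed rises with the number of
    # workers w and parameter servers s but saturates as either count grows.
    w, s = X
    return a * (w / (w + b)) * (s / (s + c)) + d

# Hypothetical measurements from earlier training epochs:
# columns are (workers, parameter servers, observed epochs/second).
history = np.array([
    [1, 1, 0.42],
    [2, 1, 0.55],
    [1, 2, 0.46],
    [2, 2, 0.60],
    [4, 2, 0.63],
])
w_obs, s_obs, speed_obs = history.T

# Fit the model parameters to the observed training speeds.
params, _ = curve_fit(speed_model, (w_obs, s_obs), speed_obs,
                      p0=[1.0, 1.0, 1.0, 0.0], maxfev=10_000)

# Evaluate each feasible configuration within the cluster's limits
# and pick the one with the highest predicted training speed.
candidates = [(w, s) for w in range(1, 9) for s in range(1, 5)]
best = max(candidates, key=lambda ws: speed_model(ws, *params))
print(f"Predicted optimum: {best[0]} workers, {best[1]} parameter servers")
```

A saturating function family is one plausible choice here, since adding workers increases communication overhead with the parameter servers; the abstract itself does not specify which family of functions Coach fits.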

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-187447
Date: January 2021
Creators: Sundkvist, Johan
Publisher: Umeå universitet, Institutionen för datavetenskap
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
Relation: UMNAD ; 1293
