Return to search

Towards SLO-aware Resource Scheduling for Serverless Inference Workloads

The rapid advancement of Machine Learning (ML) and Deep Learning (DL) has revolutionized various domains, necessitating efficient and cost-effective ML inference capabilities. Function-as-a-Service (FaaS) has emerged as a promising approach for hosting ML inference services, providing a serverless computing environment that streamlines development cycles and offers scalability and simplified infrastructure management. However, existing autoscaling strategies employed by popular FaaS platforms often overlook critical factors such as response time and tail latency. Additionally, Python's Global Interpreter Lock (GIL) poses challenges for parallel computing in high-request traffic scenarios. This thesis addresses the need for efficient and cost-effective Machine Learning (ML) inference capabilities by exploring batching and autoscaling strategies for Serverless Inference instances. The study proposes a prototype FaaS framework that provides adaptive request batching, reactive autoscaling policies, and SLO monitoring, thus allowing Serverless Inference workloads to meet their SLO targets even during peak traffic. The proposed approach aims to optimize resource utilization, mitigate tail latency, and improve overall system performance. / Master of Science / Machine Learning (ML) and Deep Learning (DL) are advanced techniques that allow computers to learn from data and make predictions or decisions without being explicitly programmed. This has led to significant advancements in various fields. Inference refers to the process of applying a trained ML model to new data to make predictions or extract insights. In the context of ML, there is a growing need for efficient and cost-effective inference capabilities. A new approach called Function-as-a-Service (FaaS) has emerged that can address this need. FaaS is a way of abstracting the server infrastructure away from the developers. This means developers can focus on writing the ML code without worrying about managing the underlying infrastructure. FaaS offers benefits such as scalability, simplified infrastructure management, and faster development cycles. However, existing FaaS platforms face challenges in ensuring fast response times and handling high levels of incoming requests. This thesis aims to address these challenges by proposing a prototype FaaS framework. The framework incorporates adaptive request batching, reactive autoscaling policies, and Service-Level Objectives (SLOs) monitoring. Request batching allows the framework to process multiple requests together, improving efficiency. Autoscaling policies ensure the system dynamically adjusts its resources based on the incoming workload. Monitoring SLOs helps track and meet performance targets, even during peak traffic. By optimizing resource utilization, reducing delays in processing requests, and improving overall system performance, the proposed approach seeks to provide efficient and cost-effective ML inference capabilities in a serverless environment.

Identiferoai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/116005
Date08 August 2023
CreatorsTripathy, Abhijit
ContributorsComputer Science and Applications, Butt, Ali, Rafique, Muhammad Mustafa, Nikolopoulos, Dimitrios S.
PublisherVirginia Tech
Source SetsVirginia Tech Theses and Dissertation
LanguageEnglish
Detected LanguageEnglish
TypeThesis
FormatETD, application/pdf
RightsCreative Commons Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/

Page generated in 0.0019 seconds