In the era of artificial intelligence and machine learning, AI/ML inference tasks have become exceedingly popular. However, executing these workloads on dedicated hardware may not be feasible for many users due to high maintenance costs, varying load patterns, and long time to production. Furthermore, ML inference workloads are stateless, and most of them are not extremely latency sensitive. For example, tasks such as fake review removal, abusive language detection, tweet classification, image tagging, and free-tier chatbots do not require real-time inference. All these characteristics make serverless platforms a good fit for deployment. In this work, we identify the bottlenecks involved in hosting these inference jobs on serverless platforms and optimize serverless for better performance and resource utilization. Specifically, we identify model loading and model memory duplication as major bottlenecks in serverless inference, and to address these problems, we propose a new approach that rethinks the way we serve FaaS requests. To support this design, we employ a hybrid scaling approach to implement the autoscaling feature of serverless. / Master of Science / Most modern software applications leverage machine learning to incorporate intelligent features. For instance, platforms like Yelp employ machine learning algorithms to detect fake reviews, while intelligent chatbots such as ChatGPT provide interactive conversations. Even Netflix relies on machine learning to recommend personalized content to its users. Creating these machine learning services involves several stages, including data collection, model training using the collected data, and serving the trained model to deploy the service. This final stage, known as inference, is crucial for delivering real-time predictions or responses to user queries. In our research, we select serverless computing as the preferred infrastructure for deploying these popular inference workloads.
Serverless, also referred to as Function as a Service (FaaS), is an execution paradigm in cloud computing that allows users to efficiently run their code by providing scalability, elasticity, and fine-grained billing. In this work, we identify model loading and model memory duplication as major bottlenecks in serverless inference. To solve these problems, we propose a new approach that rethinks the way we serve FaaS requests. To support this design, we use a hybrid scaling approach to implement the autoscaling feature of serverless.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/116068 |
Date | 21 August 2023 |
Creators | Ellore, Anish Reddy |
Contributors | Computer Science and Applications, Butt, Ali, Hu, Liting, Williams, Daniel John |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |