Modern systems for Machine Learning (ML) workloads support heterogeneous workloads and resources. However, existing resource managers in these systems do not differentiate between heterogeneous GPU resources. Moreover, users usually do not know which type and what amount of GPU resources are appropriate and sufficient to request for their ML jobs. In this thesis, we analyze the performance of ML training and inference jobs and identify the ML model and GPU characteristics that affect this performance. We then propose ML-based prediction models that accurately determine appropriate and sufficient resource requirements, improving job latency and GPU utilization in the cluster. / Doctor of Philosophy / Every day, we interact with many software applications in areas such as social media, e-commerce, healthcare, and finance. These applications rely on different computing systems, as well as artificial intelligence, to deliver the best service and experience to users. In this dissertation, we present optimizations that improve the performance of these artificial intelligence applications while also improving the performance and utilization of the systems and heterogeneous resources they run on. We propose using machine learning models that learn from historical data on application performance, as well as from application and resource characteristics, to predict the necessary and sufficient resource requirements for these applications, ensuring optimal performance for both the application and the underlying system.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/114021 |
Date | 01 March 2023 |
Creators | Albahar, Hadeel Ahmad |
Contributors | Electrical and Computer Engineering, Butt, Ali, Anwar, Ali, Chantem, Thidapat, Min, Chang Woo, Tilevich, Eli |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |