It is projected that by 2026, most workloads in cloud data centers will be Deep Learning (DL) workloads. However, these workloads pose significant challenges due to their high computational demands, requiring infrastructure and platform advancements to meet DL’s performance, efficiency, and scalability requirements. One emerging problem in large-scale DL is the data stall issue, which occurs when DL models require extensive input data pre-processing, causing CPUs to struggle to keep up with the data consumption demands of GPUs during the training stage. This results in the DL pipeline stalling and GPUs running idle. Our work aims to fundamentally address the data stall issue in modern pre-processing DL pipelines. Traditional solutions involve allocating more CPUs to the pre-processing stage to meet GPU demands, but this approach significantly increases energy consumption and provisioning costs. For example, Meta recently disclosed that their DLRM pipeline requires 9 to 55 CPU servers per trainer node, depending on the workload. Our research explores offloading common pre-processing primitives to programmable network hardware, specifically Tofino2-equipped switches known for their high bandwidth and energy efficiency, and the Bluefield-2 SmartNIC. Our initial power measurements demonstrate that Tofino2 and Bluefield-2 achieve 11.6x and 3.0x higher throughput per Watt, respectively, compared to a generic x86 or AMD CPU while performing pre-processing operations. However, due to Tofino2’s limitations in terms of the operations it can perform compared to a CPU, several design optimizations are required to fully exploit the potential of programmable network devices.
Identifier | oai:union.ndltd.org:kaust.edu.sa/oai:repository.kaust.edu.sa:10754/692272 |
Date | 04 1900 |
Creators | Zawawi, Omar |
Contributors | Canini, Marco, Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division, Fahmy, Suhaib A., Keyes, David E. |
Source Sets | King Abdullah University of Science and Technology |
Language | English |
Detected Language | English |
Type | Thesis |
Relation | N/A |