
Resource-Efficient Data Pre-Processing for Deep Learning

Zawawi, Omar 04 1900
It is projected that by 2026, most workloads in cloud data centers will be Deep Learning (DL) workloads. However, these workloads pose significant challenges due to their high computational demands, requiring infrastructure and platform advancements to meet DL’s performance, efficiency, and scalability requirements. One emerging problem in large-scale DL is the data stall issue, which occurs when DL models require extensive input data pre-processing, causing CPUs to struggle to keep up with the data consumption demands of GPUs during the training stage. This results in the DL pipeline stalling and GPUs running idle. Our work aims to fundamentally address the data stall issue in modern pre-processing DL pipelines. Traditional solutions involve allocating more CPUs to the pre-processing stage to meet GPU demands, but this approach significantly increases energy consumption and provisioning costs. For example, Meta recently disclosed that their DLRM pipeline requires 9 to 55 CPU servers per trainer node, depending on the workload. Our research explores offloading common pre-processing primitives to programmable network hardware, specifically Tofino2-equipped switches known for their high bandwidth and energy efficiency, and the Bluefield-2 SmartNIC. Our initial power measurements demonstrate that Tofino2 and Bluefield-2 achieve 11.6x and 3.0x higher throughput per Watt, respectively, compared to a generic x86 or AMD CPU while performing pre-processing operations. However, due to Tofino2’s limitations in terms of the operations it can perform compared to a CPU, several design optimizations are required to fully exploit the potential of programmable network devices.
