
Accelerating Parallel Tasks by Optimizing GPU Hardware Utilization

Efficient GPU applications rely on programmers carefully structuring their code to fully utilize GPU resources. In general, programmers spend a significant amount of time optimizing their applications to run efficiently on domain-specific architectures. To reduce this burden, I create several hardware and software solutions that improve resource utilization on parallel processors without significant programmer intervention.

GPUs are increasingly being deployed in data centers to accelerate latency-driven applications, which exhibit only a modest amount of data parallelism. Synchronous kernel execution in these applications cannot fully utilize the GPU. A GPU therefore provides multiple hardware queues to improve throughput by executing multiple kernels on a single device simultaneously when sufficient hardware resources are available. However, the GPU becomes severely underutilized once the space in these queues is exhausted, and the performance benefit vanishes as parallelism decreases. To address this, I proposed Pagoda, a GPU runtime system that virtualizes the GPU hardware resources using an OS-like daemon kernel called the MasterKernel. Tasks (kernels) are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity to increase GPU throughput for latency-driven applications. This work introduces several programming APIs to handle task spawning and synchronization, and includes parallel task and warp scheduling policies to reduce runtime overhead.
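
As a concrete picture of the baseline this paragraph describes, the sketch below launches many narrow kernels into separate CUDA streams so the hardware queues can overlap them. The kernel, task count, and sizes are illustrative assumptions, not taken from the thesis; Pagoda replaces this launch path with tasks spawned onto its persistent MasterKernel and scheduled at warp granularity.

#include <cuda_runtime.h>

__global__ void narrow_task(float *data, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;   // only a few warps of work per task
}

int main() {
    const int kTasks = 64, kElems = 1024;   // 64 tiny independent jobs
    float *buf;
    cudaMalloc(&buf, kTasks * kElems * sizeof(float));

    cudaStream_t streams[kTasks];
    for (int t = 0; t < kTasks; ++t) {
        cudaStreamCreate(&streams[t]);
        // One small kernel per job; how many actually overlap is bounded by
        // the hardware queue slots, the limit that Pagoda virtualizes.
        narrow_task<<<(kElems + 127) / 128, 128, 0, streams[t]>>>(
            buf + t * kElems, kElems, 2.0f);
    }
    cudaDeviceSynchronize();

    for (int t = 0; t < kTasks; ++t) cudaStreamDestroy(streams[t]);
    cudaFree(buf);
    return 0;
}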

Latency-driven applications have both high throughput demands and response-time constraints. These applications may launch many kernels that do not fully utilize the GPU unless they are grouped into large batches. However, batching forces jobs to wait, which increases their latency, and this wait time can be unacceptable given real-world job arrival times. Moreover, the round-robin GPU kernel scheduler is oblivious to application deadlines, and this deadline-blind scheduling policy makes it harder to ensure that kernels meet their QoS deadlines. To enhance the responsiveness of the GPU, I also proposed LAX, which includes an execution-time estimate for jobs consisting of one or many kernels. LAX adjusts the priorities of kernels dynamically based on their slack time to increase the number of jobs that complete by their real-time deadlines, improving both the responsiveness and the throughput of GPUs.
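
The sketch below is a minimal host-side model of this slack-based prioritization, assuming each job carries a QoS deadline and an execution-time estimate. The Job structure and the least-laxity-first ordering are illustrative only, not the actual LAX hardware mechanism.

#include <algorithm>
#include <vector>

struct Job {
    double deadline_ms;       // absolute QoS deadline for the whole job
    double est_remaining_ms;  // estimated execution time left across its kernels
};

// Slack = time left until the deadline minus the work still required.
// A smaller slack means the job is closer to missing its deadline.
static double slack(const Job &j, double now_ms) {
    return (j.deadline_ms - now_ms) - j.est_remaining_ms;
}

// Reorder the pending jobs so the least-laxity job is dispatched first,
// instead of the deadline-blind round-robin order of the stock scheduler.
void prioritize(std::vector<Job> &pending, double now_ms) {
    std::sort(pending.begin(), pending.end(),
              [now_ms](const Job &a, const Job &b) {
                  return slack(a, now_ms) < slack(b, now_ms);
              });
}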

It is well known that grouping threads into warps can create redundancy across scalar values in GPU vector registers. However, I also found that the layout of thread indices in multi-dimensional threadblocks (TBs) creates redundancy in the registers storing thread IDs. This redundancy propagates into dependent instructions that can be traced and identified statically. To remove these redundant GPU instructions, I proposed DARSIE, which uses a per-kernel compiler finalization check based on TB dimensions to determine which instructions are redundant. Once they are identified, the DARSIE hardware skips TB-redundant instructions before they are fetched. DARSIE uses a new multithreaded register renaming and instruction synchronization technique to share the values from redundant instructions among the warps in each TB. Altogether, DARSIE decreases the number of executed instructions, improving GPU performance and energy efficiency.
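
The kernel below, an illustrative example rather than one from the thesis, shows where this TB-level redundancy comes from: with a two-dimensional TB such as dim3(32, 8), every warp spans the same threadIdx.x values lane-for-lane, so instructions that depend only on threadIdx.x and per-block constants compute identical results in all eight warps of the block.

// Launched with a two-dimensional threadblock, e.g. dim3 block(32, 8):
// each of the 8 warps covers threadIdx.x = 0..31 at a fixed threadIdx.y,
// so the register holding threadIdx.x is identical across warps in the TB.
__global__ void scale_rows(const float *in, float *out, int width) {
    int col     = blockIdx.x * blockDim.x + threadIdx.x;  // same in every warp of the TB
    float scale = 1.0f + 0.5f * col;                       // depends only on col: also TB-redundant
    int row     = blockIdx.y * blockDim.y + threadIdx.y;   // differs per warp: not redundant
    out[row * width + col] = scale * in[row * width + col];
}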

Identifiers: 10.25394/pgs.12209618.v1, oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/12209618
Date: 29 April 2020
Creators: Tsung-Tai Yeh (8775680)
Source Sets: Purdue University
Detected Language: English
Type: Text, Thesis
Rights: CC BY 4.0
Relation: https://figshare.com/articles/Accelerating_Parallel_Tasks_by_Optimizing_GPU_Hardware_Utilization/12209618
