
IMPROVING THE UTILIZATION AND PERFORMANCE OF SPECIALIZED GPU CORES

Aaron M Barnes (20767127), 26 February 2025
<p dir="ltr">Specialized hardware accelerators are becoming increasingly common to provide application performance gain despite the slowing trend of transistor scaling. Accelerators can adapt to the compute and data dependency patterns of an application to fully exploit the parallelism of the application and reduce data movement. However, specialized hardware is often limited by the application it was tailored to, which can lead to idle or inactive silicon in computations that do not match the exact patterns it was designed for. In this work I study two cases of GPU specialization and techniques that can be used to improve performance in a broader domain of applications. </p><p dir="ltr">First, I examine the effects of GPU core partitioning, a trend in contemporary GPUs to sub-divide core components to reduce area and energy overheads. Core partitioning is essentially a specialization of the hardware towards balanced applications, wherein the intra-core connectivity provides minimal benefit but takes up valuable on-chip area. I identify four orthogonal performance effects of GPU core sub-division, two of which have significant impact in practice: a bottleneck in the read operand stage caused by the reduced number of collector units and register banks allocated to each sub-core, and an instruction issue imbalance across sub-core schedulers caused by a simple round robin assignment of threads to sub-cores. To alleviate these issues I propose a Register Bank Aware (RBA) warp scheduler, which uses feedback from current register bank contention to inform thread scheduling decisions, and a hashed sub-core work scheduler to prevent pathological issue imbalances caused by round robin scheduling. I rigorously evaluate these designs in simulation and show they are able to capture 81% of the performance lost due to core subdivision. Further, I evaluate my techniques using synthesis tools and find that RBA is able to achieve performance equivalent to doubling the number of operand Collector Units (CUs) per sub-core with only a 1% increase in area and power.</p><p dir="ltr">Second, I study the inclusion of specialized ray tracing accelerator cores on GPUs. Specialized ray-tracing acceleration units have become a common feature in GPU hardware, enabling real-time ray-tracing of complex scenes for the first time. The ray-tracing unit accelerates the traversal of a hierarchical tree data structure called a bounding volume hierarchy to determine whether rays have intersected triangle primitives. Hierarchical search algorithms are a fundamental software pattern common in many important domains, such as recommendation systems and point cloud registration, but are difficult for GPUs to accelerate because they are characterized by extensive branching and recursion. The ray-tracing unit overcomes these limitations with specialized hardware to traverse hierarchical data structures efficiently, but is mired by a highly specialized graphics API, which is not readily adaptable to general-purpose computation. In this work I present the Hierarchical Search Unit (HSU), a flexible datapath to accelerate a more general class of hierarchical search algorithms, of which ray-tracing is one. I synthesize a baseline ray-intersection datapath and maximize functional unit reuse while extending the ray-tracing unit to support additional computations and a more general set of instructions. 
Second, I study the inclusion of specialized ray-tracing accelerator cores on GPUs. Ray-tracing acceleration units have become a common feature in GPU hardware, enabling real-time ray tracing of complex scenes for the first time. The ray-tracing unit accelerates the traversal of a hierarchical tree data structure called a bounding volume hierarchy (BVH) to determine whether rays intersect triangle primitives. Hierarchical search algorithms are a fundamental software pattern in many important domains, such as recommendation systems and point cloud registration, but they are difficult for GPUs to accelerate because they involve extensive branching and recursion. The ray-tracing unit overcomes these limitations with specialized hardware that traverses hierarchical data structures efficiently, but it is locked behind a highly specialized graphics API that is not readily adaptable to general-purpose computation. In this work I present the Hierarchical Search Unit (HSU), a flexible datapath that accelerates a more general class of hierarchical search algorithms, of which ray tracing is one. I synthesize a baseline ray-intersection datapath and maximize functional unit reuse while extending the ray-tracing unit to support additional computations and a more general set of instructions. I demonstrate that the unit improves the performance of three hierarchical search data structures, spanning approximate nearest neighbor search algorithms and a B-tree key-value store index. With only a minimal extension to the existing unit, HSU improves the state-of-the-art GPU approximate nearest neighbor implementation by an average of 24.8% through the GPU's general computing interface.
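BVH traversal, approximate nearest neighbor search, and B-tree lookup share one software pattern: an explicit-stack hierarchical search that tests a node's bound against a query and then either descends into its children or pops the next candidate. The toy C++ sketch below illustrates that pattern in one dimension; the node layout and distance test are invented for illustration and do not reflect the HSU's actual instructions or datapath.

```cpp
// Toy illustration of the stack-based hierarchical search pattern that the
// abstract describes the ray-tracing unit (and the proposed HSU) accelerating.
#include <cmath>
#include <cstdio>
#include <vector>

struct Node {
    float lo, hi;       // 1-D bounding interval (stands in for a BVH box)
    int   left, right;  // child indices, or -1 for leaves
    float point;        // leaf payload: a data point
};

// Return the closest leaf point to `q`, pruning subtrees whose interval is
// farther from `q` than the best distance found so far.
float nearest(const std::vector<Node>& tree, float q) {
    float best = INFINITY, best_point = NAN;
    std::vector<int> stack = {0};          // start at the root
    while (!stack.empty()) {
        const Node& n = tree[stack.back()];
        stack.pop_back();
        float gap = (q < n.lo) ? n.lo - q : (q > n.hi) ? q - n.hi : 0.0f;
        if (gap >= best) continue;         // prune: bound is too far away
        if (n.left < 0) {                  // leaf: update best candidate
            float d = std::fabs(n.point - q);
            if (d < best) { best = d; best_point = n.point; }
        } else {                           // interior: descend into children
            stack.push_back(n.left);
            stack.push_back(n.right);
        }
    }
    return best_point;
}

int main() {
    // Root covering [0,10], two leaves holding the points 2.0 and 9.0.
    std::vector<Node> tree = {
        {0.0f, 10.0f,  1,  2, 0.0f},
        {0.0f,  5.0f, -1, -1, 2.0f},
        {5.0f, 10.0f, -1, -1, 9.0f},
    };
    std::printf("nearest to 8.3 -> %.1f\n", nearest(tree, 8.3f));
}
```

It is this data-dependent test-then-descend loop that maps poorly onto SIMT execution, which is why a dedicated traversal datapath can pay off across all three workloads.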
