• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Design of low-cost multi-thread unified shader architecture

Sun, Ya-hsien 14 February 2011 (has links)
In order to increase the data-path utilization of the programmable graphics processor units (GPU) which often stall by waiting for the execution results of those long-latency instructions, multi-thread technique is very often used in the design of GPU. This thesis proposes a multi-thread single unified core GPU design which owns several key features. First, its processor core can execute not only the vertex and fragment shading programs, but also the software rasteriation module which is mostly implemented by a individual hardware module in other GPU designs. Next, the thread-switching policy in our design is based on the non-preempt blocked scheduling. Normally, whether an instruction will be stalled cannot be detected until it enters the instruction-decode stage. In order to achieve zero-penalty thread switching, a single assistant bit will be padded to each instruction in a thread to tell if the next instruction in the same thread will be stalled or not. This mechanism can help achieve a speed-up of 1.4 in some benchmarks used in this thesis. The register file used in GPU processor is usually equipped with up to four access ports, such that it will occupy a significant portion of the entire GPU especially for muti-thread designs where the register set has to be duplicated by several copies. The implementation cost of the register file can be reduced by decreasing its access port number to two based on the proposed multi-bank approach in this thesis. Our experimental results show that this approach can help reduce the overall gate count by 26.12%. Finally, the rest of fixed-pipeline fragment operation is realized by an iterative time-sharing architecture in order to further save the silicon area. The overall gate count of the proposed GPU is 600K.

Page generated in 0.1422 seconds