Return to search

Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms

General-Purpose computing on Graphics Processing Units (GPGPU) has attracted a lot of attention recently. Exciting results have been reported in using GPUs to accelerate applications in various domains such as scientific simulations, data mining, bio-informatics and computational finance. However, up to now GPUs can only accelerate data-parallel loops with statically analyzable parallelism. Loops with dynamic parallelism (e.g., with array accesses through subscripted subscripts), an important pattern in many general-purpose applications, cannot be parallelized on GPUs using existing technologies.



Run-time loop parallelization using Thread Level Speculation (TLS) has been proposed in the literatures to parallelize loops with statically un-analyzable dependencies. However, most of the existing TLS systems are designed for multiprocessor/multi-core CPUs. GPUs have fundamental differences with CPUs in both hardware architecture and execution model, making the previous TLS designs not work or inefficient when ported to GPUs. This thesis presents GPUTLS, a runtime system designed to support speculative loop parallelization on GPUs. The design of GPU-TLS addresses several key problems encountered when adapting TLS to GPUs: (1) To reduce the possibility of mis-speculation, deferred-update memory versioning scheme is adopted to avoid mis-speculations caused by inter-iteration WAR and WAW dependencies. A technique named intra-warp value forwarding is proposed to respect some inter-iteration RAW dependencies, which further reduces the mis-speculation possibility. (2) An incremental speculative execution scheme is designed to exploit partial parallelism within loops. This avoids excessive re-executions and reduces the mis-speculation penalty. (3) The dependency checking among thousands of speculative GPU threads poses large overhead and can easily become the performance bottleneck. To lower the overhead, we design several e_cient dependency checking schemes named PRW+BDC, SW, SR, SRW+EDC, and SRW+LDC respectively. (4) We devise a novel parallel commit scheme to avoid the overhead incurred by the serial commit phase in most existing TLS designs.



We have carried out extensive experiments on two platforms with different NVIDIA GPUs, using both a synthetic loop that can simulate loops with different characteristics and several loops from real-life applications. Testing results show that the proposed intra-warp value forwarding and eager dependency checking techniques can improve the performance for almost all kinds of loop patterns. We observe that compared with other dependency checking schemes, SR and SW can achieve better performance in most cases. It is also shown that the proposed parallel commit scheme is especially useful for loops with large write set size and small number of inter-iteration WAW dependencies. Overall, GPU-TLS can achieve speedups ranging from 5 to 105 for loops with dynamic parallelism. / published_or_final_version / Computer Science / Master / Master of Philosophy

  1. 10.5353/th_b4716765
  2. b4716765
Identiferoai:union.ndltd.org:HKU/oai:hub.hku.hk:10722/180014
Date January 2011
CreatorsZhang, Chenggang, 张呈刚
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Source SetsHong Kong University Theses
LanguageEnglish
Detected LanguageEnglish
TypePG_Thesis
Sourcehttp://hub.hku.hk/bib/B47167658
RightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works., Creative Commons: Attribution 3.0 Hong Kong License
RelationHKU Theses Online (HKUTO)

Page generated in 0.0024 seconds