Global ETD Search

Return to search

Tuned and asynchronous stencil kernels for CPU/GPU systems

We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single- and double-precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on a NVIDIA C1060. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of a chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.

http://hdl.handle.net/1853/29728

Hybrid

High performance computing

Architecture

Chaotic relaxation

Tesla

Linear system of equations

Iterative methods (Mathematics)

Kernel functions

Identifer	oai:union.ndltd.org:GATECH/oai:smartech.gatech.edu:1853/29728
Date	18 May 2009
Creators	Venkatasubramanian, Sundaresan
Publisher	Georgia Institute of Technology
Source Sets	Georgia Tech Electronic Thesis and Dissertation Archive
Detected Language	English
Type	Thesis

Page generated in 0.0065 seconds

Tuned and asynchronous stencil kernels for CPU/GPU systems

Description

Links & Downloads

Tags

Additional Fields