
COLLECTIVE COMMUNICATION AND BARRIER SYNCHRONIZATION ON NVIDIA CUDA GPU

GPUs (Graphics Processing Units) employ a multithreaded execution model built from multiple SIMD cores. Compared to a single SIMD engine, this architecture scales to many more processing elements. However, it sacrifices the timing properties that made barrier synchronization implicit and collective communication operations fast.
This thesis demonstrates efficient methods by which these aggregate functions can be implemented using unmodified NVIDIA CUDA GPUs. Although NVIDIA's highest “compute capability” GPUs provide atomic memory functions, those functions have execution time linear in the number of participants (order N). In contrast, the methods proposed here take advantage of basic properties of the GPU architecture to yield implementations that are both efficient and portable to all CUDA-capable GPUs. A variety of coordination operations are synthesized, and the algorithm, CUDA code, and performance of each are discussed in detail.
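For context, below is a minimal sketch of the kind of atomic-based inter-block barrier the abstract contrasts with the thesis's methods. It is not code from the thesis: the names (naive_barrier, g_count, g_sense) are hypothetical, and the sketch assumes every thread block is simultaneously resident on the GPU, since a block waiting on one that has not yet been scheduled would spin forever.

#include <cuda_runtime.h>

// Global state for a sense-reversing barrier across thread blocks.
__device__ unsigned int g_count = 0;  // blocks that have arrived
__device__ unsigned int g_sense = 0;  // flips once per barrier episode

__device__ void naive_barrier(unsigned int num_blocks)
{
    __shared__ unsigned int target_sense;

    // One thread per block samples the sense this episode must flip to.
    if (threadIdx.x == 0)
        target_sense = atomicAdd(&g_sense, 0u) ^ 1u;
    __syncthreads();

    if (threadIdx.x == 0) {
        // Each arriving block increments the counter; the last one
        // resets it and releases the others by flipping the sense.
        if (atomicAdd(&g_count, 1u) == num_blocks - 1u) {
            g_count = 0u;
            __threadfence();                 // publish the reset first
            atomicExch(&g_sense, target_sense);
        } else {
            while (atomicAdd(&g_sense, 0u) != target_sense)
                ;                            // spin until released
            __threadfence();                 // order later reads after release
        }
    }
    __syncthreads();  // hold the whole block until thread 0 is released
}

Every call funnels all N blocks through atomic operations on the same two words, so the handshake serializes at the memory system and costs time proportional to N, which is the order-N behavior the abstract attributes to atomic-based coordination.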

Identifier: oai:union.ndltd.org:uky.edu/oai:uknowledge.uky.edu:gradschool_theses-1639
Date: 01 January 2009
Creators: Rivera-Polanco, Diego Alejandro
Publisher: UKnowledge
Source Sets: University of Kentucky
Detected Language: English
Type: text
Format: application/pdf
Source: University of Kentucky Master's Theses