On the Programmability and Performance of OpenCL Designs for FPGA

Field-programmable gate arrays (FPGAs) have been emerging as a promising platform for several types of accelerators that span domains such as finance, web search, and data center networking, among others. Research interest in facilitating the development of accelerators on FPGAs is increasing significantly, in particular because of their effectiveness across a variety of applications, their flexibility, and their high performance per watt. However, several key challenges remain that hinder their large-scale deployment. Overcoming these challenges would enable FPGAs to match the pervasiveness of graphics processing units (GPUs), their principal competitors in this arena. One of the primary reasons for the slow adoption by programmers has been the programming model, which uses a low-level hardware description language (HDL).

Using HDLs requires a detailed understanding of logic design and significant effort to implement and verify the behavioral models, with that effort growing as the design becomes more complex. Recent advancements in high-level synthesis (HLS) tools have addressed this challenge to a considerable extent by allowing programmers to write their applications in OpenCL, a high-level language. These applications are then compiled and synthesized to create a bitstream that configures the FPGA. This thesis characterizes the efficacy of HLS compiler optimizations that can be employed to improve the performance of these applications.

The hardware synthesized from OpenCL kernels is fundamentally different from traditional hardware such as CPUs and GPUs, which exploit instruction-level parallelism (ILP), thread-level parallelism (TLP), or data-level parallelism (DLP) for performance gains. FPGAs typically use deep pipelining (i.e., ILP) for performance. A stall in this pipeline may severely undermine application performance. Thus, it is imperative to identify and remove any such bottlenecks. To this end, this thesis presents and discusses a software-centric framework to debug and profile the synthesized designs generated using HLS tools. It proposes basic code patterns, including a timestamp and a scalable framework, which can be plugged easily into OpenCL kernels to collect and process run-time information dynamically. This scalable framework has a small overhead in area utilization and frequency but provides fine-grained information about the bottlenecks and latencies in the design.
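The timestamp idea can be illustrated with a small software model. The following is a minimal sketch in plain C, not the thesis's OpenCL code: the names `trace_t`, `trace_record`, and `trace_latency`, the buffer size, and the probe numbering are all hypothetical, and the counter value is passed in as an argument where a real design would read a hardware counter. It shows only the general pattern of recording timestamps at probe points and deriving latencies from them afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 64 /* hypothetical capacity of the trace buffer */

/* One record per probe hit: which probe fired and the counter value then. */
typedef struct {
    uint32_t probe_id;
    uint64_t timestamp;
} sample_t;

typedef struct {
    sample_t samples[MAX_SAMPLES];
    size_t count;
} trace_t;

/* Record a timestamp for a probe. `now` stands in for a cycle counter
 * (an assumption of this sketch). Samples are dropped once the buffer is full. */
static void trace_record(trace_t *t, uint32_t probe_id, uint64_t now)
{
    assert(t != NULL);
    if (t->count < MAX_SAMPLES) {
        t->samples[t->count].probe_id = probe_id;
        t->samples[t->count].timestamp = now;
        t->count++;
    }
}

/* Post-processing step: latency from the first hit of probe `from` to the
 * first later hit of probe `to`. Returns 1 and writes *latency on success,
 * 0 if no such pair exists in the trace. */
static int trace_latency(const trace_t *t, uint32_t from, uint32_t to,
                         uint64_t *latency)
{
    assert(t != NULL && latency != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->samples[i].probe_id != from) {
            continue;
        }
        for (size_t j = i + 1; j < t->count; j++) {
            if (t->samples[j].probe_id == to) {
                *latency = t->samples[j].timestamp - t->samples[i].timestamp;
                return 1;
            }
        }
        return 0; /* `from` was hit, but `to` never followed it */
    }
    return 0; /* `from` was never hit */
}
```

In this model, a stage whose measured latency is much larger than its neighbors' would be a candidate bottleneck; how the thesis stores and reads back such samples on the FPGA is not described in this abstract.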

Additionally, although HLS tools have improved programmability, this may come at the cost of performance or area utilization. This thesis addresses this design trade-off via a comparative study of a hand-coded design in HDL and an architecturally similar, tool-generated design using an OpenCL compiler in the application area of 3D-stencil (i.e., structured grid) computation. Experiments in this thesis show that an OpenCL approach can achieve 95% of the peak attainable performance of a microkernel for multiple problem sizes. In comparison to the OpenCL approach, an HDL approach results in approximately 50% less memory usage and only 2% better performance on average.

/ MS /

A hardware chip consists of switches, or transistors, and a modern chip can have a few billion of them. Specifying the interconnection among these transistors and their placement on a chip is a complex problem. To simplify this, the chip-design flow uses automated tools and abstraction at different levels of the flow, such as architecture, design, synthesis, and placement, among others. During design, an engineer specifies the behavioral model in a hardware description language (HDL), which is later used by the automated tools for further processing. Using an HDL requires a detailed understanding of logic design and significant effort to implement and verify the behavioral models, with that effort growing as the design becomes more complex. Recent advancements in high-level synthesis tools have addressed this challenge to a considerable extent by allowing programmers to write their applications in a high-level language. This thesis characterizes the efficacy of such a tool and the available optimizations that can be employed to improve the performance of these applications.

Additionally, this thesis presents and discusses a framework to debug and profile the designs generated using high-level synthesis tools, which can be plugged easily into an application, to collect and process run-time information dynamically. This scalable framework has a small overhead but provides fine-grained information about the bottlenecks in the design. Furthermore, the experiments in this work show that a design generated from a high-level synthesis tool has similar performance when compared to a manual design in HDL, at the expense of area utilization.
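For readers unfamiliar with the 3D-stencil (structured grid) computation used in the comparative study, the following is a minimal reference sketch in plain C. It assumes a 7-point stencil with one center coefficient and one shared neighbor coefficient; the abstract does not state the stencil shape, coefficients, or data type the thesis uses, so `stencil7`, `idx`, and their parameters are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx * ny * nz grid, with x varying fastest. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* One sweep of an assumed 7-point stencil over the interior of the grid.
 * Each interior output cell is a weighted sum of the matching input cell
 * and its six face neighbors. Boundary cells of `out` are left untouched. */
static void stencil7(const float *in, float *out,
                     size_t nx, size_t ny, size_t nz,
                     float c_center, float c_neighbor)
{
    assert(in != NULL && out != NULL && in != out);
    assert(nx >= 3 && ny >= 3 && nz >= 3);

    for (size_t z = 1; z + 1 < nz; z++) {
        for (size_t y = 1; y + 1 < ny; y++) {
            for (size_t x = 1; x + 1 < nx; x++) {
                size_t i = idx(x, y, z, nx, ny);
                float neighbors = in[i - 1] + in[i + 1]              /* x -/+ 1 */
                                + in[i - nx] + in[i + nx]            /* y -/+ 1 */
                                + in[i - nx * ny] + in[i + nx * ny]; /* z -/+ 1 */
                out[i] = c_center * in[i] + c_neighbor * neighbors;
            }
        }
    }
}
```

Each output cell depends only on a fixed neighborhood of the input, and that regular access pattern is what makes this kind of loop nest a natural candidate for a deeply pipelined implementation.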

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/92699
Date: 09 February 2018
Creators: Verma, Anshuman
Contributors: Electrical and Computer Engineering, Feng, Wu-chun, Zhou, Huiyang, Athanas, Peter M.
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
