
Designing, Implementing and Programming a Massively Multithreaded Spatial Accelerator Architecture

Application-specific integrated circuit (ASIC) accelerators are widely seen as the go-to solution for improving application performance and energy efficiency in the face of the collapse of Dennard scaling and the so-called “power wall” facing modern computer systems design. Given this reality, the proportion of modern system-on-chip (SoC) die area which can be truly called a programmable general-purpose computer has been in rapid decline for at least the past decade. This shrinking natural habitat for end-user-developed computer programs on the physical substrate of modern computer systems, however, is at odds with computers’ greatest asset: their flexibility and their inherent ability to perform whatever wishes exist in the minds of their programmers.

Programmable accelerators already exist as part of the evolving landscape of heterogeneous computing to address the performance gap between general-purpose application processors and these specialized ASICs. Graphics processing units (GPUs) push pixels, with their processing elements (PEs) in lockstep, and can train neural networks among other embarrassingly parallel tasks, while digital signal processors (DSPs) connect our smartphones to the analog world of radio frequency communications. Field-programmable gate arrays (FPGAs) allow digital designers to create logic circuits that can emulate ASICs and share their ability to express spatial parallelism but can be reprogrammed as needed. Yet FPGAs pay a price for their flexibility: their bitwise reconfigurable digital logic and low compute density can often yield order-of-magnitude reductions in power-performance-area (PPA) figures of merit compared to ASICs, and in order to achieve a high quality of result, FPGAs typically need users with digital design expertise. Spatial architectures seek to reduce the PPA gulf between ASICs and currently widely deployed programmable accelerators by allowing programmers to express compute kernels in high-level programming languages that assume minimal specialized knowledge and run these kernels on a parallel substrate of fixed instruction set architecture (ISA) PEs.

These hardened VLSI datapaths can take advantage of spatial parallelism and irregular control flow with PPA results closer to ASICs than the existing mainstream options for programmable acceleration. This thesis focuses on a particular class of spatial architectures called locally autonomous spatial architectures, in which PEs function as tiny processors unto themselves, with completely independent control and their own separate programs, communicating over a spatial interconnect fabric and collaborating with other PEs to perform parallel computation. They can accomplish this with improved scalability over centrally controlled alternative spatial architectures due to the lack of physically intractable global control mechanisms.

This thesis proposes a complete spatial kernel acceleration system. It establishes a model of computation for the underlying accelerator architecture based on dataflow process networks (DPNs), and develops an original “spatial thread” (ST) accelerator architecture for executing kernels defined as DPNs on a spatial substrate. It then performs a detailed microarchitectural analysis to arrive at an optimal microarchitectural scheme for spatial PEs and weighs the relative benefits of conventional heavyweight packet-switched networks-on-chip (NoCs) as opposed to lightweight circuit-switched alternatives for the spatial array. Next, it describes the design, implementation, verification and testing of a complete ultra-low-power (ULP) test integrated circuit (IC) for an instance of the proposed accelerator architecture and microarchitecture named “Catena” and discusses the low-power techniques used to achieve energy efficiency. Finally, it outlines a proposed compiler scheme for compiling sequential OpenCL C kernels via LLVM to DPNs to be executed by the accelerator.
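
To make the dataflow process network idea concrete, the following is a minimal Python sketch, not the thesis's actual model or the Catena programming interface. It assumes unbounded FIFO channels and a simple firing rule (a process fires once every input channel holds a token). The names `Channel`, `Process`, `run`, and `axpy_network` are hypothetical and introduced only for illustration.

```python
from collections import deque


class Channel:
    """Unbounded FIFO channel carrying tokens between two processes (assumed)."""

    def __init__(self):
        self.tokens = deque()

    def push(self, value):
        self.tokens.append(value)

    def pop(self):
        return self.tokens.popleft()

    def __len__(self):
        return len(self.tokens)


class Process:
    """A node with its own function; it fires when every input has a token."""

    def __init__(self, fn, inputs, outputs):
        self.fn = fn
        self.inputs = inputs
        self.outputs = outputs

    def can_fire(self):
        return all(len(ch) > 0 for ch in self.inputs)

    def fire(self):
        # Consume one token per input, compute, and broadcast the result.
        args = [ch.pop() for ch in self.inputs]
        result = self.fn(*args)
        for ch in self.outputs:
            ch.push(result)


def run(processes):
    """Keep firing any enabled process until no process can fire."""
    progress = True
    while progress:
        progress = False
        for p in processes:
            if p.can_fire():
                p.fire()
                progress = True


def axpy_network(a, xs, ys):
    """Compute a*x + y element-wise with two cooperating processes."""
    x_in, y_in, scaled, out = Channel(), Channel(), Channel(), Channel()
    for x in xs:
        x_in.push(x)
    for y in ys:
        y_in.push(y)
    scale = Process(lambda x: a * x, [x_in], [scaled])
    add = Process(lambda s, y: s + y, [scaled, y_in], [out])
    # Listing order is arbitrary: firing is driven by token availability.
    run([add, scale])
    return list(out.tokens)


if __name__ == "__main__":
    print(axpy_network(2, [1, 2, 3], [10, 20, 30]))
```

In this sketch each process carries its own function and decides locally when to fire, loosely mirroring the locally autonomous PEs described above; a hardware fabric would use bounded channels with backpressure rather than unbounded queues.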

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/1ajj-sn76
Date: January 2023
Creators: Repetti, Thomas James
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
