Scientific computing and machine learning are two major drivers of modern computing demand. The former underpins physics simulation and mathematical modeling, while the latter has become the mainstream approach to tasks such as image classification and natural language processing. Despite their seemingly disparate application domains, the two classes of workloads are fundamentally similar: their dominant computing kernels are both sparse or dense matrix-vector multiplications. Moreover, the conventional computing platforms for these workloads, multicore CPUs, GPUs, and microprocessors, consume large amounts of energy and in many cases still fall short in performance for cloud or edge applications. It is therefore essential to develop custom hardware accelerators that improve energy efficiency and performance and alleviate the performance bottlenecks of CPUs and GPUs.
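To make the shared kernel concrete, a sparse matrix-vector multiplication over the common compressed sparse row (CSR) format can be sketched in a few lines of Python (the format choice and names are illustrative assumptions, not any of the designs below):

    def spmv_csr(values, col_idx, row_ptr, x):
        """y = A @ x for a CSR-format sparse matrix A.

        values  -- nonzero entries, stored row by row
        col_idx -- column index of each nonzero
        row_ptr -- row_ptr[i]:row_ptr[i+1] spans row i's nonzeros
        """
        n_rows = len(row_ptr) - 1
        y = [0.0] * n_rows
        for i in range(n_rows):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += values[k] * x[col_idx[k]]
        return y

    # 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
    print(spmv_csr([4.0, 1.0, 2.0, 3.0, 5.0],
                   [0, 2, 1, 0, 2],
                   [0, 2, 3, 5],
                   [1.0, 1.0, 1.0]))  # -> [5.0, 2.0, 8.0]

The dense case is the same inner dot-product loop with every entry stored, which is why a single class of matrix-vector hardware can serve both workload families.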
Toward this goal, this thesis presents four hardware accelerator and architecture designs that demonstrate significant improvements in key metrics over prior art. The first design is a novel solver chip for partial differential equations (PDEs), a critical mathematical model in scientific computing. Prototyped in 65 nm, the chip features a programmable floating-point processor-array architecture. It dramatically expands the range of mappable problems and improves solution precision compared to prior art while being 40x faster at the same energy-delay product.
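As a toy illustration of the kind of computation such a solver performs (a software sketch of Jacobi relaxation for the 2D Laplace equation; not the chip's actual architecture, algorithm, or problem mapping):

    def jacobi_laplace(grid, iters=1000):
        """Relax the interior of a 2D grid toward the solution of
        Laplace's equation; boundary values are held fixed."""
        n, m = len(grid), len(grid[0])
        for _ in range(iters):
            new = [row[:] for row in grid]
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    # Each point moves to the average of its neighbors.
                    new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
            grid = new
        return grid

    # 5x5 grid: top edge held at 1.0, remaining boundaries at 0.0
    g = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
    print(jacobi_laplace(g)[2])  # steady-state values, middle row

Each sweep is effectively a sparse matrix-vector product with the discretized operator, which ties the PDE solver back to the kernels discussed above.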
The second design, part of a system on a chip (SoC) manufactured in 28 nm, is an accelerator for the inference of convolutional neural networks (CNNs), a popular type of neural network model in machine learning. It incorporates digital in-memory computing modules to execute matrix-vector multiplications efficiently. The resulting SoC achieves an 88x improvement in energy-delay product over prior art.
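One way to see why matrix-vector hardware serves CNN inference: a convolutional layer can be lowered to matrix-vector products over unrolled input patches (the classic im2col view). A small sketch of that mapping, illustrative only and not the accelerator's dataflow:

    def conv2d_as_matvec(image, kernel):
        """Lower a 2D convolution (valid padding, stride 1) to a
        matrix-vector product: stacking the flattened input patches
        gives a matrix whose product with the flattened kernel vector
        yields every output pixel (im2col)."""
        kh, kw = len(kernel), len(kernel[0])
        oh = len(image) - kh + 1
        ow = len(image[0]) - kw + 1
        w = [v for row in kernel for v in row]  # flattened kernel vector
        out = []
        for i in range(oh):
            for j in range(ow):
                patch = [image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)]
                out.append(sum(a * b for a, b in zip(w, patch)))
        return out  # flattened oh x ow output map

    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(conv2d_as_matvec(img, [[1, 0], [0, -1]]))  # -> [-4, -4, -4, -4]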
The third design improves the energy efficiency of sparse matrix-vector multiplication (SpMV) by reducing off-chip data movement. Using the gzip compression algorithm, it compresses matrix data offline in off-chip memory and decompresses it on the fly at runtime with custom on-chip decompression hardware. Prototyped in 28 nm, the chip achieves a 2.32x system-level energy-efficiency improvement over prior art.
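The data flow can be sketched in software, with Python's gzip/zlib standing in for the custom decompression hardware (the data contents and the 16-byte burst size are illustrative assumptions):

    import gzip
    import zlib
    from array import array

    # Offline: gzip-compress the matrix data once, before execution.
    values = array('d', [0.0, 1.0] * 512)        # stand-in matrix values
    raw = values.tobytes()
    packed = gzip.compress(raw, compresslevel=9)
    print(f"{len(raw)} B raw -> {len(packed)} B compressed")

    # Runtime: decompress the stream incrementally, one small burst at
    # a time, as an on-chip decompressor fed from DRAM would.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)   # gzip container
    out = bytearray()
    for off in range(0, len(packed), 16):               # 16-byte "bursts"
        out += d.decompress(packed[off:off + 16])
    out += d.flush()
    assert bytes(out) == raw

Only the compressed bytes cross the off-chip interface, which is where the energy saving of this approach comes from.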
The fourth design applies on-the-fly gzip decompression to commercial GPU platforms to expand the effective off-chip memory bandwidth. It proposes a compressed-block cache prefetching scheme that addresses the critical challenges of using gzip for memory decompression in GPUs. Evaluated with open-source simulators, the design achieves 5.3-20.3% performance improvements over the baseline.
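The exact scheme is the thesis's contribution; as a generic illustration of the cache-plus-prefetch idea only (the class name, FIFO eviction, and next-block policy here are assumptions, not the proposed design):

    import gzip

    class DecompressedBlockCache:
        """Toy model: keep blocks cached after decompression and
        eagerly prefetch the next block, hiding gzip's serial decode
        latency behind accesses to the current block."""
        def __init__(self, compressed_blocks, decompress, capacity=4):
            self.blocks = compressed_blocks   # block id -> gzip bytes
            self.decompress = decompress
            self.capacity = capacity
            self.cache = {}                   # block id -> raw bytes

        def read(self, block_id):
            if block_id not in self.cache:
                self._fill(block_id)          # demand miss: decode now
            data = self.cache[block_id]
            if block_id + 1 in self.blocks:
                self._fill(block_id + 1)      # prefetch the next block
            return data

        def _fill(self, block_id):
            if block_id in self.cache:
                return
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))  # evict oldest
            self.cache[block_id] = self.decompress(self.blocks[block_id])

    blocks = {i: gzip.compress(bytes([i]) * 64) for i in range(8)}
    c = DecompressedBlockCache(blocks, gzip.decompress)
    assert c.read(0) == bytes([0]) * 64   # miss on 0, prefetches block 1
    assert 1 in c.cache                   # sequential access now hits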
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/qa75-6t37
Date | January 2025
Creators | Huang, Xuanyuanliang (Paul)
Source Sets | Columbia University
Language | English
Detected Language | English
Type | Theses