Many algorithms and applications in scientific computing exhibit irregular access patterns: consecutive accesses depend on the structure of the data being processed and therefore cannot be known a priori. This manifests as a lack of temporal and spatial locality, so these applications often perform poorly on traditional processor cache hierarchies. This thesis demonstrates that heterogeneous architectures containing Field Programmable Gate Arrays (FPGAs) alongside traditional processors can improve memory access throughput by 2-3x by using the FPGA to insert data directly into the processor cache, eliminating costly cache misses. When fetching data to be processed directly on the FPGA, scatter-gather Direct Memory Access (DMA) provides the best performance, but its descriptor storage format is inefficient for these classes of applications. The presented optimised storage and on-demand generation of these descriptors leads to a 16x reduction in on-chip Block RAM usage and a two-thirds reduction in data transfer time. Traditional scatter-gather DMA requires a statically defined list of access instructions and is managed by a host processor. The system presented in this thesis extends the DMA operation to allow data-driven memory requests in response to processed data and brings all control on-chip, allowing autonomous operation. This dramatically increases system flexibility and provides a further 11% performance improvement. Graph applications and algorithms for traversing and searching graph data are used throughout this thesis as a motivating example for the optimisations presented, though these optimisations should be equally applicable to a wide range of irregular applications within scientific computing.
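The data-dependent access pattern described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the compressed sparse row (CSR) layout, the function name `bfs_access_trace`, and the four-vertex graph are assumptions chosen for illustration. The sketch runs a breadth-first search and records which positions of the edge array are read, showing that each read location is determined by a value loaded earlier.

```python
from collections import deque

def bfs_access_trace(row_ptr, col_idx, source):
    """Breadth-first search over a CSR graph (illustrative, hypothetical helper).

    Returns the vertex visit order and the sequence of col_idx positions read.
    """
    visited = [False] * (len(row_ptr) - 1)
    visited[source] = True
    queue = deque([source])
    order = []  # vertices in the order they are dequeued
    trace = []  # positions in col_idx, in the order they are read
    while queue:
        v = queue.popleft()
        order.append(v)
        # The range read here depends on v, which was itself loaded from
        # col_idx earlier, so the address stream is not known in advance.
        for i in range(row_ptr[v], row_ptr[v + 1]):
            trace.append(i)
            w = col_idx[i]
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order, trace

# Assumed example graph: 0 -> {3, 1}, 1 -> {2}, 2 -> {}, 3 -> {2}
row_ptr = [0, 2, 3, 3, 4]
col_idx = [3, 1, 2, 2]
order, trace = bfs_access_trace(row_ptr, col_idx, 0)
print(order)  # [0, 3, 1, 2]
print(trace)  # [0, 1, 3, 2]
```

Even on this tiny graph the edge array is read out of order (position 3 before position 2). On large graphs the same effect scatters reads across memory, which is the loss of spatial locality the abstract refers to.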
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:695550 |
Date | January 2016 |
Creators | Bean, Andrew |
Contributors | Cheung, Peter |
Publisher | Imperial College London |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://hdl.handle.net/10044/1/41981 |