Return to search

Mapping recursive functions to reconfigurable hardware

Reconfigurable computing is a method of development that provides a developer with the ability to reprogram a hardware device. In the specific case of FPGAs this allows for rapid and cost effective implementation of hardware devices when compared to standard a ASIC design, coupled with an increase in performance when compared to software based solutions. With the advent of development tools such as Celoxica's DK package and Xilinx's Forge package, that support languages traditionally associated with software development, a change in the skill sets required to develop FPGA solutions from hardware designers to software programmers is possible and perhaps desirable to increase the adoption of FPGA technologies. To support developers with these skill sets tools should closely mirror current software development tools in terms of language, syntax and methodology, while at the same time both transparently and automatically take advantage of as much of the increased performance that reconfigurable architectures can provide over traditional software architectures by utilizing the parallelism and the ability to create arbitrary depth pipelines which is not present in traditional microprocessor designs. A common feature of many programming languages that is not supported by many higher level design tools is recursion. Recursion is a powerful method used to elegantly describe many algorithms. Recursion is typically implemented by using a stack to store arguments, context and a return address for function calls. This however limits the controlling hardware to running only a single function at any moment which eliminates an algorithm's ability to take advantage of the parallelism available between successive iterations of a recursive function. This squanders the high amount of parallelism provided by the resources on the FPGA thus reducing the performance of the recursive algorithm. This thesis presents a method to address the lack of support for recursion in design tools that exploits the parallelism available between recursive calls. It does this by unrolling the recursion into a pipeline, in a similar manner to the pipeline obtained from loop unrolling, and then streaming the data through the resulting pipeline. However essential differences between loops and recursive functions such as multiple recursive calls in a function, and hence multiple unrollings, and post-recursive statements add further complexity to the issue of unrolling as the pipeline may take a non-linear shape and contain heterogeneous stages. Unrolling the recursive function on the FPGA increases the parallelism available, however the depth of the pipline and therefore the amount of parallelism available, is limited by the finite resources on the FPGA. To make efficient use of the resources on the FPGA the system must be able to unroll the function in a way to best suit the input but also must ensure that the function is not unrolled past its maximum recursive depth. A trivial solution such as unrolling on-demand introduces a latency into the system when a further instance of the function is unrolled that reduces overall performance. To reduce this penalty it is desirable for the system to be able to predict the behaviour of the recursive function based on the input data and unroll the function to a suitable length prior to it being required. Accurate prediction is possible in cases where the condition for recursion is a simple function on the arguments, however in cases where the condition for recursion is based on complex functions, such as the entire recursive function, accurate prediction is not possible. In situations such as this a heuristic is used which provides a close approximation to the correct depth of recursion at any given time. This prediction allows the system to reduce the performance penalty from real time unrolling without over utilization of the the FPGA resources. Results obtained demonstrate the increase in performance for various recursive functions obtained from the increased parallelism, when compared to a stack based implementation on the same device. In certain instances due to constraints on hardware availability results were gained from device simulation using a simulator developed for this purpose. Details of this simulator are presented in this thesis.

Identiferoai:union.ndltd.org:ADTP/230141
Date January 2005
CreatorsFerizis, George, Computer Science & Engineering, Faculty of Engineering, UNSW
PublisherAwarded by:University of New South Wales. Computer Science and Engineering
Source SetsAustraliasian Digital Theses Program
LanguageEnglish
Detected LanguageEnglish
RightsCopyright George Ferizis, http://unsworks.unsw.edu.au/copyright

Page generated in 0.0024 seconds