1 |
Design tradeoff analysis of floating-point adder in FPGAs. Malik, Ali. 19 August 2005.
Field Programmable Gate Arrays (FPGAs) are increasingly being used to design high-end, computationally intensive microprocessors capable of handling both fixed- and floating-point mathematical operations. Addition is the most complex operation in a floating-point unit; it contributes the largest delay while occupying significant area. Over the years, the VLSI community has developed many floating-point adder algorithms aimed mainly at reducing overall latency.
An inefficient floating-point adder design on an FPGA incurs major area and performance overheads. With recent advances in FPGA architecture and area density, latency has been the main focus of attention for improving performance. Our research studied and implemented the standard, Leading One Predictor (LOP), and far and close data-path floating-point addition algorithms. Each algorithm contains complex sub-operations that contribute significantly to the overall latency of the design. Each sub-operation was researched across different implementations and then synthesized onto a Xilinx Virtex-II Pro FPGA device so that the best-performing variant could be chosen.
This thesis discusses in detail the best possible FPGA implementation of all three algorithms and will serve as an important design resource. The performance criterion in all cases is latency. The algorithms are compared for overall latency, area, and levels of logic, and are analyzed specifically for the Virtex-II Pro architecture, one of the latest FPGA architectures from Xilinx. According to our results, the standard algorithm is the best implementation with respect to area but has the largest overall latency, 27.059 ns, while occupying 541 slices. The LOP algorithm improves latency by 6.5% at the expense of 38% more area than the standard algorithm. The far and close data-path implementation shows a 19% improvement in latency at the expense of 88% more area than the standard algorithm. The results clearly show that the standard algorithm is the best choice for area-efficient designs, while the far and close data-path algorithm is the best alternative where latency is the performance criterion. The standard and LOP algorithms were pipelined into five stages and compared with the Xilinx intellectual property (IP) core. The pipelined LOP achieves a 22% better clock speed at the expense of 15% more area than the Xilinx IP core, and is thus a better choice for high-throughput applications. Test benches were also developed to exercise these algorithms both in simulation and in hardware.
Our work is an important design resource for the development of floating-point adder hardware on FPGAs. All sub-components within the floating-point adder and all known algorithms are researched and implemented to provide versatility and flexibility to designers, as an alternative to intellectual property cores over which they have no control. The VHDL code is open source and can be used by designers with proper attribution.
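The standard algorithm the abstract compares against follows the classic swap/align/add/normalize flow; the normalization step is the leading-zero count that the LOP algorithm predicts in parallel with the subtraction. The thesis works in VHDL; the following is only an illustrative Python sketch on (sign, exponent, mantissa) triples, omitting IEEE special values and rounding modes.

```python
def fp_add(sign_a, exp_a, man_a, sign_b, exp_b, man_b, man_bits=24):
    """Simplified 'standard' floating-point addition flow.  Mantissas
    carry the hidden leading 1 at bit (man_bits - 1); special values
    and IEEE rounding are omitted for clarity."""
    # 1. Swap stage: ensure operand A has the larger magnitude.
    if exp_b > exp_a or (exp_b == exp_a and man_b > man_a):
        sign_a, exp_a, man_a, sign_b, exp_b, man_b = \
            sign_b, exp_b, man_b, sign_a, exp_a, man_a
    # 2. Align the smaller mantissa (a right barrel shift in hardware).
    man_b >>= (exp_a - exp_b)
    # 3. Add or subtract depending on the signs.
    man = man_a + man_b if sign_a == sign_b else man_a - man_b
    sign, exp = sign_a, exp_a
    if man == 0:
        return 0, 0, 0
    # 4. Normalize.  The left-shift count here is what a Leading One
    #    Predictor computes in parallel with step 3 to cut latency.
    while man >= (1 << man_bits):        # carry-out: shift right once
        man >>= 1
        exp += 1
    while man < (1 << (man_bits - 1)):   # cancel leading zeros
        man <<= 1
        exp -= 1
    return sign, exp, man

# 8-bit mantissas, hidden bit at position 7: 1.0 is (0, 0, 128).
print(fp_add(0, 0, 128, 0, 0, 128, 8))   # 1.0 + 1.0
print(fp_add(0, 1, 128, 1, 0, 128, 8))   # 2.0 - 1.0
```

The far and close data-path variant would split step 2-4 into two parallel paths chosen by the exponent difference; this sketch keeps the single-path version for brevity.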
|
3 |
Hybrid Floating-point Units in FPGAs / Hybrida flyttalsenheter i FPGA:er. Englund, Madeleine. January 2012.
Floating-point numbers are used in many applications that would be well suited to higher parallelism than a CPU offers. In these cases an FPGA, with its ability to handle multiple calculations simultaneously, could be the solution. Unfortunately, floating-point operations implemented in an FPGA are often resource-intensive, which means that many developers avoid floating-point solutions in FPGAs, or avoid using FPGAs for floating-point applications. Here, the potential to obtain less expensive floating-point operations by using a higher radix for the floating-point numbers, and by using and extending the existing DSP block in the FPGA, is investigated. One of the goals is that the FPGA should be usable both by users who have floating point in their designs and by those who do not. In order to motivate hard floating-point blocks in the FPGA, they must not consume too much of the limited resources. This work shows that floating-point addition becomes smaller with the use of the higher radix, while multiplication becomes smaller by using the hardware of the DSP block. When both operations are examined at the same time, it turns out that a reduced area, compared to separate floating-point units, is possible by utilizing both the DSP block and a higher radix for the floating-point numbers.
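The area saving from a higher radix comes from normalization: with radix 2**r the exponent counts factors of 2**r, so alignment and normalization shifts are multiples of r bits and the shifter has only 1/r as many distinct shift amounts. A small Python sketch of the representation (illustrative only; the thesis targets hardware, not this software form):

```python
from math import frexp

def normalize(x, r):
    """Split positive x as x == m * (2**r)**e with m in [1, 2**r).
    r=1 gives ordinary binary (IEEE-style) normalization; r=4 gives
    radix-16.  The wider mantissa range [1, 2**r) is the price paid
    for the cheaper shifter."""
    frac, e2 = frexp(x)          # x == frac * 2**e2, frac in [0.5, 1)
    e, s = divmod(e2 - 1, r)     # e2 - 1 = r*e + s, with 0 <= s < r
    return frac * 2 ** (s + 1), e

# radix-16: 96 = 6 * 16**1, so the mantissa is 6 and the exponent 1.
print(normalize(96.0, 4))
print(normalize(1.0, 1))   # binary normalization of 1.0
```

In an adder, two radix-16 operands whose exponents match need no alignment shift at all even when their binary magnitudes differ by up to three bit positions, which is where the reported addition-area reduction comes from.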
|
4 |
Designs, Implementations and Applications of Floating-Point Trigonometric Function Units. Lee, Hsin-mau. 02 September 2008.
In addition to the previous pipelined floating-point CORDIC design, three different architectures supporting both the CORDIC rotation mode and the vectoring mode are proposed in this thesis. These architectures are analyzed and compared in detail in order to choose the one with the minimum area cost and computation latency for a required bit accuracy. Based on the comparison, we chose the best architecture and implemented an IEEE single-precision floating-point CORDIC processor. A mathematical analysis of the computation errors was performed to minimize the bit widths of the constituent arithmetic components during implementation. The comparison results for the different architectures also serve as a general guideline for the design of floating-point sine/cosine units. Finally, we study the application of the floating-point CORDIC to 3D graphics acceleration.
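The two modes the abstract mentions share one shift-and-add iteration: rotation mode drives the angle accumulator z to zero (computing sine/cosine), while vectoring mode drives y to zero (computing magnitude and arctangent). A minimal software model of the iteration, with shifts written as multiplication by 2**-i (a hardware pipeline would unroll the n stages):

```python
from math import atan, sqrt, prod

def cordic(x, y, z, mode, n=32):
    """Unified circular CORDIC.  'rotation' mode rotates (x, y) by the
    initial angle z; 'vectoring' mode rotates (x, y) onto the x-axis,
    accumulating the swept angle into z."""
    for i in range(n):
        d = (1 if z >= 0 else -1) if mode == "rotation" else (1 if y < 0 else -1)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atan(2.0 ** -i)
    return x, y, z

# Every iteration stretches the vector by sqrt(1 + 2**-2i); the total
# gain K is constant, so it is divided out once up front.
K = prod(sqrt(1 + 4.0 ** -i) for i in range(32))

# Rotation mode starting from (1/K, 0) yields (cos z, sin z).
c, s, _ = cordic(1.0 / K, 0.0, 0.5, "rotation")
```

A floating-point CORDIC unit like those in the thesis adds exponent handling and argument-range reduction around this fixed-point core; the error analysis referred to above bounds how the n stage roundings accumulate.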
|
5 |
Automatic synthesis and optimization of floating-point hardware. January 2003.
Ho Chun Hok. Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 74-78). Abstracts in English and Chinese.
Contents:
1. Introduction (p.1): Motivation; Aims; Contributions; Thesis Organization
2. Background and Literature Review (p.5): Field Programmable Gate Arrays; Traditional design flow and VHDL; Single Description for Hardware-Software Systems; Parameterized Floating Point Arithmetic Implementation; Function Approximations by Table Lookup and Addition
3. Floating Point Arithmetic (p.11): Floating Point Number Representation; Rounding Error; Addition and Subtraction; Multiplication
4. Fly - Hardware Compiler (p.18): The Fly Programming Language; Implementation details (compilation technique; statements; assignment; conditional branch; while; parallel statement); Development Environment (from Fly to bitstream; host interface)
5. Float - Floating Point Design Environment (p.27): Floating Point Tools (Float class; optimization); Digital Sine-Cosine Generator; VHDL Floating Point operator generator (multiplier and adder modules); Application to Solving Differential Equations
6. Function Approximation using Lookup Table (p.42): Taylor Expansion; Symmetric Bipartite Table Method (SBTM); Symmetric Table Addition Method (STAM); Input Range Scaling; VHDL Extension; Floating Point Extension; The N-body Problem; Implementation
7. Results (p.58): GCD coprocessor; Floating Point Module Library; Digital sine-cosine generator (DSCG); Optimization; Ordinary Differential Equation (ODE); N-body Problem Simulation
8. Conclusion (p.66): Future Work
Appendices: A. Fly Formal Grammar (p.70); B. Original Fly Source Code (p.71); Bibliography (p.74)
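The bipartite table methods listed in chapter 6 replace one huge lookup table with two small ones plus an adder: the input is split into fields x0|x1|x2, one table stores f sampled on the (x0, x1) fields, and the other stores a first-order Taylor correction in x2 using one slope per coarse x0 segment. A simplified Python sketch of the idea (the exact SBTM/STAM table constructions in the thesis differ in detail):

```python
def bipartite_tables(f, df, n0, n1, n2):
    """Build two tables approximating f on [1, 2).  Table a depends on
    (x0, x1) and samples f at the midpoint of the x2 field; table b
    depends on (x0, x2) and holds the Taylor correction
    (x2 - mid) * f'(segment) - two tables of 2**(n0+n1) and 2**(n0+n2)
    entries instead of one of 2**(n0+n1+n2)."""
    n = n0 + n1 + n2
    a, b = {}, {}
    mid = (2 ** n2 - 1) / 2                  # midpoint of the x2 field
    for x0 in range(2 ** n0):
        for x1 in range(2 ** n1):
            seg = 1 + (x0 * 2 ** (n1 + n2) + x1 * 2 ** n2 + mid) / 2 ** n
            a[x0, x1] = f(seg)
        slope = df(1 + x0 / 2 ** n0)         # one slope per coarse segment
        for x2 in range(2 ** n2):
            b[x0, x2] = (x2 - mid) / 2 ** n * slope
    return a, b

def bipartite_eval(a, b, x, n0, n1, n2):
    """Approximate f(1 + x / 2**(n0+n1+n2)): two lookups and one add."""
    x0 = x >> (n1 + n2)
    x1 = (x >> n2) & (2 ** n1 - 1)
    x2 = x & (2 ** n2 - 1)
    return a[x0, x1] + b[x0, x2]
```

For example, approximating the reciprocal with a 12-bit input split 4/4/4 keeps the worst-case error well below one part in a thousand while shrinking the total table size from 4096 entries to 512.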
|
6 |
Implementation of Pipeline Floating-Point CORDIC Processor and its Error Analysis and Applications. Yang, Chih-yu. 19 August 2007.
In this thesis, the traditional fixed-point CORDIC algorithm is extended to a floating-point version in order to calculate transcendental functions (such as sine/cosine, logarithm, and powering functions) with high accuracy over a large range. Based on different algorithm derivations, two different high-throughput pipelined floating-point CORDIC architectures are proposed. The first architecture adopts barrel shifters to implement the shift operations in each pipeline stage; the second uses purely hardwired shifting. Another key contribution of this thesis is an analysis of the execution errors of the floating-point CORDIC architectures and a comparison with results computed by pure software implementations. Finally, the thesis applies the floating-point CORDIC to the rotation-related operations required in 3D graphics applications.
|
7 |
Customization of floating-point units for embedded systems and field programmable gate arrays. Chong, Michael Yee Jern. Computer Science & Engineering, Faculty of Engineering, UNSW. January 2009.
While Application-Specific Instruction Set Processors (ASIPs) have allowed designers to create processors with custom instructions targeting specific applications, floating-point units (FPUs) are still instantiated as non-customizable general-purpose units which, if underutilized, waste area and performance. However, customizing FPUs manually is a complex and time-consuming process, so there is a need for an automated custom-FPU generation scheme. This thesis presents a methodology for generating application-specific FPUs customized at the instruction level, with integrated datapath merging to minimize area. The methodology first reduces the set of implemented floating-point instructions to the minimum required by the application; datapath merging is then performed on the remaining datapaths to minimize area. Previous datapath merging techniques failed to consider merging components of different bit-widths and thus ignored the bit-alignment problem; this thesis presents a novel bit-alignment solution for datapath merging. In creating the custom FPU, the subset of floating-point instructions implemented in hardware has to be determined. Implementing more instructions in hardware reduces the application's cycle count, but may increase delay because of multiplexers inserted on the critical path during datapath merging. A rapid design-space exploration was performed to explore these trade-offs. Through this exploration, a designer can determine how many instructions should be implemented in the custom FPU and how many should be left to software emulation, such that performance and area meet the designer's requirements. Customized FPUs were generated for different Mediabench applications and compared to a fully featured reference FPU implementing all floating-point operations. Reducing the floating-point instruction set reduced FPU area by an average of 55%.
Performing instruction reduction and then datapath merging reduced FPU area by an average of 68%. Experiments showed that datapath merging without bit-alignment achieved an average area reduction of 10.1%; with bit-alignment, 16.5% was achieved. Bit-alignment proved most beneficial when the datapaths contained a diverse mix of bit-widths. The performance of Field-Programmable Gate Arrays (FPGAs) on floating-point applications is poor because of the complexity of floating-point arithmetic, and implementing floating-point units in FPGA fabric consumes a large amount of resources. There is therefore a case for embedded FPUs in FPGAs; however, when unutilized they waste area on the FPGA die. To overcome this issue, this thesis presents a novel, flexible, multi-mode embedded FPU for FPGAs that can be configured to perform a wide range of operations. The floating-point adder and multiplier in the embedded FPU can each be configured to perform either one double-precision operation or two single-precision operations in parallel. To increase flexibility further, access is provided to the FPU's large integer multiplier, adder, and shifters, and the unit is also capable of floating-point and integer multiply-add operations. Benchmark circuits were implemented both on a standard Xilinx Virtex-II FPGA and on the FPGA with embedded FPU blocks. The implementations on the FPGA with embedded FPUs showed mean area and delay improvements of 5.2x and 5.8x respectively for the double-precision benchmarks, and 4.4x and 4.2x for the single-precision benchmarks.
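The hardware/software split explored above boils down to a budgeted selection problem: each candidate instruction saves (software cycles - hardware cycles) x dynamic count if implemented, at some area cost. A greedy sketch of that exploration, with all numbers purely illustrative (not taken from the thesis), and ignoring the inter-datapath sharing that merging would add:

```python
def select_fpu_subset(profile, area_budget):
    """Pick the floating-point instructions to implement in hardware:
    rank candidates by cycles saved per unit area and take them
    greedily until the area budget is spent.  Instructions left out
    fall back to software emulation."""
    ranked = sorted(
        profile,
        key=lambda op: op["count"] * (op["sw_cycles"] - op["hw_cycles"]) / op["area"],
        reverse=True,
    )
    chosen, used = [], 0
    for op in ranked:
        if used + op["area"] <= area_budget:
            chosen.append(op["name"])
            used += op["area"]
    return chosen

# Hypothetical profile of one application (counts, cycles, and areas
# are made up for illustration).
profile = [
    {"name": "fadd",  "count": 10000, "sw_cycles": 60,  "hw_cycles": 4,  "area": 300},
    {"name": "fmul",  "count": 8000,  "sw_cycles": 80,  "hw_cycles": 5,  "area": 500},
    {"name": "fdiv",  "count": 200,   "sw_cycles": 400, "hw_cycles": 20, "area": 900},
    {"name": "fsqrt", "count": 50,    "sw_cycles": 500, "hw_cycles": 30, "area": 800},
]
chosen = select_fpu_subset(profile, area_budget=1000)
print(chosen)
```

With these numbers the rare divide and square-root stay in software, mirroring the thesis's observation that trimming little-used instructions recovers most of the area. The actual exploration in the thesis also models the multiplexer delay that merging inserts, which a simple ratio ranking cannot capture.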
|
9 |
Fused floating-point arithmetic for DSP. Saleh, Hani Hasan Mustafa.
Thesis (Ph.D.), University of Texas at Austin, 2009. Title from PDF title page (University of Texas Digital Repository, viewed on Sept. 9, 2009). Vita. Includes bibliographical references.
|
10 |
Higher radix floating-point representations for FPGA-based arithmetic. Catanzaro, Bryan C. January 2005. (PDF)
Thesis (M.S.), Brigham Young University, Dept. of Electrical and Computer Engineering, 2005. Includes bibliographical references (p. 81-86).
|