1 |
Automatic synthesis and optimization of floating point hardware. January 2003
Ho Chun Hok. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 74-78). / Abstracts in English and Chinese. / Abstract / Acknowledgement / Chapter 1 --- Introduction / Chapter 1.1 --- Motivation / Chapter 1.2 --- Aims / Chapter 1.3 --- Contributions / Chapter 1.4 --- Thesis Organization / Chapter 2 --- Background and Literature Review / Chapter 2.1 --- Introduction / Chapter 2.2 --- Field Programmable Gate Arrays / Chapter 2.3 --- Traditional design flow and VHDL / Chapter 2.4 --- Single Description for Hardware-Software Systems / Chapter 2.5 --- Parameterized Floating Point Arithmetic Implementation / Chapter 2.6 --- Function Approximations by Table Lookup and Addition / Chapter 2.7 --- Summary / Chapter 3 --- Floating Point Arithmetic / Chapter 3.1 --- Introduction / Chapter 3.2 --- Floating Point Number Representation / Chapter 3.3 --- Rounding Error / Chapter 3.4 --- Floating Point Number Arithmetic / Chapter 3.4.1 --- Addition and Subtraction / Chapter 3.4.2 --- Multiplication / Chapter 3.5 --- Summary / Chapter 4 --- FLY - Hardware Compiler / Chapter 4.1 --- Introduction / Chapter 4.2 --- The Fly Programming Language / Chapter 4.3 --- Implementation details / Chapter 4.3.1 --- Compilation Technique / Chapter 4.3.2 --- Statement / Chapter 4.3.3 --- Assignment / Chapter 4.3.4 --- Conditional Branch / Chapter 4.3.5 --- While / Chapter 4.3.6 --- Parallel Statement / Chapter 4.4 --- Development Environment / Chapter 4.4.1 --- From Fly to Bitstream / Chapter 4.4.2 --- Host Interface / Chapter 4.5 --- Summary / Chapter 5 --- Float - Floating Point Design Environment / Chapter 5.1 --- Introduction / Chapter 5.2 --- Floating Point Tools / Chapter 5.2.1 --- Float Class / Chapter 5.2.2 --- Optimization / Chapter 5.3 --- Digital Sine-Cosine Generator / Chapter 5.4 --- VHDL Floating Point operator generator / Chapter 5.4.1 --- Floating Point Multiplier Module / Chapter 5.4.2 --- Floating Point Adder Module / Chapter 5.5 --- Application to Solving Differential Equations / Chapter 5.6 --- Summary / Chapter 6 --- Function Approximation using Lookup Table / Chapter 6.1 --- Table Lookup Approximations / Chapter 6.1.1 --- Taylor Expansion / Chapter 6.1.2 --- Symmetric Bipartite Table Method (SBTM) / Chapter 6.1.3 --- Symmetric Table Addition Method (STAM) / Chapter 6.1.4 --- Input Range Scaling / Chapter 6.2 --- VHDL Extension / Chapter 6.3 --- Floating Point Extension / Chapter 6.4 --- The N-body Problem / Chapter 6.5 --- Implementation / Chapter 6.6 --- Summary / Chapter 7 --- Results / Chapter 7.1 --- Introduction / Chapter 7.2 --- GCD coprocessor / Chapter 7.3 --- Floating Point Module Library / Chapter 7.4 --- Digital sine-cosine generator (DSCG) / Chapter 7.5 --- Optimization / Chapter 7.6 --- Ordinary Differential Equation (ODE) / Chapter 7.7 --- N Body Problem Simulation (Nbody) / Chapter 7.8 --- Summary / Chapter 8 --- Conclusion / Chapter 8.1 --- Future Work / Chapter A --- Fly Formal Grammar / Chapter B --- Original Fly Source Code / Bibliography
|
2 |
Fused floating-point arithmetic for DSP / Saleh, Hani Hasan Mustafa, January 1900
Thesis (Ph. D.)--University of Texas at Austin, 2009. / Title from PDF title page (University of Texas Digital Repository, viewed on Sept. 9, 2009). Vita. Includes bibliographical references.
|
3 |
Higher radix floating-point representations for FPGA-based arithmetic / Catanzaro, Bryan C. January 2005
Thesis (M.S.)--Brigham Young University. Dept. of Electrical and Computer Engineering, 2005. / Includes bibliographical references (p. 81-86).
|
4 |
Fused floating-point arithmetic for DSP / Saleh, Hani Hasan Mustafa, 1970-, 16 October 2012
Floating-point arithmetic is attractive for the implementation of a variety of Digital Signal Processing (DSP) applications because it allows the designer and user to concentrate on the algorithms and architecture without worrying about numerical issues. In the past, many DSP applications used fixed-point arithmetic due to the high cost (in delay, silicon area, and power consumption) of floating-point arithmetic units. In the realization of modern general-purpose processors, fused floating-point multiply-add units have become attractive since their delay and silicon area are often less than those of a discrete floating-point multiplier followed by a floating-point adder. Further, the accuracy is improved by the fused implementation since rounding is performed only once (after the multiplication and addition). This work extends the consideration of fused floating-point arithmetic to operations that are frequently encountered in DSP. The Fast Fourier Transform is a case in point since it uses a complex butterfly operation. For a radix-2 implementation, the butterfly consists of a complex multiply and the complex addition and subtraction of the same pair of data. For a radix-4 implementation, the butterfly consists of three complex multiplications and eight complex additions and subtractions. Both of these butterfly operations can be implemented with two fused primitives: a fused two-term dot-product unit and a fused add-subtract unit. The fused two-term dot-product unit multiplies two sets of operands and adds the products as a single operation. The two products do not need to be rounded (only the sum is normalized and rounded), which reduces the delay by about 15% while reducing the silicon area by about 33%. For the add-subtract unit, much of the complexity of a discrete implementation comes from the need to compare the operand exponents and align the significands prior to the add and the subtract operations. For the fused implementation, sharing the comparison and alignment greatly reduces the complexity. The delay and the arithmetic results are the same as if the operations were performed in the conventional manner with a floating-point adder and a separate floating-point subtracter. In this case, the fused implementation is about 20% smaller than the discrete equivalent.
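The single-rounding property of the fused two-term dot product can be illustrated in software. The sketch below is not taken from the thesis: the function names `discrete_dot2` and `fused_dot2` are hypothetical, and exact rational arithmetic in double precision stands in for the wide internal datapath of a hardware unit. It only shows why rounding once can give a different, more accurate result than rounding each product and then the sum.

```python
from fractions import Fraction


def discrete_dot2(a, b, c, d):
    """a*b + c*d with three roundings: each product, then the sum."""
    return a * b + c * d


def fused_dot2(a, b, c, d):
    """a*b + c*d computed exactly, then rounded once to a double.

    Fraction arithmetic is exact, and float(Fraction) rounds to nearest,
    so this models a unit that rounds only the final sum.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)


# a*b equals 1 - 2**-60 exactly, which rounds to 1.0 as a double.
a, b = 1 + 2.0**-30, 1 - 2.0**-30
c, d = -1.0, 1.0

print(discrete_dot2(a, b, c, d))  # the rounded product cancels against -1.0
print(fused_dot2(a, b, c, d))     # the small exact remainder survives
```

In a radix-2 butterfly, each real and imaginary part of the complex product has this two-term form, which is why the dot-product primitive fits that operation.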
|
5 |
Fused floating-point arithmetic for application specific processors / Min, Jae Hong, 25 February 2014
Floating-point arithmetic units are used in modern computers for 2D/3D graphics and scientific applications because they offer a wider dynamic range than a fixed-point number system with the same word length. However, a floating-point arithmetic unit has larger area, power consumption, and latency than a fixed-point arithmetic unit. This has become a significant issue in modern low-power processors because of their limited power and performance margins. Therefore, fused architectures have been developed to improve floating-point operations. This dissertation introduces new, improved fused architectures for add-subtract, sum-of-squares, and magnitude operations for graphics, scientific, and signal processing.
A low-power dual-path fused floating-point add-subtract unit is introduced and compared with previous fused add-subtract units such as the single-path and the high-speed dual-path fused add-subtract units. The high-speed dual-path fused add-subtract unit has lower latency than the single-path unit at the cost of high power consumption. To reduce the power consumption, an alternative dual-path architecture is applied to the fused add-subtract unit. The significand addition, subtraction, and rounding are performed after the far/close path. The power consumption of the proposed design is lower than that of the high-speed dual-path fused add-subtract unit at a cost in latency; however, the proposed fused unit is faster than the single-path fused unit.
High-performance and low-power floating-point fused architectures for a two-term sum-of-squares computation are introduced and compared with discrete units. The fused architectures include pre/post-alignment, partial carry-sum width, and enhanced rounding. The fused floating-point sum-of-squares units with post-alignment, a 26-bit partial carry-sum width, and the enhanced rounding system have lower power consumption, area, and latency than discrete parallel dot-product and sum-of-squares units. Hardware tradeoffs between the fused designs are presented in terms of power consumption, area, and latency. For example, the enhanced rounding processing reduces latency at a moderate cost in power consumption and area.
A new type of fused architecture for magnitude computation is proposed, with lower power consumption, area, and latency than conventional discrete floating-point units. Compared with the discrete parallel magnitude unit realized with conventional floating-point squarers, an adder, and a square-root unit, the fused floating-point magnitude unit has less area, latency, and power consumption. The new design includes new designs for enhanced exponent, compound add/round, and normalization units. In addition, a pipelined structure for the fused magnitude unit is shown.
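As a rough software analogue of the two-term sum-of-squares operation, the sketch below compares a discrete evaluation (each square rounded, then the sum rounded) with an evaluation that rounds only once. It does not model the dissertation's alignment, carry-sum, or rounding hardware; the function names are hypothetical and exact rational arithmetic is an assumption used to imitate a single final rounding.

```python
from fractions import Fraction


def discrete_sum_of_squares(a, b):
    """a*a + b*b with a rounding after each square and after the sum."""
    return a * a + b * b


def fused_sum_of_squares(a, b):
    """a*a + b*b computed exactly, then rounded once to a double."""
    return float(Fraction(a) ** 2 + Fraction(b) ** 2)


# a*a = 2**54 + 2**28 + 1 is not representable: doubles near 2**54 are
# spaced 4 apart, so the trailing +1 is lost when the square is rounded.
a, b = 2.0**27 + 1, 1.25

print(int(discrete_sum_of_squares(a, b)) - (2**54 + 2**28))  # 0
print(int(fused_sum_of_squares(a, b)) - (2**54 + 2**28))     # 4
```

The exact sum exceeds 2**54 + 2**28 by 2.5625, so the nearest double is 4 above that value; the discrete version discards the low-order parts one at a time and lands one spacing lower.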
|
6 |
SIMULINK modules that emulate digital controllers realized with fixed-point or floating-point arithmetic / Robe, Edward D. January 1994
Thesis (M.S.)--Ohio University, June, 1994. / Title from PDF t.p.
|
7 |
Voronoi diagrams: robust and efficient implementation / Patel, Nirav B. January 2005
Thesis (M.S.)--State University of New York at Binghamton, Department of Computer Science, 2005. / Includes bibliographical references.
|
8 |
Simulink™ modules that emulate digital controllers realized with fixed-point or floating-point arithmetic / Robe, Edward D. January 1994
No description available.
|
9 |
Quality Evaluation in Fixed-point Systems with Selective Simulation / Evaluation de la qualité des systèmes en virgule fixe avec la simulation sélective / Nehmeh, Riham, 13 June 2017
Time-to-market and implementation cost are high-priority considerations in the automation of digital hardware design. Digital signal processing applications mostly use fixed-point arithmetic because of its lower implementation cost, so a floating-point to fixed-point conversion is required. The conversion process consists of two parts: determining the integer part word-length and determining the fractional part word-length. The refinement of fixed-point systems requires optimizing data word-lengths to prevent overflows and excessive quantization noise while minimizing implementation cost. Applications in the image and signal processing domains are tolerant to errors if their probability or their amplitude is small enough. Numerous research works focus on optimizing the fractional part word-length under an accuracy constraint. Reducing the number of bits in the fractional part leads to an error that is small compared to the signal amplitude. Perturbation theory can be used to propagate these errors inside the system, except for unsmooth operations, such as decision operations, for which a small error at the input can lead to a large error at the output. Likewise, optimizing the integer part word-length can significantly reduce the cost when the application is tolerant to a low probability of overflow. Overflows lead to errors with high amplitude, and thus their occurrence must be limited. For word-length optimization, the challenge is to evaluate efficiently the effect of overflow and unsmooth errors on the application quality metric. The high amplitude of these errors requires simulation-based approaches to evaluate their effects on quality.
In this thesis, we aim at accelerating the process of quality metric evaluation. We propose a new framework using selective simulations to accelerate the simulation of overflow and unsmooth error effects. This approach can be applied to any C-based digital signal processing application. Compared to complete fixed-point simulation-based approaches, where all the input samples are processed, the proposed approach simulates the application only when an error occurs. Indeed, overflows and unsmooth errors must be rare events to maintain the system functionality. Consequently, selective simulation significantly reduces the time required to evaluate the application quality metric. Moreover, we focus on optimizing the integer part word-length, which can significantly decrease the implementation cost when a slight degradation of the application quality is acceptable. Indeed, many applications are tolerant to overflows if the probability of overflow occurrence is low enough. Thus, we exploit the proposed framework in a new integer word-length optimization algorithm. The combination of the optimization algorithm and the selective simulation technique significantly decreases the optimization time.
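The roles of the integer and fractional word-lengths, and the idea of simulating only the samples on which an overflow occurs, can be sketched with a toy model. This is an illustration under stated assumptions, not the thesis framework (which targets C applications): the names `to_fixed` and `overflow_mse_selective` are hypothetical, the integer word-length is assumed to include the sign bit, overflow is handled by saturation, and the "system" is a single memoryless quantizer.

```python
def to_fixed(x, iwl, fwl):
    """Quantize x to signed fixed point; return (value, overflowed).

    iwl: integer word-length in bits, sign bit included (assumption).
    fwl: fractional word-length in bits.
    """
    scale = 1 << fwl
    hi = (1 << (iwl + fwl - 1)) - 1      # largest representable code
    lo = -(1 << (iwl + fwl - 1))         # smallest representable code
    code = round(x * scale)              # quantize the fractional part
    overflowed = code < lo or code > hi
    code = min(hi, max(lo, code))        # saturate on overflow
    return code / scale, overflowed


def overflow_mse_selective(samples, iwl, fwl):
    """Mean squared overflow error, simulating only out-of-range samples."""
    lo = -(2.0 ** (iwl - 1))
    hi = 2.0 ** (iwl - 1) - 2.0 ** -fwl
    simulated = 0
    squared_error = 0.0
    for x in samples:
        if lo <= x <= hi:
            continue                     # in range: skip the simulation
        simulated += 1
        y, _ = to_fixed(x, iwl, fwl)
        squared_error += (y - x) ** 2
    return squared_error / len(samples), simulated


samples = [0.5, -1.0, 3.0, 1.0, -2.5]
mse, simulated = overflow_mse_selective(samples, iwl=2, fwl=4)
print(simulated, "of", len(samples), "samples simulated; overflow MSE =", mse)
```

With `iwl=2` and `fwl=4` the representable range is -2.0 to 1.9375, so only two of the five samples are simulated. Adding one integer bit removes both overflows at the cost of a wider word, which is the tradeoff the word-length optimization explores.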
|
10 |
Improved architectures for a fused floating-point add-subtract unit / Sohn, Jongwook, 27 February 2012
This report presents improved architecture designs and implementations for a fused floating-point add-subtract unit. The fused floating-point add-subtract unit is useful for DSP applications such as FFT and DCT butterfly operations. To improve the performance of the fused floating-point add-subtract unit, the dual-path algorithm and pipelining technique are applied. The proposed designs are implemented for both single and double precision and synthesized with a 45 nm standard-cell library. The fused floating-point add-subtract unit saves 40% of the area and power consumption, and the dual-path fused floating-point add-subtract unit reduces the latency by 30% compared to the traditional discrete floating-point add-subtract unit. By combining fused operation and the dual-path design, the proposed floating-point add-subtract unit achieves low area, low power consumption, and high speed. Based on the data flow analysis, the proposed fused floating-point add-subtract unit is split into two pipeline stages. Since the latencies of the two pipeline stages are fairly well balanced, the throughput of the entire logic is increased by 80% compared to the non-pipelined implementation.
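The core idea of a fused add-subtract unit, one exponent comparison and one significand alignment shared by both results, can be mimicked in a few lines. The sketch below is a behavioural toy, not the report's architecture: `decompose` and `fused_add_sub` are hypothetical names, inputs are assumed to be finite doubles, and arbitrary-width integers replace the guard and sticky bits a hardware datapath would use.

```python
import math
from fractions import Fraction


def decompose(x):
    """Return (m, e) with integer m such that x == m * 2**e exactly."""
    frac, exp = math.frexp(x)            # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * 2**53), exp - 53   # scaling by 2**53 is exact


def fused_add_sub(a, b):
    """Return (a + b, a - b) from one shared compare-and-align step."""
    ma, ea = decompose(a)
    mb, eb = decompose(b)
    e = min(ea, eb)                      # single exponent comparison
    ma <<= ea - e                        # single alignment; one shift is zero
    mb <<= eb - e
    scale = Fraction(2) ** e
    # Two cheap integer operations reuse the aligned significands;
    # each result is then rounded once to a double.
    return float((ma + mb) * scale), float((ma - mb) * scale)


for a, b in [(0.1, 0.2), (3.5, -1.25), (1e10, 1e-10), (1.0, 2.0**-60)]:
    print(fused_add_sub(a, b) == (a + b, a - b))
```

Each comparison prints `True`: the pair of results matches a separate addition and subtraction, while the comparison and alignment work is done only once.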
|