Floating-Point vs. Fixed-Point Numbers
• Fixed point has limitations
x = 0000 0000.0000 1001₂
y = 1001 0000.0000 0000₂
Rounding?
Overflow? (x² underflows and y² overflows)
• Floating point: represent numbers in two fixed-width
fields: “magnitude” and “exponent”
Magnitude: more bits = more accuracy
Exponent: more bits = wider range of numbers
| ± | Exponent | Magnitude |
X = ![](./articles_imgs/3516/floati1.gif) |
|
Floating Point Number Representation
• Sign field:
When 0: positive number; when 1: negative
• Exponent:
Usually presented as unsigned by adding an offset
Example: 4 bits of exponent, offset = 8
![](./articles_imgs/3516/floati59.gif)
• Magnitude (also called significand, mantissa)
Shift the number to get: 1.xxxx
Magnitude is the fractional part (hidden ‘1’)
Example: 6 bits of mantissa
o Number = 110.0101, shift: 1.100101, mantissa = 100101
o Number = 0.0001011, shift: 1.011, mantissa = 011000 |
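The encoding steps above can be sketched in a few lines of Python. This is a minimal sketch of the slide's toy format (4-bit exponent with offset 8, 6-bit mantissa, hidden 1); the function name and handling are mine, and mantissa rounding overflow is deliberately ignored.

```python
def encode(x, exp_bits=4, bias=8, man_bits=6):
    """Encode positive x into the toy format: shift to 1.xxxx,
    keep the fraction (hidden '1' dropped), bias the exponent.
    Sketch only: mantissa rounding overflow is not handled."""
    assert x > 0
    e = 0
    while x >= 2:          # normalize down into [1, 2)
        x /= 2
        e += 1
    while x < 1:           # normalize up into [1, 2)
        x *= 2
        e -= 1
    frac = x - 1           # drop the hidden 1
    mantissa = int(round(frac * 2**man_bits))
    return e + bias, mantissa

# Number = 110.0101 (binary) = 6.3125 -> 1.100101 x 2^2
print(encode(6.3125))      # (10, 37): exponent field 2+8, mantissa 100101
# Number = 0.0001011 (binary) = 0.0859375 -> 1.011 x 2^-4
print(encode(0.0859375))   # (4, 24): exponent field -4+8, mantissa 011000
```

Both outputs match the slide's two worked examples (37 = 100101₂, 24 = 011000₂).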
Floating Point Numbers: Example
| ± | Exponent | Magnitude |
![](./articles_imgs/3516/floati2.jpg) |
|
Floating Point Number Range
• Range: [-max, -min] ∪ [min, max]
Min = smallest magnitude ![](./articles_imgs/3516/floati61.gif)
Max = largest magnitude ![](./articles_imgs/3516/floati62.gif)
• What happens if:
We increase # bits for exponent?
Increase # bits for magnitude?
![](./articles_imgs/3516/floati3.jpg)
|
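The range question above can be answered numerically. A minimal sketch for the toy format of the earlier slides (4-bit exponent, offset 8, 6-bit mantissa; no reserved exponent values assumed); the helper name is mine.

```python
def fmin_fmax(exp_bits, bias, man_bits):
    """Smallest and largest positive magnitudes of the toy format
    (all exponent fields usable, hidden-1 normalization assumed)."""
    e_min = 0 - bias                              # smallest biased field = 0
    e_max = (2**exp_bits - 1) - bias              # largest biased field
    smallest = 1.0 * 2.0**e_min                   # magnitude 1.000000
    largest = (2 - 2.0**-man_bits) * 2.0**e_max   # magnitude 1.111111
    return smallest, largest

lo, hi = fmin_fmax(4, 8, 6)
print(lo, hi)                      # 0.00390625 254.0
print(fmin_fmax(5, 16, 6)[1])      # extra exponent bit: much wider range
```

Adding an exponent bit widens the range; adding a mantissa bit does not extend the range but halves the spacing between representable magnitudes.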
Floating Point Operations
• Addition/subtraction, multiplication/division, function evaluations, ...
• Basic operations
Adding exponents / magnitudes
Multiplying magnitudes
Aligning magnitudes (shifting, adjusting the exponent)
Rounding
Checking for overflow/underflow
Normalization (shifting, adjusting the exponent) |
Floating Point Addition
• More difficult than multiplication!
• Operations:
Align magnitudes (so that exponents are equal)
Add (and round)
Normalize (result in the form 1.xxx)
![](./articles_imgs/3516/floati4.jpg)
|
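The align–add–normalize sequence above can be sketched as follows. This is a simplified model, assuming unsigned (exponent, significand) pairs with the significand in [1, 2) and a 6-bit mantissa; the function name is mine, and subtraction/special cases are omitted.

```python
def fp_add(a, b, man_bits=6):
    """Add two positive floating-point values given as (e, sig) pairs
    with sig in [1, 2): align, add, normalize, round. Sketch only."""
    (ea, sa), (eb, sb) = a, b
    if ea < eb:                      # make `a` the larger-exponent operand
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    sb = sb / 2**(ea - eb)           # align: shift smaller operand right
    s = sa + sb                      # add magnitudes
    e = ea
    while s >= 2:                    # normalize back into [1, 2)
        s /= 2
        e += 1
    s = round(s * 2**man_bits) / 2**man_bits   # round to mantissa width
    return e, s

# 1.5 x 2^2 + 1.25 x 2^0 = 6 + 1.25 = 7.25 = 1.8125 x 2^2
print(fp_add((2, 1.5), (0, 1.25)))   # (2, 1.8125)
```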
Floating Point Adder Architecture
![](./articles_imgs/3516/floati5.jpg) |
Floating Point Adder Components
• Unpacking
Inserting the “hidden 1”
Checking for special inputs (NaN, zero)
• Exponent difference
Used in aligning the magnitudes
A few bits are enough for the subtraction
o With a 32-bit magnitude adder and 8 bits of exponent, only 5 bits are involved in the subtraction
If the difference is negative, swap the operands and use the positive difference
o How to compute the positive diff?
• Pre-shifting and swap
Shift/complement provided for one operand only
Swap if needed |
• Rounding
Three extra bits used for rounding
• Post-shifting
Result in the range (-4, 4) ![](./articles_imgs/3516/floati6.gif)
Right shift: 1 bit max
o If right shift: increment the exponent by 1
Left shift: up to # of bits in magnitude
o Determine # of consecutive 0’s (1’s) in z, beginning with z₁
Adjust exponent accordingly
• Packing
Check for special results (zero, under-/overflow)
Remove the hidden 1 |
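The “three extra bits” kept for rounding are commonly called guard, round, and sticky bits; the slide does not name a rounding mode, so this sketch assumes round-to-nearest-even, and the function name is mine.

```python
def round_nearest_even(sig_ext):
    """sig_ext: integer significand with 3 extra low-order bits
    (guard, round, sticky). Returns the rounded significand; a carry
    out of the top bit would need renormalizing (not handled here)."""
    kept, grs = sig_ext >> 3, sig_ext & 0b111
    # round up if the discarded part is > 1/2, or == 1/2 and kept is odd
    if grs > 0b100 or (grs == 0b100 and kept & 1):
        kept += 1
    return kept

# 1100101|101 -> discarded bits 101 are more than half: round up
print(bin(round_nearest_even(0b1100101101)))   # 0b1100110
```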
Counting vs. Predicting Leading Zeros/Ones
![](./articles_imgs/3516/floati8.jpg) |
![](./articles_imgs/3516/floati9.jpg) |
Counting: simpler, but on the critical path |
Predicting: more complex architecture |
|
Floating Point Multiplication
• Simpler than floating-point addition
• Operation:
Inputs: ![](./articles_imgs/3516/floati10.gif)
Output = ![](./articles_imgs/3516/floati11.gif)
Sign: XOR of the input signs
Exponent:
o Tentatively computed as e₁+e₂
o Subtract the bias (=127). HOW?
o Adjusted after normalization
Magnitude:
o Result in the range [1, 4) (inputs in the range [1, 2))
o Normalization: 1- or 2-bit right shift, depending on rounding
o Result is 2(1+m) bits; should be rounded to (1+m) bits
o Rounding can discard bits gradually, instead of in one last stage |
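The steps above can be sketched directly. A minimal model assuming (sign, biased exponent, significand) triples with the significand in [1, 2) and the bias of 127 from the slide; the function name is mine, and the possible carry-out from rounding (the “2-bit shift” case) is not handled.

```python
BIAS = 127

def fp_mul(a, b, man_bits=6):
    """Multiply two floats given as (sign, biased_exp, sig), sig in [1, 2)."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb                      # Sign: XOR
    e = ea + eb - BIAS                  # both inputs carry the bias: remove one
    m = ma * mb                         # product lands in [1, 4)
    if m >= 2:                          # normalize: at most a 1-bit right shift
        m /= 2
        e += 1
    m = round(m * 2**man_bits) / 2**man_bits   # round to mantissa width
    return sign, e, m

# 1.5 x 1.5 = 2.25 -> normalized 1.125 x 2^1
print(fp_mul((0, 127, 1.5), (0, 127, 1.5)))  # (0, 128, 1.125)
```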
Floating Point Multiplier Architecture
Note: pipelining is used in the magnitude multiplier, as well as at block boundaries |
![](./articles_imgs/3516/floati12.jpg) |
|
Square-Rooting
• Most important elementary function
• In the IEEE standard, specified as a basic operation (alongside +, -, ×, ÷)
• Very similar to division
• Pencil-and-paper method:
Radicand: ![](./articles_imgs/3516/floati13.gif)
Square root: ![](./articles_imgs/3516/floati14.gif)
Remainder: ![](./articles_imgs/3516/floati15.jpg) |
Square Rooting: Example
• Example: sqrt(95241), with digits grouped in pairs: 9 52 41
![](./articles_imgs/3516/floati16.jpg)
|
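The pencil-and-paper method of the example can be sketched as a short digit-by-digit loop. This is my own rendering of the procedure; the function name is mine.

```python
def paper_sqrt(pairs):
    """Decimal digit-by-digit square root; `pairs` lists the radicand's
    two-digit groups, e.g. [9, 52, 41] for 95241."""
    root, rem = 0, 0
    for p in pairs:
        rem = rem * 100 + p                    # bring down the next pair
        d = 9
        while (20 * root + d) * d > rem:       # doubled partial root + digit
            d -= 1
        rem -= (20 * root + d) * d
        root = 10 * root + d
    return root, rem

print(paper_sqrt([9, 52, 41]))   # (308, 377): 308^2 + 377 = 95241
```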
• Why double the partial root?
The partial root after step 2 is q = (q₂q₁)ten
Appending the next digit q₀ gives the value 10q + q₀
Its square is 100q² + 20q·q₀ + q₀²
The 100q² term has already been subtracted in earlier steps
Find q₀ such that (20q + q₀)·q₀ is the max number ≤ partial remainder
• The binary case:
The square of 2q + q₀ is 4q² + 4q·q₀ + q₀²
Find q₀ such that (4q + q₀)·q₀ is ≤ partial remainder
For q₀ = 1 the expression becomes 4q + 1
(i.e., append “01” to the partial root) |
Square Rooting: Example Base 2
• Example: sqrt(01110110₂) = sqrt(118)
![](./articles_imgs/3516/floati27.jpg)
|
Sequential Shift/Subtract Square Rooter
Architecture
![](./articles_imgs/3516/floati28.jpg) |
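The sequential shift/subtract (restoring) square rooter can be sketched in software. A minimal integer version for the sqrt(118) example above, with the “append 01” trial subtraction from the earlier derivation; the function name is mine.

```python
def restoring_sqrt(z, bits):
    """Binary restoring square root: one result bit per step.
    Trial-subtract the partial root with '01' appended; restore
    (keep the old remainder) when the trial would go negative."""
    q, rem = 0, 0
    for i in range(bits // 2):
        shift = bits - 2 * (i + 1)
        rem = (rem << 2) | ((z >> shift) & 0b11)  # bring down 2 radicand bits
        trial = (q << 2) | 1                      # append '01' to partial root
        if trial <= rem:
            rem -= trial
            q = (q << 1) | 1
        else:
            q = q << 1                            # restore
    return q, rem

print(restoring_sqrt(0b01110110, 8))  # (10, 18): 10^2 + 18 = 118
```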
Other Methods for Square Rooting
• Restoring vs. non-restoring
We looked at the restoring algorithm (after a subtraction, restore the partial remainder if the result is negative)
Non-restoring: use a different digit encoding ({-1, 1} instead of {0, 1}) to avoid restoring
• High-radix
Similar to modified-Booth-encoding multiplication: handle more bits at a time
More complex circuit, but faster |
• Convergence methods
Use the Newton method to approximate the function
x approximates √z, OR
x approximates 1/√z; multiply by z to get √z
Iteratively improve the accuracy
Can use a lookup table for the first iteration |
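The 1/√z variant can be sketched as follows. Newton's method on f(x) = 1/x² − z gives the standard division-free recurrence x ← x·(3 − z·x²)/2, and one final multiply by z yields √z; the function name, seed choice, and iteration count are my assumptions (in hardware the seed would come from a lookup table).

```python
def sqrt_via_recip(z, x=None, iters=8):
    """Converge to 1/sqrt(z) with no division inside the loop,
    then multiply by z once to obtain sqrt(z)."""
    if x is None:
        x = 1.0 / z if z > 1 else 1.0   # crude seed (a table in hardware)
    for _ in range(iters):
        x = x * (3 - z * x * x) / 2     # Newton step for 1/sqrt(z)
    return z * x                        # z * (1/sqrt(z)) = sqrt(z)

print(abs(sqrt_via_recip(2.4) - 2.4**0.5) < 1e-9)   # True
```

Avoiding division in the loop is the main attraction of this variant for hardware.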
Square Rooting: Abstract Notation
![](./articles_imgs/3516/floati34.jpg)
Floating point format:
- Shift left (not right)
- Powers of 2 decreasing |
Restoring Floating-Point Square Root Calc.
![](./articles_imgs/3516/floati35.jpg) |
![](./articles_imgs/3516/floati36.jpg) |
Nonrestoring Floating-Point Square Root Calc.
![](./articles_imgs/3516/floati37.jpg) |
![](./articles_imgs/3516/floati38.jpg)
If the final S is negative, drop the last ‘1’ in q, and restore the remainder to the last positive value. |
Square Root Through Convergence
• Newton-Raphson method:
Choose f(x) = x² - z, whose root is √z
Recurrence: xᵢ₊₁ = (xᵢ + z/xᵢ) / 2
• Example: compute the square root of z = (2.4)₁₀
x₀ = 1.5, read out from a table | accurate to ≈10⁻¹ |
x₁ = (x₀ + 2.4/x₀)/2 = 1.550000 | accurate to ≈10⁻² |
x₂ = (x₁ + 2.4/x₁)/2 ≈ 1.549194 | accurate to ≈10⁻⁴ |
x₃ = (x₂ + 2.4/x₂)/2 ≈ 1.549193 | accurate to ≈10⁻⁸ |
|
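The z = 2.4 example is easy to reproduce, assuming the standard Newton-Raphson recurrence xᵢ₊₁ = (xᵢ + z/xᵢ)/2 and the table seed 1.5; the function name is mine.

```python
def newton_sqrt(z, x0, iters):
    """Newton-Raphson on f(x) = x^2 - z; the error roughly squares
    (number of correct digits roughly doubles) each iteration."""
    x = x0
    for _ in range(iters):
        x = (x + z / x) / 2
    return x

x = newton_sqrt(2.4, 1.5, 3)   # seed 1.5 read out from a small table
print(x)                        # ~1.5491933 (sqrt(2.4))
```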
Non-Restoring Parallel Square Rooter
![](./articles_imgs/3516/floati49.jpg) |
Function Evaluation
• We looked at square root calculation
Direct hardware implementation (binary, BSD, high-radix)
o Serial
o Parallel
Approximation (Newton method)
• What about other functions?
Direct implementation
o Example: can be directly implemented in hardware (using square root as a sub-component)
Polynomial approximation
Table look-up
o Either as part of the calculation or for the full calculation |
Table Lookup
![](./articles_imgs/3516/floati51.jpg) |
![](./articles_imgs/3516/floati52.jpg) |
Direct table-lookup implementation |
Table-lookup with pre- and post-processing |
|
Linear Interpolation Using Four Subintervals
|
Piecewise Table Lookup
|
Accuracy vs. Lookup Table Size Trade-off
![](./articles_imgs/3516/floati57.jpg) |
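The accuracy-vs-table-size trade-off can be demonstrated numerically. A minimal sketch of table lookup with linear interpolation between entries, using sqrt over [1, 4) as the function; the helper names, interval, and sample counts are my choices. For a smooth function, doubling the number of subintervals roughly quarters the worst-case interpolation error.

```python
import math

def make_table(f, lo, hi, n):
    """Sample f at n+1 evenly spaced points over [lo, hi]."""
    step = (hi - lo) / n
    return [f(lo + i * step) for i in range(n + 1)], lo, step

def lerp(table, x):
    vals, lo, step = table
    i = min(int((x - lo) / step), len(vals) - 2)  # pick the subinterval
    t = (x - lo) / step - i                       # position within it
    return vals[i] + t * (vals[i + 1] - vals[i])  # interpolate linearly

def max_err(table, f, lo, hi, samples=1000):
    return max(abs(lerp(table, lo + (hi - lo) * k / samples) -
                   f(lo + (hi - lo) * k / samples)) for k in range(samples))

# sqrt over [1, 4): compare a 4-entry table against an 8-entry one
e4 = max_err(make_table(math.sqrt, 1, 4, 4), math.sqrt, 1, 4)
e8 = max_err(make_table(math.sqrt, 1, 4, 8), math.sqrt, 1, 4)
print(e4 > 2 * e8)   # True: the larger table is markedly more accurate
```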