Floating-Point vs. Fixed-Point Numbers
• Fixed point has limitations
    x = 0000 0000.0000 1001_{2}
    y = 1001 0000.0000 0000_{2}
    Rounding?
    Overflow? (x^{2} and y^{2} under/overflow)
• Floating point: represent numbers in two fixed-width fields: “magnitude” and “exponent”
    Magnitude: more bits = more accuracy
    Exponent: more bits = wider range of numbers

[Figure: field layout ± | Exponent | Magnitude; X = ± magnitude × 2^{exponent}]


Floating Point Number Representation
• Sign field:
    When 0: positive number; when 1: negative
• Exponent:
    Usually presented as unsigned by adding an offset
    Example: 4 bits of exponent, offset = 8
• Magnitude (also called significand, mantissa)
    Shift the number to get the form: 1.xxxx
    Magnitude field is the fractional part (the leading ‘1’ is hidden)
    Example: 6 bits of mantissa
    o Number = 110.0101, shift: 1.100101, mantissa = 100101
    o Number = 0.0001011, shift: 1.011, mantissa = 011000
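The encoding steps above can be sketched in software. This is a toy model of the slide's format (4-bit exponent with offset 8, 6-bit mantissa, hidden ‘1’); the function name and the use of truncation instead of rounding are illustrative assumptions, not IEEE 754.

```python
def encode(x, exp_bits=4, man_bits=6, bias=8):
    """Encode a nonzero real into (sign, biased exponent, mantissa bits)
    for a toy float format with a hidden '1' (illustrative, not IEEE)."""
    sign = 0 if x >= 0 else 1
    x = abs(x)
    # Normalize: shift until the value is 1.xxxx, tracking the exponent.
    e = 0
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    # Drop the hidden '1' and keep man_bits fraction bits (truncation).
    frac = int((x - 1) * (1 << man_bits))
    return sign, e + bias, frac

# Slide example: 110.0101_2 = 6.3125 -> 1.100101 x 2^2, mantissa 100101
s, e, m = encode(6.3125)
print(s, e, format(m, '06b'))
```

The second slide example works the same way: 0.0001011_{2} = 0.0859375 normalizes to 1.011 × 2^{−4}, giving biased exponent 4 and mantissa 011000.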
Floating Point Numbers: Example

[Figure: worked encoding example using the ± | Exponent | Magnitude fields]


Floating Point Number Range
• Range: [−max, −min] ∪ [min, max]
    min = smallest magnitude
    max = largest magnitude
• What happens if:
    We increase the # of bits for the exponent?
    We increase the # of bits for the magnitude?
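The effect of the bit widths on the range can be sketched as follows. This toy calculation assumes a hidden-‘1’ format and ignores the special exponent codes a real standard reserves; the function name is illustrative.

```python
def fp_range(exp_bits, man_bits, bias):
    """Smallest/largest normalized magnitudes of a toy hidden-'1' format
    (illustrative; ignores IEEE special exponent encodings)."""
    e_min = 0 - bias                     # all-zero exponent field
    e_max = (1 << exp_bits) - 1 - bias   # all-one exponent field
    min_mag = 1.0 * 2.0 ** e_min                       # 1.000...0 x 2^e_min
    max_mag = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max  # 1.111...1 x 2^e_max
    return min_mag, max_mag

# More exponent bits widen [min, max]; more mantissa bits refine accuracy.
print(fp_range(4, 6, 8))   # (0.00390625, 254.0)
```

With 4 exponent bits and offset 8, the representable magnitudes run from 2^{−8} up to (2 − 2^{−6}) × 2^{7} = 254.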

Floating Point Operations
• Addition/subtraction, multiplication/division, function evaluations, ...
• Basic operations:
    Adding exponents / magnitudes
    Multiplying magnitudes
    Aligning magnitudes (shifting, adjusting the exponent)
    Rounding
    Checking for overflow/underflow
    Normalization (shifting, adjusting the exponent)
Floating Point Addition
• More difficult than multiplication!
• Operations:
    Align magnitudes (so that exponents are equal)
    Add (and round)
    Normalize (result in the form 1.xxx)

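The three steps above can be sketched as follows. This is a minimal sketch assuming positive operands, the slide's 6-bit mantissa with the hidden ‘1’ attached, and truncation in place of the rounding step; the function name and interface are illustrative.

```python
MAN_BITS = 6  # fraction bits; significands carry the hidden '1' on top

def fp_add(e1, m1, e2, m2):
    """Add two positive toy floats given as (exponent, significand),
    where significands are (MAN_BITS+1)-bit integers with the hidden
    '1' set. Illustrative: positive inputs only, truncation for rounding."""
    # 1. Align: shift the smaller operand right so exponents match.
    if e1 < e2:
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2 >>= (e1 - e2)      # discarded bits would feed the rounding logic
    # 2. Add.
    m = m1 + m2
    e = e1
    # 3. Normalize: the sum of two values in [1, 2) is < 4,
    #    so at most one right shift is needed.
    if m >> (MAN_BITS + 1):
        m >>= 1
        e += 1
    return e, m

# 1.100101 x 2^2  +  1.011000 x 2^-4  (the two earlier slide examples)
print(fp_add(2, 0b1100101, -4, 0b1011000))
```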
Floating Point Adder Architecture

Floating Point Adder Components
• Unpacking
Inserting the “hidden 1”
Checking for special inputs (NaN, zero)
• Exponent difference
    Used in aligning the magnitudes
    A few bits are enough for the subtraction
    o With a 32-bit magnitude adder and 8 bits of exponent, only 5 bits are
    involved in the subtraction (shifts beyond 32 positions all behave alike)
    If the difference is negative, swap the operands and use the positive diff
    o How to compute the positive diff?
• Preshifting and swap
Shift/complement provided for one operand only
Swap if needed 
• Rounding
Three extra bits used for rounding
• Postshifting
    Result in the range (−4, 4)
    Right shift: 1 bit max
    o If a right shift occurs, the exponent is incremented
    Left shift: up to # of bits in magnitude
    o Determine # of consecutive 0’s (1’s) in z, beginning with z_{1}
    Adjust exponent accordingly
• Packing
Check for special results (zero, under/overflow)
Remove the hidden 1 
Counting vs. Predicting Leading Zeros/Ones


Counting: simpler, but on the critical path
Predicting: more complex architecture
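The counting approach can be sketched as follows; a hardware leading-zero counter is combinational, but the sequential scan below shows the same logic (the function name and bit-numbering convention are illustrative).

```python
def count_leading_zeros(z, width):
    """Count leading zeros of a width-bit value, scanning from the MSB
    (z_1 in the slide's notation) until the first '1'. The post-shifter
    left-shifts by this count and the exponent is decremented by it."""
    count = 0
    for i in range(width - 1, -1, -1):
        if (z >> i) & 1:
            break
        count += 1
    return count

print(count_leading_zeros(0b0001011, 7))   # 3
```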

Floating Point Multiplication
• Simpler than floating-point addition
• Operation:
    Inputs: ±m1 × 2^{e1}, ±m2 × 2^{e2}
    Output = ±(m1 × m2) × 2^{e1+e2}
    Sign: XOR
    Exponent:
    o Tentatively computed as e1 + e2
    o Subtract the bias (=127). HOW?
    o Adjusted after normalization
    Magnitude:
    o Result in the range [1, 4) (inputs in the range [1, 2))
    o Normalization: 1- or 2-bit right shift, depending on rounding
    o Result is 2(1 + m) bits, should be rounded to (1 + m) bits
    o Rounding can discard bits gradually, instead of in one last stage
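The recipe above can be sketched as follows, using a toy 6-bit mantissa together with the slide's bias of 127; the function name, positive-width assumptions, and truncating rounding are illustrative.

```python
BIAS, MAN_BITS = 127, 6

def fp_mul(s1, e1, m1, s2, e2, m2):
    """Multiply two toy floats (sign, biased exponent, significand).
    Significands are (MAN_BITS+1)-bit with the hidden '1'; truncation
    stands in for the slide's rounding step."""
    s = s1 ^ s2                  # sign: XOR
    e = e1 + e2 - BIAS           # tentative exponent; subtract one bias
    m = m1 * m2                  # product in [1, 4), 2*(1+MAN_BITS) bits
    # Normalize: one right shift if the product landed in [2, 4).
    if m >> (2 * MAN_BITS + 1):
        m >>= 1
        e += 1
    m >>= MAN_BITS               # truncate back to (1 + MAN_BITS) bits
    return s, e, m

# 1.5 x 2^0 times 1.5 x 2^0 = 2.25 -> 1.001 x 2^1
print(fp_mul(0, BIAS, 0b1100000, 0, BIAS, 0b1100000))
```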
Floating Point Multiplier Architecture
Note: pipelining is used in the magnitude multiplier, as well as at block boundaries


Square Rooting
• Most important elementary function
• In the IEEE standard, specified as a basic operation (alongside +, −, *, /)
• Very similar to division
• Pencil-and-paper method:
    Radicand: z
    Square root: q = sqrt(z)
    Remainder: s = z − q^{2}
Square Rooting: Example
• Example: sqrt(9 52 41)

• Why double the partial root?
    Partial root after step 2 is: q
    Appending the next digit gives: 10q + q_{0}
    Square of which is: 100q^{2} + (20q + q_{0})q_{0}
    The term already subtracted: 100q^{2}
    Find q_{0} such that (20q + q_{0}) × q_{0} is the
    max number ≤ partial remainder
• The binary case:
    Square of 2q + q_{0} is: 4q^{2} + (4q + q_{0})q_{0}
    Find q_{0} such that (4q + q_{0}) × q_{0} is ≤ partial remainder
    For q_{0} = 1 the expression becomes 4q + 1 (i.e.,
    append “01” to the partial root)
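The binary shift/subtract procedure above can be sketched as follows (restoring variant; the function name and integer interface are illustrative assumptions).

```python
def restoring_sqrt(z, bits):
    """Restoring shift/subtract square root of a 2*bits-bit radicand.
    Per step: bring down two radicand bits, try subtracting 4q + 1
    (the partial root with "01" appended), and set the next root bit
    to 1 only if the remainder stays non-negative."""
    q, r = 0, 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next pair of radicand bits into the remainder.
        r = (r << 2) | ((z >> (2 * i)) & 0b11)
        trial = (q << 2) | 0b01        # append "01" to the partial root
        if r >= trial:
            r -= trial                 # subtraction succeeded: root bit 1
            q = (q << 1) | 1
        else:
            q <<= 1                    # "restore" (r unchanged): root bit 0
    return q, r                        # invariant: q*q + r == z

print(restoring_sqrt(0b01110110, 4))   # sqrt(118): root 10, remainder 18
```

Running it on the slide's base-2 example, sqrt(01110110_{2}) = sqrt(118), yields root 1010_{2} = 10 with remainder 18.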
Square Rooting: Example Base 2
• Example: sqrt((01110110)_{2}) = sqrt(118)

Sequential Shift/Subtract Square Rooter
Architecture

Other Methods for Square Rooting
• Restoring vs. nonrestoring
    We looked at the restoring algorithm
    (after subtraction, restore the partial remainder if the result is negative)
    Nonrestoring: use a different encoding (digits {−1, 1} instead of {0, 1}) to avoid restoring
• High-radix
    Similar to modified Booth encoding in multiplication: take care of more bits at a time
    More complex circuit, but faster
• Convergence methods
    Use the Newton method to approximate the function:
    the iterate x_{i} approximates sqrt(z), OR
    x_{i} approximates 1/sqrt(z); multiply by z to get sqrt(z)
    Iteratively improve the accuracy
    Can use a lookup table for the first iteration
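The second convergence scheme mentioned above, approximating 1/sqrt(z) and multiplying by z, can be sketched as follows. The recurrence shown is the standard division-free Newton step for 1/sqrt(z); the function name, seed, and iteration count are illustrative assumptions.

```python
def newton_rsqrt(z, x0, iterations):
    """Newton iteration converging to 1/sqrt(z): the loop needs only
    multiplications (no division by a variable), which is why this
    variant is attractive in hardware. x0 would come from a small
    lookup table. Multiplying by z at the end recovers sqrt(z)."""
    x = x0
    for _ in range(iterations):
        x = x * (3 - z * x * x) / 2   # x_{i+1} = x_i (3 - z x_i^2) / 2
    return z * x                      # sqrt(z) = z * (1/sqrt(z))

print(newton_rsqrt(2.4, 0.65, 4))
```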
Square Rooting: Abstract Notation
Floating point format:
- Shift left (not right)
- Powers of 2 decreasing
Restoring Floating-Point Square Root Calc.


Nonrestoring Floating-Point Square Root Calc.

If the final S is negative, drop the last ‘1’ in q and restore the remainder to the last positive value.
Square Root Through Convergence
• Newton-Raphson method: x_{i+1} = (x_{i} + z/x_{i}) / 2
    Choose x_{0} = initial estimate
• Example: compute the square root of z = (2.4)_{10}
    x_{0} read out from table = 1.5
    x_{1} = (1.5 + 2.4/1.5)/2 = 1.55
    x_{2} = (1.55 + 2.4/1.55)/2 ≈ 1.5491935
    x_{3} ≈ 1.5491933 = sqrt(2.4) to 7 digits
    (quadratic convergence: the number of accurate digits roughly doubles per iteration)

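The worked example above can be reproduced directly; this is a minimal sketch assuming the standard Newton-Raphson recurrence with the table-supplied seed x_{0} = 1.5 (the function name is illustrative).

```python
def newton_sqrt(z, x0, iterations):
    """Newton-Raphson on f(x) = x^2 - z: x_{i+1} = (x_i + z/x_i) / 2.
    Prints each iterate, as in the slide's z = 2.4 example."""
    x = x0
    for i in range(iterations):
        x = (x + z / x) / 2
        print(f"x{i + 1} = {x:.10f}")
    return x

newton_sqrt(2.4, 1.5, 3)   # converges toward sqrt(2.4) = 1.5491933...
```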
Nonrestoring Parallel Square Rooter

Function Evaluation
• We looked at square root calculation
    Direct hardware implementation (binary, BSD, high-radix)
    o Serial
    o Parallel
    Approximation (Newton method)
• What about other functions?
    Direct implementation
    o Example: some functions can be directly implemented in hardware
    (using square root as a subcomponent)
    Polynomial approximation
    Table lookup
    o Either as part of the calculation or for the full calculation
Table Lookup


Direct table-lookup implementation
Table lookup with pre- and postprocessing

Linear Interpolation Using Four Subintervals
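The idea behind this figure can be sketched as follows, taking sqrt on [1, 2) as the target function. The four-subinterval split follows the figure's premise; storing exact endpoint values in the table is an assumption (a real design would tune the entries to minimize worst-case error).

```python
def interp_sqrt(x, subintervals=4):
    """Approximate sqrt on [1, 2) by linear interpolation between
    lookup-table entries at the subinterval endpoints (illustrative)."""
    step = 1.0 / subintervals
    # Table of sqrt at the 5 endpoints of the 4 subintervals.
    table = [(1 + i * step) ** 0.5 for i in range(subintervals + 1)]
    i = int((x - 1) / step)           # which subinterval x falls in
    frac = (x - 1) / step - i         # position within the subinterval
    return table[i] + frac * (table[i + 1] - table[i])

# Small error from only a 5-entry table, illustrating the size/accuracy tradeoff.
print(interp_sqrt(1.3), 1.3 ** 0.5)
```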

Piecewise Table Lookup

Accuracy vs. Lookup Table Size Tradeoff
