Floating Point Arithmetic

Floating-Point vs. Fixed-Point Numbers

Fixed point has limitations
x = 0000 0000.0000 1001₂
y = 1001 0000.0000 0000₂
Overflow? (x² underflows and y² overflows)
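As a sketch of this limitation, assuming an 8.8 fixed-point format (8 integer bits, 8 fraction bits, 16 bits total) to match the bit patterns above:

```python
# Sketch of the fixed-point limitation, assuming an 8.8 format
# (8 integer bits, 8 fraction bits, 16 bits total).
FRAC_BITS, TOTAL_BITS = 8, 16

def to_fixed(v: float) -> int:
    """Quantize v to the 8.8 grid (no saturation; illustration only)."""
    return round(v * 2**FRAC_BITS)

x = 9 / 256      # 0000 0000.0000 1001 in binary
y = 144.0        # 1001 0000.0000 0000 in binary

print(to_fixed(x * x))                   # 0: x^2 underflows to zero
print(to_fixed(y * y) >= 2**TOTAL_BITS)  # True: y^2 overflows 16 bits
```

Both inputs are representable, yet one square is too small for the fraction bits and the other too large for the integer bits, which is the motivation for a separate exponent field.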

• Floating point: represent numbers in two fixed-width
fields: “magnitude” and “exponent”

Magnitude: more bits = more accuracy
Exponent: more bits = wider range of numbers
[Figure: a floating-point number X stored as three fixed-width fields: sign (±), exponent, magnitude]
Floating Point Number Representation

• Sign field:
When 0: positive number; when 1: negative

• Exponent:
Usually represented as unsigned by adding an offset (bias)
Example: 4 bits of exponent, offset = 8 (stored exponent = true exponent + 8)

• Magnitude (also called significand, mantissa)
Shift the number to get: 1.xxxx
Magnitude is the fractional part (hidden ‘1’)
Example: 6 bits of mantissa
o Number = 110.0101₂, shift: 1.100101 × 2², mantissa = 100101
o Number = 0.0001011₂, shift: 1.011 × 2⁻⁴, mantissa = 011000 (zero-padded to 6 bits)
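The normalization above can be sketched in Python (a sketch of the slide's toy format, not IEEE; `encode` is a hypothetical helper name, and the 4-bit/offset-8 exponent with 6-bit mantissa comes from the examples above):

```python
import math

EXP_BITS, BIAS, MANT_BITS = 4, 8, 6   # the slide's toy format: offset = 8, 6-bit mantissa

def encode(x: float):
    """Split x into (sign, biased exponent, mantissa bits), hiding the leading 1."""
    sign = 0 if x >= 0 else 1
    x = abs(x)
    e = math.floor(math.log2(x))       # shift the number to the form 1.xxxx * 2^e
    frac = x / 2**e - 1.0              # fractional part; the leading 1 is hidden
    mant = round(frac * 2**MANT_BITS)  # keep MANT_BITS bits of the fraction
    return sign, e + BIAS, format(mant, f"0{MANT_BITS}b")

print(encode(0b1100101 / 2**4))   # 110.0101_2  -> (0, 10, '100101')
print(encode(0b1011 / 2**7))      # 0.0001011_2 -> (0, 4, '011000')
```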
Floating Point Numbers: Example
[Figure: example numbers encoded in the ± / exponent / magnitude fields]
Floating Point Number Range

• Range: [-max, -min] U [min, max]
Min = smallest magnitude
Max = largest magnitude

• What happens if:
We increase # bits for exponent?
Increase # bits for magnitude?

Floating Point Operations

• Addition/subtraction, multiplication/division,
function evaluations, ...

• Basic operations

Adding exponents / magnitudes
Multiplying magnitudes
Aligning magnitudes (shifting, adjusting the exponent)
Checking for overflow/underflow
Normalization (shifting, adjusting the exponent)

Floating Point Addition

• More difficult than multiplication!

• Operations:

Align magnitudes (so that exponents are equal)
Add (and round)
Normalize (result in the form of 1.xxx)
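A minimal sketch of the align/add/normalize sequence for positive operands (significands carried as integers with the hidden 1 included; `fp_add` is a hypothetical helper, and guard/round/sticky bits are omitted):

```python
def fp_add(e1: int, m1: int, e2: int, m2: int, mant_bits: int = 6):
    """Add two positive FP numbers; each encodes the value m * 2**(e - mant_bits)."""
    if e1 < e2:                        # swap so the first operand has the larger exponent
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2 >>= e1 - e2                     # align magnitudes (shifted-out bits are dropped here)
    s, e = m1 + m2, e1                 # add magnitudes
    while s >= 2 ** (mant_bits + 1):   # normalize back to the 1.xxx form
        s >>= 1
        e += 1
    return e, s

# 1.100101 * 2^2 + 1.011000 * 2^-1 = 6.3125 + 0.6875 = 7.0
e, s = fp_add(2, 0b1100101, -1, 0b1011000)
print(e, s, s * 2.0 ** (e - 6))   # 2 112 7.0
```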

Floating Point Adder Architecture

Floating Point Adder Components

• Unpacking
Inserting the “hidden 1”
Checking for special inputs (NaN, zero)

• Exponent difference
Used in aligning the magnitudes
A few bits are enough for the subtraction
o With a 32-bit magnitude adder and 8 bits of exponent, only 5 bits
are involved in the subtraction (larger shifts push the operand out entirely)
If the difference is negative, swap the operands and use the positive difference
o How to compute the positive difference?

• Pre-shifting and swap
Shift/complement provided for one operand only
Swap if needed

• Rounding
Three extra bits (guard, round, sticky) used for rounding

• Post-shifting
Result is in the range (-4, 4)
Right shift: 1 bit max
o If right shift, increment the exponent by 1
Left shift: up to # of bits in magnitude
o Determine the # of consecutive 0’s (1’s) in z, beginning with z₁
Adjust the exponent accordingly

• Packing
Check for special results (zero, under-/overflow)
Remove the hidden 1

Counting vs. Predicting Leading Zeros/Ones
Counting: simpler, but on the critical path (must wait for the adder result)
Predicting: more complex circuit, but overlaps with the addition itself
Floating Point Multiplication

• Simpler than floating-point addition
• Operation:
Output = (−1)^(s1 XOR s2) × (m1 × m2) × 2^(e1 + e2 − bias)
Sign: XOR of the input signs
Exponent:
o Tentatively computed as e1 + e2
o Subtract the bias (=127): each biased exponent carries one bias, so the sum carries it twice
o Adjusted after normalization
Magnitude:
o Result in the range [1,4) (inputs in the range [1,2))
o Normalization: 1- or 2-bit shift right, depending on rounding
o Result is 2(1+m) bits, should be rounded to (1+m) bits
o Rounding can gradually discard bits, instead of in one last stage
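The sign/exponent/magnitude steps above can be sketched as follows (a sketch with truncation instead of proper rounding; `fp_mul` and the integer significand encoding are assumptions, with the bias of 127 from the slide):

```python
BIAS = 127   # single-precision bias, as on the slide

def fp_mul(s1, e1, m1, s2, e2, m2, mant_bits):
    """Multiply two FP numbers; significands are integers with mant_bits fraction
    bits and the hidden 1 included, so each encodes a value in [1, 2)."""
    sign = s1 ^ s2                      # sign: XOR
    e = e1 + e2 - BIAS                  # e1 and e2 each carry one bias; remove the extra one
    p = m1 * m2                         # product in [1, 4) with 2*mant_bits fraction bits
    if p >= 2 ** (2 * mant_bits + 1):   # normalize: 1-bit right shift if product >= 2
        p >>= 1
        e += 1
    return sign, e, p >> mant_bits      # truncate back to mant_bits fraction bits

# 1.5 * 1.5 = 2.25 = 1.125 * 2^1, with 6-bit mantissas (1.5 -> 96, 1.125 -> 72)
print(fp_mul(0, 127, 96, 0, 127, 96, 6))   # (0, 128, 72)
```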

Floating Point Multiplier Architecture
Pipelining is used in the multiplier, as well as the other blocks

Square Root

• Most important elementary function
• In the IEEE standard, specified as a basic operation
(alongside +, −, ×, /)
• Very similar to division
• Pencil-and-paper method:

Square root :

Square Rooting: Example

• Example: sqrt(95241), with digits grouped in pairs: 9 52 41

• Why double the partial root?
Partial root after step 2 is q
Appending the next digit gives 10q + q0
Square of which is 100q² + 20q·q0 + q0²
The term 100q² has already been subtracted
Find q0 such that (20q + q0)·q0 is the
max number ≤ partial remainder

• The binary case:
Square of 2q + q0 is: 4q² + 4q·q0 + q0²

Find q0 such that (4q + q0)·q0 is ≤ partial remainder
For q0 = 1 the expression becomes 4q + 1 (i.e.,
append “01” to the partial root)

Square Rooting: Example Base 2

• Example: sqrt(01110110₂) = sqrt(118)
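The restoring shift/subtract recurrence can be sketched in software (a sketch; `isqrt_restoring` is a hypothetical name, and n is an even bit-width covering the radicand):

```python
def isqrt_restoring(z: int, n: int):
    """Restoring digit-by-digit square root of an n-bit integer z (n even).
    Returns (q, r) with q = floor(sqrt(z)) and r = z - q*q."""
    q, r = 0, 0
    for i in range(n // 2 - 1, -1, -1):    # consume two bits of z per step
        r = (r << 2) | ((z >> (2 * i)) & 0b11)
        trial = (q << 2) | 0b01            # append "01": subtract (4q + 1)
        if r >= trial:
            r -= trial
            q = (q << 1) | 1               # root digit 1
        else:
            q = q << 1                     # root digit 0; remainder restored (kept)
    return q, r

# Example from the slides: sqrt(01110110_2) = sqrt(118)
print(isqrt_restoring(0b01110110, 8))   # (10, 18): 10^2 = 100, remainder 18
```

Each step shifts in the next pair of bits and tentatively appends a 1 to the partial root; if the subtraction of 4q + 1 would go negative, the old remainder is kept, which is the "restoring" step.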

Sequential Shift/Subtract Square Rooter Architecture

Other Methods for Square Rooting

• Restoring vs. non-restoring
We looked at the restoring algorithm
(after subtraction, restore partial remainder if the
result is negative)
Use a different encoding (use digits {-1,1} instead of
{0,1}) to avoid restoring

• High-radix
Similar to modified Booth encoding in multiplication:
handle more bits at a time
More complex circuit, but faster

• Convergence methods
Use the Newton method to approximate the function
The iteration approximates 1/√z;
multiply by z to get √z
Iteratively improve the accuracy
Can use a lookup table for the first iteration
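A sketch of this convergence idea: Newton iteration on f(y) = 1/y² − z converges to 1/√z and needs no division, and one final multiply by z yields √z. The starting value y₀ would come from the lookup table mentioned above; the function name is hypothetical.

```python
def sqrt_by_inverse(z: float, y0: float, iters: int) -> float:
    """Approximate sqrt(z) via y -> 1/sqrt(z), then one multiply by z."""
    y = y0
    for _ in range(iters):
        y = y * (3 - z * y * y) / 2   # division-free Newton step for 1/sqrt(z)
    return z * y                      # z * (1/sqrt(z)) = sqrt(z)

print(sqrt_by_inverse(2.4, 0.7, 5))   # close to sqrt(2.4), with y0 from a small table
```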

Square Rooting: Abstract Notation

Floating point format:
- Shift left (not right)
- Powers of 2 decreasing

Restoring Floating-Point Square Root Calc.

Nonrestoring Floating-Point Square Root Calc.

If the final S is negative, drop the last ‘1’ in q, and restore the
remainder to the last positive value.

Square Root Through Convergence

• Newton-Raphson method: x_{i+1} = (x_i + z/x_i) / 2, which converges to √z

• Example: compute the square root of z = (2.4)₁₀

x₀ read out from table = 1.5
x₁ = (1.5 + 2.4/1.5)/2 = 1.55
x₂ = (1.55 + 2.4/1.55)/2 ≈ 1.5491935
x₃ ≈ 1.5491933 (√2.4 = 1.5491933…)
The number of accurate digits roughly doubles with each iteration
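The worked example follows the standard recurrence x_{i+1} = (x_i + z/x_i)/2; a small sketch reproducing it (function name is illustrative):

```python
def newton_sqrt(z: float, x0: float, iters: int) -> float:
    """Newton-Raphson on f(x) = x^2 - z: x_{i+1} = (x_i + z / x_i) / 2."""
    x = x0
    for _ in range(iters):
        x = (x + z / x) / 2
    return x

print(newton_sqrt(2.4, 1.5, 1))   # ≈ 1.55
print(newton_sqrt(2.4, 1.5, 3))   # ≈ 1.5491933, close to 2.4 ** 0.5
```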
Non-Restoring Parallel Square Rooter

Function Evaluation

• We looked at square root calculation
Direct hardware implementation (binary, BSD, high-radix)
o Serial
o Parallel
Approximation (Newton method)

• What about other functions?
Direct implementation
o Example: can be directly implemented in hardware
(using square root as a sub-component)
Polynomial approximation
Table look-up
o Either as part of calculation or for the full calculation

Table Lookup
Direct table lookup
Table lookup with pre- and post-processing
Linear Interpolation Using Four Subintervals
Piecewise Table Lookup
Accuracy vs. Lookup Table Size Trade-off
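A sketch of piecewise-linear table lookup illustrating the accuracy/size trade-off (the four-subinterval split mirrors the slide title; `make_table` and `interp_lookup` are hypothetical names):

```python
def make_table(f, lo, hi, n):
    """Precompute f at n+1 evenly spaced points on [lo, hi]."""
    step = (hi - lo) / n
    return [f(lo + i * step) for i in range(n + 1)], lo, step

def interp_lookup(table, lo, step, x):
    """Pick the subinterval by table lookup, then interpolate linearly inside it."""
    i = min(int((x - lo) / step), len(table) - 2)
    t = (x - lo - i * step) / step
    return table[i] * (1 - t) + table[i + 1] * t

# sqrt on [1, 2) with four subintervals; more entries trade table size for accuracy
tab4, lo4, step4 = make_table(lambda v: v ** 0.5, 1.0, 2.0, 4)
err4 = abs(interp_lookup(tab4, lo4, step4, 1.3) - 1.3 ** 0.5)
tab16, lo16, step16 = make_table(lambda v: v ** 0.5, 1.0, 2.0, 16)
err16 = abs(interp_lookup(tab16, lo16, step16, 1.3) - 1.3 ** 0.5)
print(err4 > err16)   # True: a larger table gives a smaller error
```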
