# Floating point/Lesson Four

### Single Precision

Most computers store numbers according to a standard set by the IEEE. Because the format is the same everywhere, computer scientists can concentrate on rounding error rather than on how their particular machine operates.

Single precision numbers are numbers stored according to these rules:

1) Numbers are converted to the form $\pm k \times 2^{m}$, where k is a binary number of the form 1.f and m is the exponent. The number k satisfies 1 ≤ k < 2, and its leading digit is not stored, because it is always 1.

2) The number k is rounded so that it contains only 24 bits (the leading 1 plus 23 bits of f).

3) The exponent can be from -126 to +127. It is stored in the computer as m + 127.

4) The computer stores the following:

a) The sign bit (1 if negative, 0 if positive)

b) The biased exponent (8 bits, 00000001 to 11111110, signifying 1 to 254, but actually representing -126 to 127).

c) The actual number. This section is called the mantissa. Because the number is assumed to have a 1 at the beginning, the 1 is not stored, but the next 23 digits are. The 1 is called the hidden bit.

The radix point of 1.f sits immediately after the hidden bit, so in the stored layout it falls between the exponent field and the mantissa field. See the IEEE Single Precision picture.
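The storage rules above can be checked directly in Matlab. The following is a minimal sketch, assuming the built-in functions `typecast`, `dec2bin`, and `bin2dec` are available; the test value -6.25 and all variable names are chosen only for illustration.

```matlab
% Split a single precision number into its three stored fields.
x = single(-6.25);                            % -6.25 = -1.1001 (binary) x 2^2

% Reinterpret the 32 stored bits as an unsigned integer, then as a bit string.
bits = dec2bin(typecast(x, 'uint32'), 32);

signBit  = bits(1);                           % 1 bit
expBits  = bits(2:9);                         % 8 bits, biased exponent m + 127
fracBits = bits(10:32);                       % 23 bits, the f in 1.f

% Rebuild the value from the fields: (-1)^sign * 1.f * 2^m.
m = bin2dec(expBits) - 127;                   % remove the bias
k = 1 + bin2dec(fracBits) / 2^23;             % restore the hidden bit
rebuilt = (-1)^bin2dec(signBit) * k * 2^m;

fprintf('sign %s | exponent %s | mantissa %s\n', signBit, expBits, fracBits);
fprintf('m = %d, k = %g, rebuilt value = %g\n', m, k, rebuilt);
```

For this input the fields should come out as sign 1, exponent 10000001 (129, so m = 2), and mantissa 1001 followed by 19 zeros, which rebuilds to -6.25.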

### Single Precision Example

The number 1 would be stored as follows (1 is equal to $1 \times 2^{0}$):

| Sign bit | Biased exponent | Mantissa |
|---|---|---|
| 0 | 01111111 | 00000000000000000000000 |
| positive number | exponent equal to 127 (= 0 + 127) | the first '1' is not included; the next 23 bits are all 0 |
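The bit pattern above can be turned back into a number as a check. This is a small sketch, assuming `bin2dec` and `typecast` are available; the variable names are illustrative.

```matlab
% Assemble the 32 bits from the example: sign, biased exponent, mantissa.
pattern = ['0', '01111111', '00000000000000000000000'];

% Interpret the bit string as an unsigned integer, then reinterpret
% those same 32 bits as a single precision number.
asInteger = uint32(bin2dec(pattern));
value = typecast(asInteger, 'single');

disp(value)                                   % displays 1
```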


### Machine Epsilon, Single Precision

Machine epsilon, as a reminder, is the smallest power of 2 such that 1 + ε ≠ 1 on the machine, which is the gap between 1 and the next larger machine number. There are 23 bits available in the mantissa, as in the example above. Thus, as soon as $2^{-23}$ is added, another '1' will be stored: the mantissa will read 00000000000000000000001. Thus, ε = $2^{-23}$.
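One way to find machine epsilon experimentally is to keep halving a candidate until adding half of it to 1 no longer changes the result. The loop below is a sketch of that idea in single precision; the variable names are illustrative, and `eps('single')` is Matlab's built-in value, used here only for comparison.

```matlab
% Halve a candidate until adding half of it to 1 no longer changes 1.
one = single(1);
candidate = single(1);
while one + candidate / 2 > one
    candidate = candidate / 2;
end

fprintf('found epsilon   = %g\n', candidate);
fprintf('2^(-23)         = %g\n', 2^(-23));
fprintf('eps(''single'')   = %g\n', eps('single'));
```

The loop stops when the candidate reaches $2^{-23}$, because 1 + $2^{-24}$ lies exactly halfway between two machine numbers and is rounded back to 1.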

### Matlab Code Example

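Here, we demonstrate that machine epsilon is indeed $2^{-23}$. The command `single` forces Matlab to store a number in single precision. Each command computes 1 − (1 + δ), so the result is 0 when δ is lost in storage and −δ when it is kept.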

```matlab
>> single(1 - single(1 + 2^(-25)))

ans =

0

>> single(1 - single(1 + 2^(-24)))

ans =

0

>> single(1 - single(1 + 2^(-23)))

ans =

-1.1921e-007
```


### Double Precision

Double precision works in the same manner as single precision, except that more space is allocated to each number: 1 sign bit, 11 bits for the exponent, and 52 bits for the mantissa. The exponent is again biased, this time by 1023. Machine epsilon is $2^{-52}$.
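The same checks carry over to double precision, which is Matlab's default type. This is a short sketch, assuming `num2hex` returns the 16 hexadecimal digits of the stored pattern; the first three digits hold the sign bit and the 11 exponent bits. Variable names are illustrative.

```matlab
% Matlab numbers are double precision unless converted.
hexDigits = num2hex(1);                       % 64 stored bits as 16 hex digits
biasedExp = hex2dec(hexDigits(1:3));          % sign bit (0 here) + 11 exponent bits

fprintf('stored pattern of 1: %s\n', hexDigits);
fprintf('biased exponent: %d, so m = %d\n', biasedExp, biasedExp - 1023);

% Machine epsilon: 2^(-52) changes 1, while 2^(-53) is rounded away.
disp(1 + 2^(-52) > 1)                         % displays 1 (true)
disp(1 + 2^(-53) == 1)                        % displays 1 (true)
disp(eps == 2^(-52))                          % displays 1 (true)
```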

### Interesting Proof

Here, we prove that the relative error of storing a number in single precision (indeed, in any precision) is at most machine epsilon divided by 2 when the number is rounded to the nearest machine number. Chopping (not rounding) results in a relative error of at most ε.

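Write the actual number as $x = (0.1b_{2}b_{3}b_{4}\ldots )_{2}\times 2^{k}$. (This normalization places the leading 1 just after the radix point, so k is one larger than the exponent m used in the storage rules.) Denote by $x_{-}$ the machine number just below x, and by $x_{+}$ the machine number just above it. In single precision, $x_{-} = (0.1b_{2}b_{3}\ldots b_{24})_{2}\times 2^{k}$. Additionally, $x_{+} = [(0.1b_{2}b_{3}\ldots b_{24})_{2} + 2^{-24}]\times 2^{k}$.

Assume without loss of generality that x is closer to $x_{-}$. Then, $|x - x_{-}| \leq {\frac {1}{2}}|x_{+} - x_{-}| = 2^{-25+k}$.

Then, since $(0.1b_{2}b_{3}b_{4}\ldots )_{2}\geq (0.1)_{2} = {\frac {1}{2}}$, ${\displaystyle \left|{\frac {x-x_{-}}{x}}\right|\leq {\frac {2^{-25+k}}{(0.1b_{2}b_{3}b_{4}\ldots )_{2}\times 2^{k}}}\leq {\frac {2^{-25}}{\frac {1}{2}}}=2^{-24}={\frac {1}{2}}\epsilon }$.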
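The bound can also be checked numerically. The sketch below treats double precision values as the "actual" numbers x, rounds them to single precision, and compares the largest observed relative error with ε/2 = $2^{-24}$. The sample size, the seed, and the range of exponents are arbitrary choices, and a finite sample illustrates the bound rather than proving it.

```matlab
rng(0);                                       % fixed seed for repeatability
n = 1e5;

% Random doubles 1.f x 2^e with e between -20 and 20 (normalized range).
x = (1 + rand(1, n)) .* 2.^randi([-20, 20], 1, n);

% Round each value to single, then measure the error in double precision.
stored = double(single(x));
relErr = abs(stored - x) ./ abs(x);

fprintf('largest relative error: %g\n', max(relErr));
fprintf('bound 2^(-24):          %g\n', 2^(-24));
disp(max(relErr) <= 2^(-24))                  % displays 1 (true)
```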

### Special Numbers

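Biased exponents of 0 and 255 are reserved for special values: zero (including −0), denormalized numbers, infinities, and NaN. The table lists them alongside a few ordinary numbers for comparison: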

| Type | Sign | Exponent | Significand | Value |
|---|---|---|---|---|
| Zero | 0 | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0 |
| One | 0 | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0 |
| Minus One | 1 | 0111 1111 | 000 0000 0000 0000 0000 0000 | −1.0 |
| Smallest denormalized number | * | 0000 0000 | 000 0000 0000 0000 0000 0001 | $\pm 2^{-23}\times 2^{-126} = \pm 2^{-149} \approx \pm 1.4\times 10^{-45}$ |
| "Middle" denormalized number | * | 0000 0000 | 100 0000 0000 0000 0000 0000 | $\pm 2^{-1}\times 2^{-126} = \pm 2^{-127} \approx \pm 5.88\times 10^{-39}$ |
| Largest denormalized number | * | 0000 0000 | 111 1111 1111 1111 1111 1111 | $\pm (1-2^{-23})\times 2^{-126} \approx \pm 1.18\times 10^{-38}$ |
| Smallest normalized number | * | 0000 0001 | 000 0000 0000 0000 0000 0000 | $\pm 2^{-126} \approx \pm 1.18\times 10^{-38}$ |
| Largest normalized number | * | 1111 1110 | 111 1111 1111 1111 1111 1111 | $\pm (2-2^{-23})\times 2^{127} \approx \pm 3.4\times 10^{38}$ |
| Positive infinity | 0 | 1111 1111 | 000 0000 0000 0000 0000 0000 | $+\infty$ |
| Negative infinity | 1 | 1111 1111 | 000 0000 0000 0000 0000 0000 | $-\infty$ |
| Not a number | * | 1111 1111 | non zero | NaN |

\* Sign bit can be either 0 or 1.

(copied from IEEE 754-1985)
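Several rows of the table can be reproduced by building the bit patterns directly. This is a minimal sketch, assuming `typecast`, `hex2dec`, and `num2hex` are available; the helper `fromHex` is a hypothetical name introduced here, and the hexadecimal strings are just the table's bit patterns written four bits per digit.

```matlab
% Helper: reinterpret a 32-bit pattern (given in hex) as a single.
fromHex = @(h) typecast(uint32(hex2dec(h)), 'single');

smallestDenormal = fromHex('00000001');       % only the last mantissa bit set
positiveInfinity = fromHex('7F800000');       % exponent all ones, mantissa zero
notANumber       = fromHex('7FC00000');       % exponent all ones, mantissa nonzero

disp(smallestDenormal == single(2^(-149)))    % displays 1 (true)
disp(isinf(positiveInfinity))                 % displays 1 (true)
disp(isnan(notANumber))                       % displays 1 (true)

% Going the other way: the stored patterns of 1 and -1.
disp(num2hex(single(1)))                      % 3f800000
disp(num2hex(single(-1)))                     % bf800000
```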

### Homework

1) What is the difference between machine epsilon and realmin?

2) In single precision, what will the following computations yield?

1 + ε

1 + realmin

realmin + ε

(2 + ε) - 1

(-1 + ε) + 2

3) Find realmax in single precision, both by hand, and by using Matlab.

### Sources

(including proof) Ward Cheney and David Kincaid. Numerical Mathematics and Computing. Belmont, CA: Thomson, 2004.