Floating point/Lesson Four

Single Precision

Most computers store floating-point numbers according to a standard set by the IEEE (IEEE 754). This lets computer scientists concentrate on the error in a computation rather than on how their particular computer operates.

Single precision numbers are numbers stored according to these rules:

1) Numbers are converted to ±k × 2^m, where k is a binary number of the form 1.f and m is the exponent. The number k satisfies 1 ≤ k < 2, so its first digit is always 1; that digit is assumed rather than stored.

2) The number k is rounded so that it contains only 24 bits, counting the leading 1.

Figure: IEEE Single Precision

3) The exponent can be from -126 to +127. It is stored in the computer as m + 127.

4) The computer stores the following:

a) The sign bit (1 if negative, 0 if positive)

b) The biased exponent (8 bits, 00000001 to 11111110, signifying 1 to 254, but actually representing -126 to 127).

c) The digits of the number itself. This section is called the mantissa. Because k always begins with a 1, that 1 is not stored, but the next 23 binary digits are. The unstored 1 is called the hidden bit.

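The radix point (here, the binary point) sits immediately after the hidden 1 and before the stored mantissa bits. In the stored layout, this places it at the boundary between the exponent field and the mantissa field. See the IEEE Single Precision picture.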
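The storage rules above can be checked directly in Matlab. The following is a minimal sketch, not part of the original lesson: it reinterprets the 32 bits of a single-precision number with typecast and splits them into the three fields. The value -6.5 and the variable names are chosen only for illustration.

```matlab
% Split a single-precision number into its three stored fields.
x = single(-6.5);                                   % -6.5 = -(1.101)_2 x 2^2

% Reinterpret the 32 bits as an unsigned integer, then print them as text.
bits = dec2bin(double(typecast(x, 'uint32')), 32);  % 32 characters, '0' or '1'

sign_bit  = bits(1);                                % '1' because x is negative
exp_bits  = bits(2:9);                              % biased exponent, m + 127
frac_bits = bits(10:32);                            % 23 stored mantissa bits

m = bin2dec(exp_bits) - 127;                        % remove the bias to recover m
```

Here -6.5 = -(1.101)₂ × 2^2, so the sign bit is 1, the biased exponent is 2 + 127 = 129 = (10000001)₂, and the stored mantissa is 101 followed by zeros. The leading 1 of 1.101 does not appear in frac_bits because it is the hidden bit.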

Single Precision Example

The number 1 would be stored as the following ('1' is equal to 1 × 2^0):

0          01111111            00000000000000000000000
positive   exponent equal to   the first '1' is not included;
number     127 (= 0 + 127)     the 23 stored bits are all 0
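The example can also be run in the other direction. As a small sketch (the variable names are illustrative, not from the lesson), we assemble the 32-bit pattern shown above and ask Matlab to reinterpret it as a single-precision number.

```matlab
% Assemble the 32-bit pattern from the example and reinterpret it as a single.
sign_bit  = '0';                       % positive number
exp_bits  = '01111111';                % 127, so m = 127 - 127 = 0
frac_bits = repmat('0', 1, 23);        % hidden 1 is not stored; 23 zeros follow

pattern = uint32(bin2dec([sign_bit exp_bits frac_bits]));
value   = typecast(pattern, 'single'); % reinterpret the bits, no numeric conversion
```

The result held in value should be 1, of class single, which matches 1 × 2^0 with an all-zero stored mantissa.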

Machine Epsilon, Single Precision

Machine epsilon, as a reminder, is the smallest power of two ε such that 1 + ε ≠ 1 on the machine; equivalently, it is the gap between 1 and the next larger machine number. The example above shows that 23 bits are available in the mantissa. Thus, adding 2^-23 to 1 sets the last stored bit, and the mantissa reads 00000000000000000000001. Anything smaller than half of this gap is rounded away. Thus, ε = 2^-23.

Matlab Code Example

Here, we demonstrate that machine epsilon is indeed 2^-23. The command 'single' forces Matlab to store the number in single precision. Adding 2^-25 or 2^-24 leaves 1 unchanged, while adding 2^-23 does not.

>> single ( 1 - single ( 1 + 2^(-25) ) )

ans =

     0

>> single ( 1 - single ( 1 + 2^(-24) ) )

ans =

     0

>> single ( 1 - single ( 1 + 2^(-23) ) )

ans =

 -1.1921e-007
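Matlab also reports machine epsilon directly through the built-in function eps. As a further check, the sketch below (variable names are illustrative) compares eps('single') with 2^-23 and inspects the stored mantissa of 1 + ε.

```matlab
e = eps('single');                         % built-in single-precision machine epsilon
matches = (e == single(2^(-23)));          % compare with the value derived above

next_after_one = single(1) + e;            % first single-precision number above 1
bits = dec2bin(double(typecast(next_after_one, 'uint32')), 32);
frac_bits = bits(10:32);                   % 22 zeros followed by a single 1

half_lost = (single(1) + e/2 == single(1)); % half of epsilon is rounded away
```

The stored mantissa of 1 + ε should read 00000000000000000000001, in agreement with the argument in the previous section.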

Double Precision

Figure: IEEE Double Precision

Double precision operates in the same manner as single precision, except that 64 bits rather than 32 are allocated to each number. Again there is 1 sign bit, but there are now 11 bits for the exponent and 52 bits for the mantissa. The exponent is again biased, this time by 1023. Machine epsilon is 2^-52.
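Since double is Matlab's default type, these figures can be checked without any conversion. The following is a brief sketch with illustrative variable names; it uses num2hex to read the sign and exponent bits of 1.0 from its hexadecimal representation.

```matlab
e_double = eps;                            % machine epsilon for double precision
matches  = (e_double == 2^(-52));

absorbed = ((1 + 2^(-53)) == 1);           % half of epsilon is lost when added to 1
kept     = ((1 + 2^(-52)) > 1);            % epsilon itself survives

h = num2hex(1);                            % 16 hex digits = 64 bits of the double 1.0
biased_exponent = hex2dec(h(1:3));         % first 12 bits: sign (0) and 11 exponent bits
```

Because 1 = 1 × 2^0 and its sign bit is 0, the first twelve bits hold just the biased exponent, which should equal 0 + 1023.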

Interesting Proof

Here, we prove that the relative error of storing a number in single precision (indeed, any precision) with rounding is at most machine epsilon divided by 2. Chopping (not rounding) results in a relative error of up to ε.

Let x be a positive real number in the normalized range, written as x = (0.1b₂b₃b₄...)₂ × 2^k. (Here k is the exponent of this normalization, not the number k from rule 1.) Denote x₋ as the machine number just below x, and x₊ as the machine number just above. In single precision, x₋ = (0.1b₂b₃...b₂₄)₂ × 2^k. Additionally, x₊ = [(0.1b₂b₃...b₂₄)₂ + 2^-24] × 2^k, so that x₊ - x₋ = 2^{-24 + k}.

Assume without loss of generality that x is closer to x₋. Then, |x - x₋| ≤ (1/2) |x₊ - x₋| = 2^{-25 + k}.

Then, since (0.1b₂b₃b₄...)₂ ≥ 1/2, $\left|\frac{x-x_{-}}{x}\right| \leq \frac{2^{-25+k}}{(0.1b_{2}b_{3}b_{4}\ldots)_{2}\times 2^{k}} \leq \frac{2^{-25}}{\frac{1}{2}} = 2^{-24} = \frac{1}{2}\epsilon$.
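The bound can also be observed numerically. The sketch below is an illustration, not a proof: it rounds a set of double-precision sample values to single precision and measures the relative error of each. The sample range and variable names are arbitrary choices made for this example.

```matlab
% Round many double-precision values to single and measure the relative error.
x  = linspace(1, 2, 100001) * 2^10;     % sample values between 2^10 and 2^11
fx = double(single(x));                 % nearest single, converted back to double

rel_err = abs(x - fx) ./ abs(x);        % relative error of storing each x

bound  = 2^(-24);                       % epsilon / 2 for single precision
worst  = max(rel_err);                  % largest error seen in this sample
within = all(rel_err <= bound);         % true if every sample obeys the bound
```

For these samples, worst should be nonzero but no larger than 2^-24. Replacing the rounding by chopping would require a different conversion than single, so that case is not shown here.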

Special Numbers

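The biased exponents 0 and 255 are reserved. Together with the sign bit and the significand, they encode special values such as zero (including −0), denormalized numbers, infinities, and NaN. The table also lists a few ordinary values for comparison: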

Type	Sign	Exponent	Significand	Value
Zero	*	0000 0000	000 0000 0000 0000 0000 0000	±0.0
One	0	0111 1111	000 0000 0000 0000 0000 0000	1.0
Minus One	1	0111 1111	000 0000 0000 0000 0000 0000	−1.0
Smallest denormalized number	*	0000 0000	000 0000 0000 0000 0000 0001	±2⁻²³ × 2⁻¹²⁶ = ±2⁻¹⁴⁹ ≈ ±1.4 × 10⁻⁴⁵
"Middle" denormalized number	*	0000 0000	100 0000 0000 0000 0000 0000	±2⁻¹ × 2⁻¹²⁶ = ±2⁻¹²⁷ ≈ ±5.88 × 10⁻³⁹
Largest denormalized number	*	0000 0000	111 1111 1111 1111 1111 1111	±(1−2⁻²³) × 2⁻¹²⁶ ≈ ±1.18 × 10⁻³⁸
Smallest normalized number	*	0000 0001	000 0000 0000 0000 0000 0000	±2⁻¹²⁶ ≈ ±1.18 × 10⁻³⁸
Largest normalized number	*	1111 1110	111 1111 1111 1111 1111 1111	±(2−2⁻²³) × 2¹²⁷ ≈ ±3.4 × 10³⁸
Positive infinity	0	1111 1111	000 0000 0000 0000 0000 0000	+∞
Negative infinity	1	1111 1111	000 0000 0000 0000 0000 0000	−∞
Not a number	*	1111 1111	non zero	NaN
* Sign bit can be either 0 or 1.
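Several rows of the table can be reproduced in Matlab by writing the bit pattern in hexadecimal and reinterpreting it as a single. This is a sketch only: the helper from_hex is a hypothetical name introduced here, and the NaN pattern shown is just one of the many patterns with a nonzero significand.

```matlab
% Reinterpret selected bit patterns from the table as single-precision values.
from_hex = @(h) typecast(uint32(hex2dec(h)), 'single');   % hypothetical helper

smallest_denormal = from_hex('00000001');   % exponent 0, last mantissa bit set
smallest_normal   = from_hex('00800000');   % exponent 0000 0001, mantissa zero
largest_normal    = from_hex('7F7FFFFF');   % exponent 1111 1110, mantissa all ones
pos_inf           = from_hex('7F800000');   % exponent 1111 1111, mantissa zero
neg_inf           = from_hex('FF800000');   % same pattern with the sign bit set
not_a_number      = from_hex('7FC00000');   % exponent 1111 1111, mantissa nonzero
```

The normalized extremes should agree with Matlab's built-in realmin('single') and realmax('single'), and the smallest denormalized number should equal 2^-149.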