## Basic Floating Point Representation

Expert Reference: Prof. William Kahan, Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, EECS, University of California at Berkeley, May 1996. See this PostScript paper for more authoritative information. (Note that the notation below is somewhat different, so that it is more consistent with the notation of MCS 471 Numerical Analysis.)

### Floating Point Representation According to IEEE 754 Standard:

• Table 1: Floating Point Precision Names:

| Format Name | Bits | Bytes | C Name | f90 Name |
|---|---|---|---|---|
| Single | 32 | 4 | float | real*4 |
| Double | 64 | 8 | double | real*8 |
| Extended* | 80+ | 10+ | long double? | real*10+ |
| Quadruple* | 128 | 16 | long double??? | real*16 |

*   Optional formats, depending on processor and compiler.

Note: Kahan uses "N = p" for the precision of the fraction and "K+1 = q" for the precision of the exponent.

• Table 2: Floating Point Precision Parameter Specifications:

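| Format Name | Total Bits | Sign Bits | Exponent Bits (q = Total - p) | Fraction Bits (p-1) | Exponent Bound (q-1) | Base b | Precision p |
|---|---|---|---|---|---|---|---|
| Single | 32 | 1 | 8 | 23 | 7 | 2 | 24 |
| Double | 64 | 1 | 11 | 52 | 10 | 2 | 53 |
| Extended* | 80+ | 1 | 15+ | 63+ | 14+ | 2 | 64+ |
| Quadruple* | 128 | 1 | 15 | 112 | 14 | 2 | 113 |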

*   Optional formats, depending on processor and compiler.

• Binary Floating Point Representation:

Let x be a real floating point number, represented by

fl[x] = ± (1+frac) * 2^(exp) ,

where "exp" is the binary (base 2) exponent and "frac" is the non-negative binary fraction, with "sign" ± and normalized so that

0   ≤   frac   <   1,

but subject to precision dependent bounds as well. The equation "1+frac = 1.frac" gives an alternate representation of the normalized fraction part. The symbol "^" denotes power or exponentiation. Kahan uses the notation "k = exp" for the unbiased exponent and "|n| = (1+frac)*2^(p-1)" for the unsigned integer form of the significand.

• Table 3: Normalized Binary Floating Point Parameter Bounds:

| Min[exp]=L | Max[exp]=U | Min[(1+frac)*2^(p-1)] | Max[(1+frac)*2^(p-1)] | Min[frac] | Max[frac] |
|---|---|---|---|---|---|
| 2-2^(q-1) | 2^(q-1)-1 | 2^(p-1) | 2^(p)-1 | 0 | 1-2^(1-p) |

• Table 4: IEEE 754 Normalized Binary Floating Point Storage:

This table shows how each of the 3 fields of the real floating point representation word is coded with integers [s | exp2pq | frac2pq] in the storage for each IEEE 754 Standard number type. Note that the first bit (the "1" in the "(1+frac)") of the "Significand" or "Fraction" is not stored, and the first bit of the "word" is always the sign bit "s", so that "(-1)^(s) = ±1". For "Normal" numbers,

frac2pq = frac*2^(p-1)

is an integer, and

exp2pq = exp + bias,

where

bias = 2^(q-1)-1 = U,

such that

1 = L + bias ≤ exp2pq = exp + bias ≤ U + bias = 2*(2^(q-1)-1) = 2^q-2.

IEEE 754 Standards Storage in each Binary Field of the Floating Point Representation Word
for (-1)^(s)*(1+frac)*2^(exp)   -->   [s | exp2pq | frac2pq]:

| Number Type | s = Sign Bit | exp2pq = q Bit Exponent | frac2pq = (p-1) Bits of Fraction |
|---|---|---|---|
| Zero | 0 or 1 | exp2pq = 0 | frac2pq = 0 |
| Normal | 0 or 1 | exp2pq = exp + 2^(q-1) - 1 | 0 ≤ frac2pq = frac*2^(p-1) < 2^(p-1) |
| Subnormal | 0 or 1 | exp2pq = 0 | 0 < frac2pq ≤ 2^(p-1) - 1 |
| Infinity | 0 or 1 | exp2pq = 2^(q)-1 = 11...11 bin | frac2pq = 0 |
| NaN = Not a Number | ? | exp2pq = 2^(q)-1 = 11...11 bin | 2^(p-2) ≤ frac2pq = 1.... ≤ 2^(p-1) - 1 |
| SNaN = Signaling NaN | ? | exp2pq = 2^(q)-1 = 11...11 bin | 1 ≤ frac2pq = 0.... ≤ 2^(p-2) - 1 |

Note: NaNs and Normals are distinguished by the NaN always having a biased exponent of "2^(q)-1" (all ones in binary: "exp2pq = 11...111") and a nonzero fraction (with a 1 in the leading fraction bit for the non-signaling NaN), while for normals the unbiased exponent upper bound, "exp ≤ 2^(q-1)-1", means that the biased exponent "exp2pq ≤ 2^(q)-2 = 11...110", which is always less than the NaN biased exponent by at least 1.

• IEEE 754 Overflow, Underflow, Gradual Underflow, Machine Epsilon, Significant Decimal Digits:

• Underflow and Gradual Underflow: The smallest normal absolute value follows from the lower bound on the unbiased exponent "exp ≥ (2-2^(q-1))" from Table 3 and the minimum fraction part "frac ≥ 0", also from Table 3, so that the Under Flow Level (UFL) is

|fl[x]| ≥ (1+0)*2^(2-2^(q-1)) = 2^(2-2^(q-1)) = 2^L,   |x|>0,

and anything smaller would be "Underflow". However, a nonzero real is not converted to a zero until it is smaller than the minimum subnormal, "2^(3-p-2^(q-1))", given by Kahan. The subnormals permit the existence of "Gradual Underflow to Zero".

• Overflow: The largest normal absolute value follows from the upper bounds in Table 3 for the unbiased exponent "exp ≤ (2^(q-1)-1)" and for the fractional part "frac ≤ (1-2^(1-p))", so that the Over Flow Level (OFL) is

|fl[x]| ≤ (2-2^(1-p))*2^(2^(q-1)-1) = (1-2^(-p))*2^(2^(q-1)) = (2-2^(1-p))*2^U,   |x|>0.

Note that a result that overflows may be stored as an infinity. See the sample output below.

• Table 5: Some IEEE 754 Floating Point Binary Storage Single Precision (SP) Examples:

| Example | s | exp2pq | frac2pq |
|---|---|---|---|
| Under Flow Level (UFL) | 0 | 00000001 | 00000000000000000000000 |
| Over Flow Level (OFL) | 0 | 11111110 | 11111111111111111111111 |
| -0 | 1 | 00000000 | 00000000000000000000000 |
| +1 | 0 | 01111111 | 00000000000000000000000 |
| -1 | 1 | 01111111 | 00000000000000000000000 |
| +2 | 0 | 10000000 | 00000000000000000000000 |
| +3 | 0 | 10000000 | 10000000000000000000000 |
| +4 | 0 | 10000001 | 00000000000000000000000 |

• Table 6: Machine Epsilon: The traditional machine epsilon, the smallest positive number that yields more than one when it is added to one, is officially given by
MacEps = 2^(1-p),
but with rounding it should be
MacEpsRounding = 2^(1-p)/2 = 2^(-p),
see class lecture notes.

| Format Name | Min Normal | Max Normal | Min Subnormal | Machine Epsilon | Sig. Decimal Digit Range |
|---|---|---|---|---|---|
| Single | 1.175e-38 | 3.403e+38 | 1.401e-45 | 1.192e-07 | 6-9 |
| Double | 2.2e-308 | 1.8e+308 | 4.9e-324 | 2.220e-16 | 15-17 |
| Extended | <3.4e-4932 | >1.2e+4932 | <3.6e-4951 | <1.084e-19 | >18-21 |
| Quadruple | 3.4e-4932 | 1.2e+4932 | 6.5e-4966 | 1.926e-34 | 33-36 |

Note: See Kahan for how the range of the equivalent significant decimal digits is computed, assuming there is no overflow or underflow. Note that a finite precision binary representation is not converted exactly to a finite precision decimal representation, and vice versa.

• Floating Point Arithmetic Output:

• Simple Example C Program:
```c
#include <stdio.h>
#include <math.h>

#define N 24   /* precision p of the single precision significand (Kahan's N) */
#define K 7    /* exponent bound q-1 for single precision (Kahan's K) */

int main(void)
{
    int Two2TwoK;
    float One, MinusOne, Zero, Xmax, Xmin, Xminminus, Xmaxplus, Xverysmall;
    float Two2K, Two2N, Two2Nm1;

    printf("Floating Point Arithmetic Examples:\n\n");
    One = 1.e0;
    MinusOne = -1.e0;
    Zero = 0.e0;
    printf("-1 = %14.7e (dec); 0 = %14.7e (dec); 1 = %14.7e (dec);\n\n",
           MinusOne, Zero, One);

    Two2K = pow(2., K);          /* 2^K = 2^(q-1) */
    Two2TwoK = pow(2, 2*K);      /* 2^(2K), used below only to force overflow */
    Two2N = pow(2, N);           /* 2^p */
    Two2Nm1 = pow(2, N-1);       /* 2^(p-1) */

    /* Smallest and largest normal numbers (UFL and OFL). */
    Xmin = pow(2., 2-Two2K);
    Xmax = (1.e0-1.0e0/Two2N)*pow(2., Two2K);
    printf("Xmin = 2^(2-2^%d); Xmax = (1-1/2^%d)*2^(2^%d):\n", K, N, K);
    printf("Xmin = %19.12e (dec); Xmax = %14.7e (dec)\n\n", Xmin, Xmax);

    /* Just below Xmin, and far above Xmax.  The exponent 2^(2K) is much
       larger than the 2^K shown in the label and overflows even in double,
       so an infinity is stored. */
    Xminminus = pow(2., 1-Two2K)*(1.e0+(Two2Nm1-1.e0)/Two2Nm1);
    Xmaxplus = pow(2., Two2TwoK);
    printf("Xminminus = 2^(1-2^%d)*(1+(2^%d-1)/2^%d); Xmaxplus = 2^(2^%d):\n",
           K, N-1, N-1, K);
    printf("Xminminus = %19.12e (dec); Xmaxplus = %14.7e (dec)\n",
           Xminminus, Xmaxplus);
    printf("Note Underflow is Gradual due to Subnormal Numbers;\n");
    printf("But Overflow Leads to an Infinity.\n\n");

    /* A subnormal value, then one too small even for a subnormal. */
    Xverysmall = 1.e0/pow(2., Two2K);
    printf("Xverysmall = 2^(-2^%d)= %19.12e (dec)\n", K, Xverysmall);
    printf("Gradual Underflow is Smaller Than Underflow, But Not Zero.\n");
    Xverysmall = 1.e0/pow(2., pow(2, K+1));
    printf("Xveryverysmall = 2^(-2^%d)= %19.12e (dec)\n", K+1, Xverysmall);
    printf("If Underflow is Small Enough It is Not Gradual, But Really Zero.\n\n");

    /* Four invalid operations, each producing a NaN. */
    printf("sqrt(-1) = %e; 0*Infinity = %e; 0.0/0.0 = %e; Infinity/Infinity = %e;\n",
           sqrt(-1.), 0*Xmaxplus, 0.0/0.0, Xmaxplus/Xmaxplus);
    printf("Note the NaN (Not a Number) in These 4 Examples.\n");
    return 0;
} /* end main procedure */
```

• Simple Example C Program Output:
```
Floating Point Arithmetic Examples:

-1 = -1.0000000e+00 (dec); 0 =  0.0000000e+00 (dec); 1 =  1.0000000e+00 (dec);

Xmin = 2^(2-2^7); Xmax = (1-1/2^24)*2^(2^7):
Xmin =  1.175494350822e-38 (dec); Xmax =  3.4028235e+38 (dec)

Xminminus = 2^(1-2^7)*(1+(2^23-1)/2^23); Xmaxplus = 2^(2^7):
Xminminus =  1.175494350822e-38 (dec); Xmaxplus =      +Infinity (dec)
Note Underflow is Gradual due to Subnormal Numbers;
But Overflow Leads to an Infinity.

Xverysmall = 2^(-2^7)=  2.938735877056e-39 (dec)
Gradual Underflow is Smaller Than Underflow, But Not Zero.
Xveryverysmall = 2^(-2^8)=  0.000000000000e+00 (dec)
If Underflow is Small Enough It is Not Gradual, But Really Zero.
sqrt(-1) = NaN; 0*Infinity = NaN; 0.0/0.0 = NaN; Infinity/Infinity = NaN
Note the NaN (Not a Number) in These 4 Examples.
```

Web Source: http://www.math.uic.edu/~hanson/mcs471/FloatingPointRep.html

Email Comments or Questions to hanson@uic.edu