Expert Reference: Prof. William Kahan, Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, EECS, University of California at Berkeley, May 1996. See this postscript paper for more authoritative information. (Note that the notation below is somewhat different, so that it is more consistent with the notation of MCS 471 Numerical Analysis.)
Format Name | Bits | Bytes | C Name | f90 Name |
Single | 32 | 4 | float | real*4 |
Double | 64 | 8 | double | real*8 |
Extended* | 80+ | 10+ | long double? | real*10+ |
Quadruple* | 128 | 16 | long double??? | real*16 |
* Optional formats, depending on processor and compiler.
Note: Kahan uses "N = p" for the precision of the fraction and "K+1 = q" for the precision of the exponent.
Format Name | Bits Total | Bits Sign | Bits Exponent (q = Total - p) | Bits Fraction (p-1) | Exponent Bound (q-1) | Base b | Precision p |
Single | 32 | 1 | 8 | 23 | 7 | 2 | 24 |
Double | 64 | 1 | 11 | 52 | 10 | 2 | 53 |
Extended* | 80+ | 1 | 15+ | 63+ | 14+ | 2 | 64+ |
Quadruple* | 128 | 1 | 15 | 112 | 14 | 2 | 113 |
* Optional formats, depending on processor and compiler.
Let x be a real floating point number, represented by
fl[x] = ± (1+frac) * 2^(exp) ,
where "exp" is the binary (base 2) exponent and "frac" is the non-negative binary fraction, with "sign" ± and normalized so that
0 <= frac < 1,
but subject to precision-dependent bounds as well. The equation "1+frac = 1.frac" gives an alternate representation of the normalized fraction part. The symbol "^" denotes power or exponentiation. Kahan uses the notation "k = exp" for the unbiased exponent and "|n| = (1+frac)*2^(p-1)" for the unsigned integer significand.
Min[exp]=L | Max[exp]=U | Min[(1+frac)*2^(p-1)] | Max[(1+frac)*2^(p-1)] | Min[frac] | Max[frac] | ||
2-2^(q-1) | 2^(q-1)-1 | 2^(p-1) | 2^(p)-1 | 0 | 1-2^(1-p) | ||
This table shows how each of the 3 fields of the real floating point representation word is coded with integers [s | exp2pq | frac2pq] in the storage for each IEEE 754 Standard number type. Note that the first bit (the "1" in the "(1+frac)") of the "Significand" or "Fraction" is not stored and the first bit of the "word" is always the sign bit "s", so that "(-1)^(s) = ±1". For "Normal" numbers,
frac2pq = frac*2^(p-1)
is an integer, and
exp2pq = exp + bias,
where
bias = 2^(q-1)-1 = U,
such that
1 = L+bias <= exp2pq = exp + bias <= U + bias = 2*(2^(q-1)-1) = 2^q-2.
Number Type | s = Sign Bit | exp2pq = q Bit Exponent | frac2pq = (p-1) Bits of Fraction |
Zero | 0 or 1 | exp2pq = 0 | frac2pq = 0 |
Normal | 0 or 1 | exp2pq = exp + 2^(q-1) - 1 | 0 <= frac2pq = frac*2^(p-1) < 2^(p-1) |
Subnormal | 0 or 1 | exp2pq = 0 | 0 < frac2pq <= 2^(p-1) - 1 |
Infinity | 0 or 1 | exp2pq = 2^(q)-1 = 11...11 bin | frac2pq = 0 |
NaN = Not a Number | ? | exp2pq = 2^(q)-1 = 11...11 bin | 2^(p-2) <= frac2pq = 1.... <= 2^(p-1) - 1 |
SNaN = Signaling NaN | ? | exp2pq = 2^(q)-1 = 11...11 bin | 1 <= frac2pq = 0.... <= 2^(p-2) - 1 |
Note: NaNs and Normals are distinguished by the NaN always having a biased exponent of "2^(q)-1" (all ones in binary: "exp2pq = 11...111") and a nonzero fraction (with a 1 in the leading fraction bit unless it is a signaling NaN), while for normals the unbiased exponent upper bound, "exp <= 2^(q-1)-1", means that the biased exponent "exp2pq <= 2^(q)-2 = 11...110", which is always less than the NaN biased exponent by at least 1.
The smallest positive normal number is "2^(L) = 2^(2-2^(q-1))", and any smaller would be "Underflow". However, underflow for nonzero reals is not flagged until a number is smaller than the minimum subnormal of "2^(3-p-2^(q-1))", given by Kahan, and the number is converted to a zero. The subnormal permits the existence of "Gradual Underflow to Zero".
Note that a result that overflows is recorded as an infinity. See the sample output below.
Example | s | exp2pq | frac2pq | |
Under Flow Level (UFL) | 0 | 00000001 | 00000000000000000000000 | |
Over Flow Level (OFL) | 0 | 11111110 | 11111111111111111111111 | |
-0 | 1 | 00000000 | 00000000000000000000000 | |
+1 | 0 | 01111111 | 00000000000000000000000 | |
-1 | 1 | 01111111 | 00000000000000000000000 | |
+2 | 0 | 10000000 | 00000000000000000000000 | |
+3 | 0 | 10000000 | 10000000000000000000000 | |
+4 | 0 | 10000001 | 00000000000000000000000 |
Format Name | Min Normal | Max Normal | Min Subnormal | Machine Epsilon | Sig. Decimal Digit Range |
Single | 1.175e-38 | 3.403e+38 | 1.401e-45 | 1.192e-07 | 6-9 |
Double | 2.2e-308 | 1.8e+308 | 4.9e-324 | 2.220e-16 | 15-17 |
Extended | <=3.4e-4932 | >=1.2e+4932 | <=3.6e-4951 | <=1.084e-19 | >= 18-21 |
Quadruple | 3.4e-4932 | 1.2e+4932 | 6.5e-4966 | 1.926e-34 | 33-36 |
Note: See Kahan for how the range of equivalent significant decimal digits is computed, assuming there is no overflow or underflow. Note that a finite precision binary representation is not converted exactly to a finite precision decimal representation, and vice versa.
#include <stdio.h>
#include <math.h>

#define N 24   /* precision p of IEEE Single */
#define K 7    /* exponent bound q-1 of IEEE Single */

int main(void)
{
    int Two2TwoK;
    float One, MinusOne, Zero, Xmax, Xmin, Xminminus, Xmaxplus, Xverysmall;
    float Two2K, Two2N, Two2Nm1;

    printf("Floating Point Arithmetic Examples:\n\n");
    One = 1.e0; MinusOne = -1.e0; Zero = 0.e0;
    printf("-1 = %14.7e (dec); 0 = %14.7e (dec); 1 = %14.7e (dec);\n\n",
           MinusOne, Zero, One);

    Two2K = pow(2., K);
    Two2TwoK = pow(2, 2*K);
    Two2N = pow(2, N);
    Two2Nm1 = pow(2, N-1);

    /* Smallest and largest normal numbers. */
    Xmin = pow(2., 2-Two2K);
    Xmax = (1.e0 - 1.0e0/Two2N)*pow(2., Two2K);
    printf("Xmin = 2^(2-2^%d); Xmax = (1-1/2^%d)*2^(2^%d):\n", K, N, K);
    printf("Xmin = %19.12e (dec); Xmax = %14.7e (dec)\n\n", Xmin, Xmax);

    /* Just below Xmin, and beyond Xmax. The exponent 2^(2K) is used so that
       the power overflows even in double before it is stored as a float. */
    Xminminus = pow(2., 1-Two2K)*(1.e0 + (Two2Nm1 - 1.e0)/Two2Nm1);
    Xmaxplus = pow(2., Two2TwoK);
    printf("Xminminus = 2^(1-2^%d)*(1+(2^%d-1)/2^%d); Xmaxplus = 2^(2^%d):\n",
           K, N-1, N-1, K);
    printf("Xminminus = %19.12e (dec); Xmaxplus = %14.7e (dec)\n",
           Xminminus, Xmaxplus);
    printf("Note Underflow is Gradual due to Subnormal Numbers;\n");
    printf("But Overflow Leads to an Infinity.\n\n");

    /* A subnormal value, then a value too small even for a subnormal. */
    Xverysmall = 1.e0/pow(2., Two2K);
    printf("Xverysmall = 2^(-2^%d)= %19.12e (dec)\n", K, Xverysmall);
    printf("Gradual Underflow is Smaller Than Underflow, But Not Zero.\n");
    Xverysmall = 1.e0/pow(2., pow(2, K+1));
    printf("Xveryverysmall = 2^(-2^%d)= %19.12e (dec)\n", K+1, Xverysmall);
    printf("If Underflow is Small Enough It is Not Gradual, But Really Zero.\n\n");

    /* Four invalid operations that each produce a NaN. */
    printf("sqrt(-1) = %e; 0*Infinity = %e; 0.0/0.0 = %e; Infinity/Infinity = %e;\n",
           sqrt(-1.), 0*Xmaxplus, 0.0/0.0, Xmaxplus/Xmaxplus);
    printf("Note the NaN (Not a Number) in These 4 Examples.\n");
    return 0;
} /* end main procedure */
Floating Point Arithmetic Examples:

-1 = -1.0000000e+00 (dec); 0 = 0.0000000e+00 (dec); 1 = 1.0000000e+00 (dec);

Xmin = 2^(2-2^7); Xmax = (1-1/2^24)*2^(2^7):
Xmin = 1.175494350822e-38 (dec); Xmax = 3.4028235e+38 (dec)

Xminminus = 2^(1-2^7)*(1+(2^23-1)/2^23); Xmaxplus = 2^(2^7):
Xminminus = 1.175494350822e-38 (dec); Xmaxplus = +Infinity (dec)
Note Underflow is Gradual due to Subnormal Numbers;
But Overflow Leads to an Infinity.

Xverysmall = 2^(-2^7)= 2.938735877056e-39 (dec)
Gradual Underflow is Smaller Than Underflow, But Not Zero.
Xveryverysmall = 2^(-2^8)= 0.000000000000e+00 (dec)
If Underflow is Small Enough It is Not Gradual, But Really Zero.

sqrt(-1) = NaN; 0*Infinity = NaN; 0.0/0.0 = NaN; Infinity/Infinity = NaN
Note the NaN (Not a Number) in These 4 Examples.
Email Comments or Questions to hanson@uic.edu