Basic Floating Point Representation

Expert Reference: Prof. William Kahan, Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, EECS, University of California at Berkeley, May 1996. See that paper for more authoritative information. (Note that the notation below is somewhat different, so that it is more consistent with the notation of MCS 471 Numerical Analysis.)

Floating Point Representation According to IEEE 754 Standard:

Table 1: Floating Point Precision Names:

Format Name	Bits	Bytes	C Name	f90 Name
Single	32	4	float	real*4
Double	64	8	double	real*8
Extended*	80+	10+	long double?	real*10+
Quadruple*	128	16	long double???	real*16

* Optional formats, depending on processor and compiler.
Note: Kahan uses "N = p" for the precision of the fraction and "K+1 = q" for the number of bits in the exponent.

Table 2: Floating Point Precision Parameter Specifications:

Format Name	Bits Total	Bits Sign	Bits Exponent (q = Total - p)	Bits Fraction (p-1)	Exponent Bound (q-1)	Base b	Precision p
Single	32	1	8	23	7	2	24
Double	64	1	11	52	10	2	53
Extended*	80+	1	15+	63+	14+	2	64+
Quadruple*	128	1	15	112	14	2	113

* Optional formats, depending on processor and compiler.

Binary Floating Point Representation:

Table 3: Normalized Binary Floating Point Parameter Bounds:

Min[exp]=L	Max[exp]=U	Min[(1+frac)*2^(p-1)]	Max[(1+frac)*2^(p-1)]	Min[frac]	Max[frac]
2-2^(q-1)	2^(q-1)-1	2^(p-1)	2^(p)-1	0	1-2^(1-p)

Table 4: IEEE 754 Normalized Binary Floating Point Storage:

This table shows how each of the 3 fields of the real floating point representation word is coded with integers [s | exp2pq | frac2pq] in the storage for each IEEE 754 Standard number type. Note that the first bit (the "1" in the "(1+frac)") of the "Significand" or "Fraction" is not stored, and the first bit of the "word" is always the sign bit "s", so that "(-1)^(s) = ±1". For "Normal" numbers,

frac2pq = frac*2^(p-1)

is an integer, and

exp2pq = exp + bias,

where

bias = 2^(q-1)-1 = U,

such that

1 = L + bias <= exp2pq = exp + bias <= U + bias = 2*(2^(q-1)-1) = 2^q-2.

IEEE 754 Standards Storage in each Binary Field of the Floating Point Representation Word
for (-1)^(s)*(1+frac)*2^(exp) --> [s | exp2pq | frac2pq]:

Number Type	s = Sign Bit	exp2pq = q Bit Exponent	frac2pq = (p-1) Bits of Fraction
Zero	0 or 1	exp2pq = 0	frac2pq = 0
Normal	0 or 1	exp2pq = exp + 2^(q-1) - 1	0 <= frac2pq = frac*2^(p-1) <= 2^(p-1) - 1
Subnormal	0 or 1	exp2pq = 0	0 < frac2pq <= 2^(p-1) - 1
Infinity	0 or 1	exp2pq = 2^(q)-1 = 11...11 bin	frac2pq = 0
NaN = Not a Number	?	exp2pq = 2^(q)-1 = 11...11 bin	2^(p-2) <= frac2pq = 1.... <= 2^(p-1) - 1
SNaN = Signaling NaN	?	exp2pq = 2^(q)-1 = 11...11 bin	1 <= frac2pq = 0.... <= 2^(p-2) - 1

Note: NaNs and Normals are distinguished by the NaN always having a biased exponent of "2^(q)-1" (all ones in binary: "exp2pq = 11...111") together with a nonzero fraction (a 1 in the leading fraction bit for the quiet NaN), while for normals the unbiased exponent upper bound, "exp < 2^(q-1)", means that the biased exponent satisfies "exp2pq <= 2^(q)-2 = 11...110", which is always less than the NaN biased exponent by at least 1.

IEEE 754 Overflow, Underflow, Gradual Underflow, Machine Epsilon, Significant Decimal Digits:

Underflow and Gradual Underflow: The smallest normal absolute value follows from the lower bound on the unbiased exponent "exp >= (2-2^(q-1))" from Table 3 and the minimum fraction part "frac >= 0", also from Table 3, so that the Under Flow Level (UFL) is
|fl[x]| >= (1+0)*2^(2-2^(q-1)) = 2^(2-2^(q-1)) = 2^L, |x|>0,
and any smaller normal value would be "Underflow". However, underflow for nonzero reals is not flagged until a number is smaller than the minimum subnormal of "2^(3-p-2^(q-1))", given by Kahan, and the number is converted to a zero. The subnormal permits the existence of "Gradual Underflow to Zero".
Overflow: The largest normal absolute value follows from the upper bounds in Table 3 for the unbiased exponent "exp <= (2^(q-1)-1)" and for the fractional part "frac <= (1-2^(1-p))", so that the Over Flow Level (OFL) is
|fl[x]| <= (2-2^(1-p))*2^(2^(q-1)-1) = (1-2^(-p))*2^(2^(q-1)) = (2-2^(1-p))*2^U, |x|>0.
Note that an overflowed result may be stored as an infinity. See the sample output below.

Table 5: Some IEEE 754 Floating Point Binary Storage Single Precision (SP) Examples:

Example	s	exp2pq	frac2pq
Under Flow Level (UFL)	0	00000001	00000000000000000000000
Over Flow Level (OFL)	0	11111110	11111111111111111111111
-0	1	00000000	00000000000000000000000
+1	0	01111111	00000000000000000000000
-1	1	01111111	00000000000000000000000
+2	0	10000000	00000000000000000000000
+3	0	10000000	10000000000000000000000
+4	0	10000001	00000000000000000000000

Table 6: Machine Epsilon: The traditional machine epsilon, the smallest positive number that yields more than one when it is added to one, is officially given by "MacEps = 2^(1-p)", but with rounding it should be "MacEpsRounding = 2^(1-p)/2 = 2^(-p)"; see class lecture notes.

Format Name	Min Normal	Max Normal	Min Subnormal	Machine Epsilon	Sig. Decimal Digit Range
Single	1.175e-38	3.403e+38	1.401e-45	1.192e-07	6-9
Double	2.2e-308	1.8e+308	4.9e-324	2.220e-16	15-17
Extended	<3.4e-4932	>1.189e+4932	<3.6e-4951	<1.084e-19	>18-21
Quadruple	3.4e-4932	1.190e+4932	6.5e-4966	1.926e-34	33-36

Note: See Kahan for how the range of the equivalent significant decimal digits is computed, assuming there is no overflow or underflow. Note that a finite precision binary representation is in general not converted exactly to a finite precision decimal representation, and vice versa.

Floating Point Arithmetic Output:

Simple Example C Program:

#include <stdio.h>
#include <math.h>
#define N 24 /* precision p of the single-precision significand, in bits */
#define K 7  /* exponent bound q-1 for single precision */
int main(void)
{
    float One, MinusOne, Zero, Xmax, Xmin, Xminminus, Xmaxplus, Xverysmall;
    float Two2K, Two2N, Two2Nm1;
    printf("Floating Point Arithmetic Examples:\n\n");
    One = 1.e0;
    MinusOne = -1.e0;
    Zero = 0.e0;
    printf("-1 = %14.7e (dec); 0 = %14.7e (dec); 1 = %14.7e (dec);\n\n",
           MinusOne, Zero, One);
    Two2K = pow(2., K);       /* 2^(q-1) = 128 */
    Two2N = pow(2., N);       /* 2^p */
    Two2Nm1 = pow(2., N - 1); /* 2^(p-1) */
    /* Smallest and largest normal single-precision magnitudes. */
    Xmin = pow(2., 2 - Two2K);
    Xmax = (1.e0 - 1.0e0 / Two2N) * pow(2., Two2K);
    printf("Xmin = 2^(2-2^%d); Xmax = (1-1/2^%d)*2^(2^%d):\n", K, N, K);
    printf("Xmin = %19.12e (dec); Xmax = %14.7e (dec)\n\n", Xmin, Xmax);
    /* A value just below Xmin, and 2^(2^K), which is too large for a float
       and becomes an infinity on IEEE 754 systems when it is stored. */
    Xminminus = pow(2., 1 - Two2K) * (1.e0 + (Two2Nm1 - 1.e0) / Two2Nm1);
    Xmaxplus = pow(2., Two2K);
    printf("Xminminus = 2^(1-2^%d)*(1+(2^%d-1)/2^%d); Xmaxplus = 2^(2^%d):\n",
           K, N - 1, N - 1, K);
    printf("Xminminus = %19.12e (dec); Xmaxplus = %14.7e (dec)\n",
           Xminminus, Xmaxplus);
    printf("Note Underflow is Gradual due to Subnormal Numbers;\n");
    printf("But Overflow Leads to an Infinity.\n\n");
    /* Below Xmin but still representable as a subnormal. */
    Xverysmall = 1.e0 / pow(2., Two2K);
    printf("Xverysmall = 2^(-2^%d)= %19.12e (dec)\n", K, Xverysmall);
    printf("Gradual Underflow is Smaller Than Underflow, But Not Zero.\n");
    /* Below the smallest subnormal, so the stored value is zero. */
    Xverysmall = 1.e0 / pow(2., pow(2, K + 1));
    printf("Xveryverysmall = 2^(-2^%d)= %19.12e (dec)\n", K + 1, Xverysmall);
    printf("If Underflow is Small Enough It is Not Gradual, But Really Zero.\n\n");
    /* Four invalid operations, each of which yields a NaN. */
    printf("sqrt(-1) = %e; 0*Infinity = %e; 0.0/0.0 = %e; Infinity/Infinity = %e;\n",
           sqrt(-1.), 0 * Xmaxplus, 0.0 / 0.0, Xmaxplus / Xmaxplus);
    printf("Note the NaN (Not a Number) in These 4 Examples.\n");
    return 0;
} /* end main procedure */

Simple Example C Program Output:

Floating Point Arithmetic Examples:

-1 = -1.0000000e+00 (dec); 0 =  0.0000000e+00 (dec); 1 =  1.0000000e+00 (dec);

Xmin = 2^(2-2^7); Xmax = (1-1/2^24)*2^(2^7):
Xmin =  1.175494350822e-38 (dec); Xmax =  3.4028235e+38 (dec)

Xminminus = 2^(1-2^7)*(1+(2^23-1)/2^23); Xmaxplus = 2^(2^7):
Xminminus =  1.175494350822e-38 (dec); Xmaxplus =      +Infinity (dec)
Note Underflow is Gradual due to Subnormal Numbers;
But Overflow Leads to an Infinity.

Xverysmall = 2^(-2^7)=  2.938735877056e-39 (dec)
Gradual Underflow is Smaller Than Underflow, But Not Zero.
Xveryverysmall = 2^(-2^8)=  0.000000000000e+00 (dec)
If Underflow is Small Enough It is Not Gradual, But Really Zero.
sqrt(-1) = NaN; 0*Infinity = NaN; 0.0/0.0 = NaN; Infinity/Infinity = NaN
Note the NaN (Not a Number) in These 4 Examples.

Web Source: http://www.math.uic.edu/~hanson/mcs471/FloatingPointRep.html

Email Comments or Questions to hanson@uic.edu
