### Overview

Floating point is a way of approximating real numbers in digital systems. Floating-point numbers are represented in a manner similar to scientific notation, where a number is written as a normalized significand and a multiplier:

$c \times b^{e}$ (scientific notation)

c – normalized significand, which contains the number's digits (the absolute value of c is at least 1 and less than 10, e.g. 1.2, 5, 7, 8.376).

The multiplier consists of:

b – base (10 is used as base in scientific notation)

e – exponent (integer value), that says where the decimal (or binary) point is placed relative to the beginning of the significand.

| Decimal notation | Scientific notation |
|---|---|
| 400 | $4 \times 10^{2}$ |
| 0.2 | $2 \times 10^{-1}$ |
| 24000 | $2.4 \times 10^{4}$ |

Advantages of floating point over fixed point:

• dynamic range – represent numbers at wildly different magnitudes (determined by the size of the exponent)
• precision – same relative accuracy at all magnitudes (determined by the size of the significand)
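Both properties can be observed directly. The following is a minimal sketch, assuming Python 3.9 or later (for `math.ulp`) and a platform where Python's `float` is an IEEE 754 64-bit binary number; `relative_spacing` is a helper name introduced here for illustration.

```python
import math
import sys

# Dynamic range: largest and smallest positive normal 64-bit floats.
print(sys.float_info.max)  # about 1.8e308
print(sys.float_info.min)  # about 2.2e-308


def relative_spacing(x: float) -> float:
    """Gap between x and the next representable float, relative to x."""
    return math.ulp(x) / abs(x)


# Precision: the relative gap stays within a factor of two
# (between 2**-53 and 2**-52) across very different magnitudes.
for x in (1e-10, 1.0, 1e10, 1e100):
    print(x, relative_spacing(x))
```

A fixed-point format with the same number of bits would have a constant absolute gap instead, so its relative accuracy would degrade for small values.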

### IEEE Standard

The format used to store floating-point numbers in memory has been standardized by the Institute of Electrical and Electronics Engineers as the IEEE 754 standard.

The standard defines five formats:

• three binary floating-point basic formats (encoded with 32, 64 or 128 bits), in which the exponent base is 2
• two decimal floating-point basic formats (encoded with 64 or 128 bits), in which the exponent base is 10

IEEE floating point numbers have three basic components:

• the sign s
• the exponent e
• the significand c

A floating-point number is stored in the following way:

Fig. 1 IEEE 754 floating point format

By arranging the fields so that the sign bit is in the most significant bit position, the biased exponent in the middle, and the significand in the least significant bits, the stored bit patterns are ordered the same way whether they are interpreted as floating-point values or as sign-magnitude integers. This allows high-speed comparisons of floating-point numbers using fixed-point hardware.
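The ordering property can be checked with a short sketch. It uses Python's `struct` module to obtain the 32-bit pattern of a single-precision value; `float32_bits` is a helper name introduced here, and the sample values are arbitrary. The check is limited to non-negative values, since for negative values the sign bit has to be handled separately.

```python
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x stored as an IEEE 754 single, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# Non-negative sample values in increasing order.
values = [0.0, 1e-3, 0.5, 1.0, 1.5, 2.0, 1000.0, 3.0e38]
bit_patterns = [float32_bits(v) for v in values]

# The integer bit patterns come out in the same increasing order.
same_order = bit_patterns == sorted(bit_patterns)
print(same_order)
print([hex(b) for b in bit_patterns])
```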

The numerical value of a floating number can be expressed with the following equation:

$(-1)^{s} \times c \times b^{e}$

The exponent base b is either 10 or 2, depending on the floating-point format that is used.

Example:

If the base is 10, the sign is 1 (negative number), the significand is 12345, and the exponent is −4, then the value of the number is $(-1)^{1} \times 12345 \times 10^{-4} = -1 \times 12345 \times 0.0001 = -1.2345$.
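The value equation can be evaluated exactly with rational arithmetic. This is a minimal sketch; the function name `float_value` and its parameter names are chosen here for illustration and mirror the symbols s, c, b and e.

```python
from fractions import Fraction


def float_value(sign: int, significand: int, base: int, exponent: int) -> Fraction:
    """Evaluate (-1)**sign * significand * base**exponent without rounding."""
    return (-1) ** sign * significand * Fraction(base) ** exponent


# The worked example: sign 1, significand 12345, base 10, exponent -4.
value = float_value(1, 12345, 10, -4)
print(value)         # -2469/2000
print(float(value))  # -1.2345
```

`Fraction` is used instead of `float` so that the decimal example is not itself subject to binary rounding.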

#### Sign Bit

The sign bit determines the sign of the number:

• Value 0 is for a positive number
• Value 1 is for a negative number
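As a small illustration, the sign bit of a single-precision value can be read from the top bit of its 32-bit pattern. The helper name `sign_bit` is introduced here and is not part of any library.

```python
import struct


def sign_bit(x: float) -> int:
    """Return the sign bit (0 or 1) of x stored as an IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31  # the sign occupies the most significant of the 32 bits


print(sign_bit(2.5))   # 0
print(sign_bit(-2.5))  # 1
```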

#### Exponent

The exponent can represent both positive and negative values, but two's complement representation is not used, since it would make comparison harder when the stored floating-point number is interpreted as an integer. Instead, a bias value is added to the actual exponent, so that the stored exponent is always non-negative. The bias value depends on the size of the exponent field.

Example:

For IEEE single-precision floats, the bias value is 127. A stored value of 150 means that the actual exponent is 23 (150 – 127). If the stored value is 100, the actual exponent is -27 (100 – 127). When the stored value is 127, the actual exponent is zero.
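The three cases above can be reproduced with a short sketch that extracts the 8-bit exponent field of a single-precision value and subtracts the bias. The names `BIAS_SINGLE`, `stored_exponent` and `actual_exponent` are introduced here for illustration, and the sketch only covers normal numbers (the all-zeros and all-ones exponent fields are reserved for special values).

```python
import struct

BIAS_SINGLE = 127  # bias for the 8-bit exponent field of a single


def stored_exponent(x: float) -> int:
    """Return the 8-bit biased exponent field of x stored as a single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF  # skip the 23 significand bits, drop the sign


def actual_exponent(x: float) -> int:
    """Remove the bias to recover the actual exponent of a normal number."""
    return stored_exponent(x) - BIAS_SINGLE


print(stored_exponent(2.0**23), actual_exponent(2.0**23))    # 150 23
print(stored_exponent(2.0**-27), actual_exponent(2.0**-27))  # 100 -27
print(stored_exponent(1.0), actual_exponent(1.0))            # 127 0
```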

#### Significand

The significand (sometimes referred to as the mantissa or fraction) represents the precision bits of the number.

As in scientific notation, floating-point numbers are normalized, which in short means that the decimal (or binary) point is placed after the first non-zero digit of the number.

Example of normalization for decimal numbers:

The number 60 is normalized as:

$6 \times 10^{1}$

The number 0.54 is normalized as:

$5.4 \times 10^{-1}$
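Decimal normalization can be sketched with Python's `decimal` module, which keeps decimal digits exact. The function name `normalize_decimal` is introduced here for illustration; it returns the pair (c, e) such that the input equals c × 10^e with the absolute value of c at least 1 and less than 10. Python 3.9 or later is assumed for the `tuple[...]` annotation.

```python
from decimal import Decimal


def normalize_decimal(text: str) -> tuple[Decimal, int]:
    """Return (c, e) with 1 <= |c| < 10 and value == c * 10**e."""
    value = Decimal(text)
    if value == 0:
        raise ValueError("zero has no normalized form")
    exponent = value.adjusted()             # position of the leading digit
    significand = value.scaleb(-exponent)   # move the point after that digit
    return significand, exponent


print(normalize_decimal("60"))     # significand 6.0, exponent 1
print(normalize_decimal("0.54"))   # significand 5.4, exponent -1
print(normalize_decimal("24000"))  # significand 2.4000, exponent 4
```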

When a binary number is normalized, the same rule is applied (the binary point is set after the first non-zero digit in the number). Since in binary form we have only ones and zeroes, this means that the value before the binary point is always 1. This is called an implicit leading bit and it does not have to be stored.
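To make the implicit leading bit concrete, the sketch below splits a single-precision value into its three stored fields and then rebuilds the value by putting the leading 1 back in front of the 23 stored fraction bits. The names `decode_single` and `rebuild_normal` are introduced here for illustration, and the reconstruction only applies to normal numbers (stored exponent between 1 and 254).

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Split x, stored as an IEEE 754 single, into (sign, stored exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    stored_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF  # the 23 explicitly stored significand bits
    return sign, stored_exponent, fraction


def rebuild_normal(sign: int, stored_exponent: int, fraction: int) -> float:
    """Rebuild a normal single from its fields, restoring the implicit leading 1."""
    significand = 1 + fraction / 2**23  # 1.fraction in binary
    return (-1) ** sign * significand * 2.0 ** (stored_exponent - 127)


# 6.5 is 1.101 (binary) times 2**2; only the bits after the point are stored.
fields = decode_single(6.5)
print(fields)                   # (0, 129, 5242880)
print(rebuild_normal(*fields))  # 6.5
```

Because the leading 1 is not stored, the 23-bit fraction field effectively provides 24 bits of significand precision.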