# How does C handle floating-point numbers

We’ve already learned about storing a single number in one byte (**8 bits**): it works just like storing an ASCII character in a single **char** variable. Depending on whether the type is *signed* or *unsigned*, the range is typically *-128* to *127* or *0* to *255*. That’s simple. However, what happens when it comes to floating-point numbers?

### Back to the basics – decimal system to binary system

In order to explain everything properly we have to go back to the basics. Let’s take a simple decimal number:

`6293`

We can read this number and understand its value pretty easily. Computers, however, work in the binary system, so the number has to be converted. An example here says more than the most sophisticated explanation: we keep dividing by two and write down a bit for the remainder each time.

`6293 / 2 = 3146,5 // Pay attention to the remainder. It is not 0, so in binary we write 1`

Temporary binary value: *1*

`3146 / 2 = 1573 // This time the remainder is 0, so we put 0 in front of the binary output`

Temporary binary value: *01*

`1573 / 2 = 786,5 // Same as in step 1, we put 1 in front`

Temporary binary value: *101*

`786 / 2 = 393 // This time 0`

Temporary binary value: *0101*

`393 / 2 = 196,5 // Again 1`

Temporary binary value: *10101*

`196 / 2 = 98 // You get it, right?`

Temporary binary value: *010101*

`98 / 2 = 49 `

Temporary binary value: *0010101*

`49 / 2 = 24,5 `

Temporary binary value: *10010101*

`24 / 2 = 12`

Temporary binary value: *010010101*

`12 / 2 = 6`

Temporary binary value: *0010010101*

`6 / 2 = 3`

Temporary binary value: *00010010101*

`3 / 2 = 1,5`

Temporary binary value: *100010010101*

`1 / 2 = 0,5 `

Binary value: *1100010010101*

The next value to divide would be *0*, and that is where the algorithm ends: we only keep going while the number is larger than *0*. Each new bit was written to the left of the previous ones, so the output of the conversion is *1100010010101*. You can check the result in any online converter, but trust me, it’s right.

Of course there is no need for you to make such conversions every day. I’ve shown it only to prepare the next step in our journey: representing fractions in binary.

### Gimme some fraction

The algorithm for converting the fractional part of a number is a little different from the one for the integer part. Let’s take a simple fraction as an example (I use a comma instead of a period as the decimal separator, which is common in Europe):

`0,75`

First we write *0,* (we are storing a fraction, so the separator has to be there). Then we multiply our value by two.

`0,75 * 2 = 1,5 // If the result is >= 1 then we write 1`

Temporary output: *0,1*

As the result above was bigger than 1, we subtract 1 from it, and what is left (*0,5*) becomes our new value.

`0,5 * 2 = 1 // It's one, so we just assign 1`

Output: *0,11*

After subtracting *1* the remaining value is *0*, and that ends the algorithm. The final result is *0,11*.

OK, so far so good. I bet, however, that you’ve seen more complicated values that do not end so neatly. Let’s try a new one.

`0,95`

First we write *0,* (we are storing a fraction, so the separator has to be there). Then we multiply our value by two.

`0,95 * 2 = 1,9 // If the result is >= 1 then we write 1`

Temporary output: *0,1*

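`0,9 * 2 = 1,8 // Bigger than 1 again, so we write 1 and continue with 0,8`

Temporary output: *0,11*

`0,8 * 2 = 1,6 // You know where it's going: write 1, continue with 0,6`

Temporary output: *0,111*

`0,6 * 2 = 1,2 // Write 1, continue with 0,2`

Temporary output: *0,1111*

`0,2 * 2 = 0,4 // Smaller than 1, so this time we write 0`

Temporary output: *0,11110*

`0,4 * 2 = 0,8 // Again 0`

Temporary output: *0,111100*

As you can see, we are back at *0,8*, a value we have already processed, so the digits *1100* will repeat forever. That means our fraction does not have a finite binary representation. The result is written as *0,11(1100)*, which expands to e.g. *0,11110011001100…* to infinity 😉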

### Before we go further – scientific notation to the rescue

Scientists often need very high precision in the numbers they use. What is more, they usually need ranges that regular people will never see (even Warren Buffett counting his billions of dollars). To make such numbers easier to write, they came up with a way of writing them down without taking too much space. That method is called **scientific notation**.

The formula is simple:

`x = s * M * 10^E`

**x** is the number written in scientific notation

**s** is the sign of the number (minus or plus)

**M** is the **mantissa**, which is a number in the range from *1* (inclusive) to *10* (exclusive)

**E** is the **exponent**

So if we take the number *6293,75*, built from the examples above (it is small, in fact, but let’s stick to a familiar example), we can write it down like this:

`6,29375 * 10^3`

or

`6,29375E3`

If we wanted to write a small fraction using this notation, we could do it like this (for the number *0,00482*):

`4,82 * 10^-3`

or

`4,82E-3`

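To see this in action you can run the program below:

    #include <stdio.h>

    int main(void)
    {
        /* %e prints a double in scientific notation */
        printf("Big number: %e \n", 6293.75);
        printf("Small number: %e \n", 0.00482);
        return 0;
    }

The output is (older Microsoft C runtimes print a three-digit exponent as shown here; most other implementations print `e+03` and `e-03`):

    Big number: 6.293750e+003
    Small number: 4.820000e-003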

### Why do I care?

That’s an excellent question. The answer is simple: nowadays there is a standard that (at least according to Wikipedia) virtually every computer implements, *IEEE 754*. This standard describes not only how processors should store the binary representation of floating-point numbers, but also how operations on them should be performed. Its storage format is exactly scientific notation, only in base 2 instead of base 10. Below there is an image of how the standard defines storing numbers (**single precision** ones; we will get to that). I’ve taken it from the Wikipedia article about IEEE 754, which I recommend reading if you want to know even more.

As you can see, **single precision** storage uses *32 bits*. The first bit contains the sign (*1* for negative, *0* for positive) and the next *8 bits* store the **exponent**. Remember the discussion about how many values fit in a single **C char** variable? Yes, you’re right, that depends on whether it’s signed or unsigned. The **exponent** can also be negative, remember? IEEE 754 handles this not with a separate sign bit but with a *bias*: the stored value is the real exponent plus *127*, and the stored values *0* and *255* are reserved for special cases, so ordinary numbers have exponents from *-126* to *127*. The last part is the **mantissa**, which for **single precision** takes *23 bits*; because a normalised binary mantissa always starts with *1*, that leading bit is not stored. **Double precision** follows the same logic with bigger numbers. The sign occupies the same one bit (surprise, surprise), the **exponent** uses *11 bits* (bias *1023*, exponents from *-1022* to *1023*) and the **mantissa** gets *52 bits*.

Using these numbers it is easy to figure out the ranges of both formats. For **single precision** it is about *-3,4 x 10^38 … 3,4 x 10^38*, and for **double precision** about *-1,8 x 10^308 … 1,8 x 10^308*.

There are also situations when we want to store not a number itself, but some kind of information about it. That is the case for zero (which can be **negative zero** or **positive zero**), for infinity, and for the information that the stored value is **NaN** (not a number). With all the bits at our disposal it is pretty easy.

Value | Value as string | Sign | Exponent | Mantissa |
---|---|---|---|---|
subnormal numbers | | 0 or 1 | 00000000 | different from 0 |
zero | 0 | 0 | 00000000 | 00000000000000000000000 |
negative zero | -0 | 1 | 00000000 | 00000000000000000000000 |
infinity | +∞ | 0 | 11111111 | 00000000000000000000000 |
negative infinity | -∞ | 1 | 11111111 | 00000000000000000000000 |
not a number | NaN | 0 or 1 | 11111111 | different from 0 |

### Summing up

With all that knowledge about storing floating-point numbers, rounding, precision, etc. it should be obvious why even simple arithmetic can sometimes produce strange results. With limited precision it is impossible to achieve perfect accuracy; however, it seems that the world is doing just fine with the current situation.
