How does C handle floating-point numbers?
We’ve already learned about storing a single number in one byte (8 bits) – that’s essentially the same as storing an ASCII-table character in a single char variable. Depending on the signed/unsigned definition the range is -128 to 127 or 0 to 255. That’s simple. However, what happens when it comes to floating-point numbers?
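If you want to double-check those ranges on your own machine, here is a quick sketch of mine (nothing float-specific yet – it just prints the standard constants from limits.h):

#include <stdio.h>
#include <limits.h>

int main(void) {
    /* The actual char ranges on this machine, straight from limits.h */
    printf("signed char:   %d to %d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char: 0 to %u\n", (unsigned int)UCHAR_MAX);
    return 0;
}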
Back to the basics – decimal system to binary system
In order to explain everything properly we have to go back to the basics. Let’s take a simple decimal number:
6293
We can read this number and understand its value pretty easily. However, computers are not that smart – the base numbering system they use is binary. So let’s perform a simple conversion. An example says more here than the most sophisticated explanation.
6293 / 2 = 3146,5 // Pay attention to the remainder. As it's not 0, we write 1
Temporary binary value: 1
3146 / 2 = 1573 // This time the remainder is 0, so we assign 0 to the binary output
Temporary binary value: 01
1573 / 2 = 786,5 // The same as in step 1 – we assign 1
Temporary binary value: 101
786 / 2 = 393 // This time 0
Temporary binary value: 0101
393 / 2 = 196,5 // Again 1
Temporary binary value: 10101
196 / 2 = 98 // You get it by now, right?
Temporary binary value: 010101
98 / 2 = 49
Temporary binary value: 0010101
49 / 2 = 24,5
Temporary binary value: 10010101
24 / 2 = 12
Temporary binary value: 010010101
12 / 2 = 6
Temporary binary value: 0010010101
6 / 2 = 3
Temporary binary value: 00010010101
3 / 2 = 1,5
Temporary binary value: 100010010101
1 / 2 = 0,5
Binary value: 1100010010101
Now we would be operating on 0 as the value. However, that’s where the algorithm ends, as it only runs while the quotient is larger than 0. Therefore we’re done – the output of the conversion is 1100010010101. You can check the result in any available online converter, but trust me, it’s right.
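If you’d rather let the machine do the work, here is a minimal sketch of the same division algorithm (my own illustration, not a standard library function):

#include <stdio.h>

/* Divide by 2, collect the remainders, then print them backwards. */
void print_binary(unsigned int n) {
    char digits[64];
    int count = 0;

    if (n == 0) {               /* special case: 0 is just "0" */
        putchar('0');
        return;
    }
    while (n > 0) {             /* the algorithm ends when the quotient hits 0 */
        digits[count++] = (char)('0' + (n % 2));  /* remainder is the next bit */
        n /= 2;
    }
    while (count > 0) {         /* remainders come out last-bit-first */
        putchar(digits[--count]);
    }
}

int main(void) {
    print_binary(6293);         /* prints 1100010010101, just like above */
    putchar('\n');
    return 0;
}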
Of course there is no need for you to do such conversions by hand every day. I’ve shown it only to set up the next step in our journey – representing fractions in binary.
Gimme some fraction
The algorithm for converting the fractional part of a number is a little different from the one for the integer part. Let’s take a simple fraction as an example (I use a comma instead of a period to separate the integer part from the fraction – it’s the common convention in Europe):
0,75
We start by writing down 0, (since we are storing a fraction, everything goes after the separator). Then we multiply our value by two.
0,75 * 2 = 1,5 // If the result is >= 1 then we write 1
Temporary output: 0,1
As the result above was greater than or equal to 1, we subtract 1 from it, and what is left becomes our new value.
0,5 * 2 = 1 // It's one, so we just assign 1
Output: 0,11
After subtracting the 1 we just wrote down we are left with 0, and that ends the algorithm. The final result is 0,11.
Ok, so far so good. I bet, however, that you’ve seen values that are more complicated. Let’s try a new one.
0,95
Again we start by writing 0, and then multiply our value by two.
0,95 * 2 = 1,9 // If the result is >= 1 then we write 1
Temporary output: 0,1
0,9 * 2 = 1,8 // Again >= 1, so we write 1 and continue with 0,8
Temporary output: 0,11
0,8 * 2 = 1,6 // You can see where this is going
Temporary output: 0,111
0,6 * 2 = 1,2 // One more 1; we continue with 0,2
Temporary output: 0,1111
0,2 * 2 = 0,4 // The result is < 1, so we write 0 and keep the whole result
Temporary output: 0,11110
0,4 * 2 = 0,8 // Again < 1, so another 0 – and notice that 0,8 is back
Temporary output: 0,111100
As you can see we have returned to 0,8, so from here the digits 1100 will repeat forever. That means that our fraction does not have a finite binary representation. The result can be written in the form 0,11(1100), which expands to 0,111100110011… to infinity 😉
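Here is the same multiplication algorithm as a C sketch (again my own illustration; the digit limit exists precisely because of fractions like 0,95 that never terminate):

#include <stdio.h>

/* Multiply the fraction by 2; the integer part becomes the next binary digit.
   max_digits guards against fractions with no finite representation. */
void print_fraction_binary(double fraction, int max_digits) {
    int i;
    printf("0,");
    for (i = 0; i < max_digits && fraction > 0.0; i++) {
        fraction *= 2.0;
        if (fraction >= 1.0) {
            putchar('1');
            fraction -= 1.0;    /* subtract the 1 we just wrote down */
        } else {
            putchar('0');
        }
    }
    putchar('\n');
}

int main(void) {
    print_fraction_binary(0.75, 16);    /* terminates after two digits: 0,11 */
    print_fraction_binary(0.95, 16);    /* never terminates: 0,1111001100110011... */
    return 0;
}

For 0,75 the loop stops on its own after two digits; for 0,95 it would happily run forever, which is why we cap it.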
Before we go further – scientific notation to the rescue
Scientists often need very high precision in the numbers they use. What is more, they usually need ranges that regular people will never see (not even Warren Buffett counting his billions of dollars). To make such numbers easier to write down without taking too much space, they came up with a method called scientific notation.
The formula is simple:
x = s * M * 10^E
x is the resulting number in scientific notation
s is the sign of the number (minus or plus)
M is the mantissa, which is a number in the range from 1 (inclusive) to 10 (exclusive)
E is the exponent
So if we take the above example with the number 6293,75 (which is in fact small, but let’s stick to a familiar example) we can write it down like this:
6,29375 * 10^3
or
6,29375E3
If we wanted to operate on a fraction and write it using this notation, we could do it like this (for the number 0,00482):
4,82 * 10^-3
or
4,82E-3
To see this in motion you can run the program below:
#include <stdio.h>

int main(void) {
    printf("Big number: %e \n", 6293.75);
    printf("Small number: %e \n", 0.00482);
    return 0;
}
The output on my machine is (the number of exponent digits may differ between compilers):
Big number: 6.293750e+003
Small number: 4.820000e-003
Why do I care?
That’s an excellent question. The answer is simple – nowadays there is a standard that every computer (at least according to Wikipedia) implements: IEEE 754. This standard describes not only the way processors should store the binary representation of floating-point numbers, but also how the operations on them should be performed. And it is essentially scientific notation, just in base 2. Below is an image of how the standard defines the storage of single precision numbers (we will get to what that means). I’ve taken it from the Wikipedia article about IEEE 754, which I recommend reading if you want to know even more.
As you can see, single precision storage uses 32 bits. The first bit contains the sign (1 for negative, 0 for positive), and the next 8 bits are used to store the exponent. Remember the discussion about how many values can be stored in a single C char variable? Yes, you’re right, it depends on whether it’s signed or unsigned. The exponent can also be negative, remember? So it would be wise to allow negative values here – the stored exponent is therefore biased by 127, giving an effective range of -126 to 127 for normal numbers (the two extreme stored values are reserved for special cases, as we will see below). The last part is the mantissa, which for single precision uses 23 bits of storage. Double precision has the same logic but with bigger numbers: the sign occupies the same one bit (surprise, surprise), the exponent uses 11 bits (bias 1023, effective range -1022 to 1023), and the mantissa gets 52 bits.
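If you’d like to see those fields with your own eyes, here is a minimal sketch that pulls them out of a float (the bit positions come from the standard; the code itself is just my illustration):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 6293.75f;               /* our familiar example number */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the 32 bits portably */

    unsigned int sign     = (bits >> 31) & 0x1;    /* 1 bit */
    unsigned int exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    unsigned int mantissa = bits & 0x7FFFFF;       /* 23 bits */

    printf("sign:     %u\n", sign);
    printf("exponent: %u (unbiased: %d)\n", exponent, (int)exponent - 127);
    printf("mantissa: 0x%06X\n", mantissa);
    return 0;
}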
Using the bit counts above it is easy to figure out the ranges of both formats. For single precision it is -3,4 x 10^38 … 3,4 x 10^38, and for double precision it is -1,8 x 10^308 … 1,8 x 10^308.
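You don’t have to take my word for it – the exact limits on your platform are available as constants in float.h:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* FLT_MAX and DBL_MAX are the largest finite values of each type */
    printf("float:  %e ... %e\n", -FLT_MAX, FLT_MAX);
    printf("double: %e ... %e\n", -DBL_MAX, DBL_MAX);
    return 0;
}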
There are also situations when we want to store not a regular number, but some special information. That is the case for zero (which can be a negative zero or a positive zero), infinity, or the information that the stored value is NaN (not a number). With all the bits at our disposal it is pretty easy:
| Value | Value as string | Sign | Exponent | Mantissa |
|---|---|---|---|---|
| subnormal numbers | | 0 or 1 | 00000000 | different than 0 |
| zero | 0 | 0 | 00000000 | 00000000000000000000000 |
| negative zero | -0 | 1 | 00000000 | 00000000000000000000000 |
| infinity | +∞ | 0 | 11111111 | 00000000000000000000000 |
| negative infinity | -∞ | 1 | 11111111 | 00000000000000000000000 |
| not a number | NaN | 0 or 1 | 11111111 | different than 0 |
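All of these special values are easy to produce and inspect in C. A short sketch using the standard macros from math.h:

#include <stdio.h>
#include <math.h>

int main(void) {
    double pos_zero = 0.0;
    double neg_zero = -0.0;
    double inf      = INFINITY;     /* macro from math.h (C99) */
    double nan_val  = NAN;          /* macro from math.h (C99) */

    printf("%g %g %g %g\n", pos_zero, neg_zero, inf, nan_val);
    printf("isinf(inf):     %d\n", isinf(inf));
    printf("isnan(nan_val): %d\n", isnan(nan_val));
    /* NaN is the only value that is not equal to itself */
    printf("nan_val == nan_val: %d\n", nan_val == nan_val);
    return 0;
}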
Summing up
With all that knowledge about storing floating-point numbers, rounding, precision, etc. it should be obvious why even simple arithmetic can sometimes result in strange behaviour. With limited precision it is impossible to achieve perfect accuracy; however, it seems that the world is doing just fine with the current situation.
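The classic demonstration: neither 0,1 nor 0,2 has a finite binary representation (just like 0,95 earlier), so their sum is not exactly 0,3:

#include <stdio.h>

int main(void) {
    double a = 0.1 + 0.2;       /* both operands are already rounded */
    printf("0.1 + 0.2 = %.17f\n", a);
    printf("equal to 0.3? %s\n", (a == 0.3) ? "yes" : "no");
    return 0;
}

On a typical machine this prints 0.30000000000000004 and answers no.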