Bytes, mappings, encodings and all that jazz
In modern web-based programming, the problem of encodings and support for different languages seems like a thing of the past. I still remember how I struggled at the very beginning of my programming career to get Polish letters to display properly in HTML. However, when I started the BareMetalDev road I decided to learn more. So here it comes.
Beginning of time – ASCII
The basic unit of memory in computer science is a bit. As You know, it holds one of two values – 1 or 0. By itself it does not tell us much. However, when combined with a couple of its friends, things get interesting. The best-known combination is 8 bits. Using 8 bits we can store numbers from -128 to 127 (when the value is signed) or from 0 to 255 (when it is unsigned). To understand this, remember that with signed numbers one bit is used to denote whether the value is negative or positive. Therefore we have only 7 bits left for the magnitude. With 7 bits and 2 possible values per bit the number of combinations is 2 to the power of 7, which gives us 128 – and with the sign handled separately (two's complement) that translates into the range -128 to 127. So far so good. If we use an unsigned value all 8 bits carry the magnitude, so it's 2 to the power of 8, which gives us 256 combinations – the range 0 to 255.
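If You want to check those limits on your own machine, a minimal C sketch like the one below (assuming an ordinary platform where char is 8 bits wide) prints them straight from limits.h:

```c
/* A minimal sketch: printing the ranges of signed and unsigned 8-bit
 * values on a typical platform where char is 8 bits wide. */
#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("bits in a char      : %d\n", CHAR_BIT);                      /* usually 8   */
    printf("signed char range   : %d .. %d\n", SCHAR_MIN, SCHAR_MAX);    /* -128 .. 127 */
    printf("unsigned char range : 0 .. %u\n", (unsigned)UCHAR_MAX);      /* 0 .. 255    */
    return 0;
}
```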
So we have the possibility to store up to 256 (or 128, if signed) distinct values in a single memory cell (remember, we are at the beginning of time). When we take into consideration that computers were invented in the Western world, where English was the dominant language, there was no problem. The ASCII table contained practically all the characters You could possibly need. The first 32 codes (0–31) were control codes – unprintable characters like the new line or the beep indicator – and code 32 stood for SPACE. Then we had digits and math/punctuation symbols. Starting from 65 we had the letter A, and at 122 the letter z resided. Then there were a couple more signs and that was it. Ok, You may ask, but what about the remaining codes up to 255 when we are using an unsigned value? And here be dragons.
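To see that a character is really just a small number with the ASCII mapping on top, a tiny C sketch like this one will do (the particular characters chosen here are just an example):

```c
/* A small sketch showing that a C char is just a number with an ASCII
 * mapping on top: printing a few characters next to their codes. */
#include <stdio.h>

int main(void)
{
    char samples[] = { ' ', '0', 'A', 'Z', 'a', 'z' };
    for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); i++) {
        printf("'%c' -> %d\n", samples[i], samples[i]);   /* e.g. 'A' -> 65 */
    }
    return 0;
}
```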
Code pages
Depending on where You lived, different characters were assigned to the codes above 127, causing a real mess. However (and I remember this from the beginning of the 90s), as long as You stayed in one country and had an OS that supported your local letters (using code pages), everything was fine. What is a code page exactly? Actually we can treat the ASCII table itself as a kind of code page. It's just a mapping between a number (identifier) and a concrete letter/character that will be displayed when a program/system encounters this number. Coming back to the codes bigger than 127 in original ASCII – different code pages (set per OS) resulted in different characters being presented to the user (there is also the extended ASCII table, which is why programs like Norton Commander or the Turbo Pascal IDE could have a graphics-like look). Every OS back at the time (mainframes/DOS/Windows/etc.) had its own set of code pages. And as mentioned above, it worked fine for the time being. But then a new kid on the block appeared.
Here comes the internet
Suddenly we were no longer (most of the time) limited to one country and one encoding. An email written in Japan could be sent to somebody living in Russia, using an OS and applications that had problems displaying the proper characters. And what about an HTML page? I could create it using my local encoding (in Poland, at the very beginning of the internet, the ISO-8859-2 encoding was used). I still remember the strange symbols that appeared when I viewed my home page, e.g. from Linux-based machines. And if You had forgotten – there was still the problem of displaying the thousands of characters needed for e.g. the Chinese language. Something had to be done.
Birth of the Unicode
Unicode can be used as a general word to describe the project to create a code page of all code pages. The main goal was to be backwards compatible with plain ASCII and ISO-8859-1 (which contained only Latin characters), but also to offer the possibility to store whatever we want. Looking back we may say that it was a success – as of today well over 140k different letters/characters/symbols/emojis can be represented in Unicode. How is that possible?
Unicode simply abandoned the idea of using only one byte to store a character identifier. Of course, as I've written above, it is still backwards compatible with ASCII, but if a character is completely out of the ASCII scope, more bytes are used to store the information about it. It's that simple.
Of course over time a decision had to be made – how many bytes should be used to store character identifiers? I've mentioned the Internet – it's pretty obvious (especially in the early days of the Internet's rise in popularity) that bandwidth was a bottleneck. We could even have used 8 bytes per character, but for a lot of use cases that would be overkill. Therefore, a couple of sub-encodings were designed.
UTF-8
The most popular of them (especially on the Internet today) is UTF-8. Its basic unit is 8 bits (thank You, Captain Obvious!), and up to 4 of those units can be used for a single character – therefore it is called a variable-width encoding. As You can see, it gives enough space to encode practically all modern languages. The question is – how does a parser/program know that it should treat 4 incoming bytes as a single identifier rather than 4 different characters? The guys inventing Unicode thought of that and came up with a simple solution.
If a character is within the old ASCII range, then the first bit of its single byte is set to 0. Simple, right?
Let's go further. If a character needs 2 or more bytes to be encoded, the first byte starts with as many 1 bits as there are bytes in the whole sequence, followed by a 0 (so 110 for a 2-byte character, 1110 for 3 bytes, 11110 for 4), and every following byte of the sequence starts with the bits 10. Simple and elegant. The table below mirrors the one in the Wiki article about UTF-8 – in general I recommend the linked section to see a full working example of the encoding.

Code point range      Byte 1    Byte 2    Byte 3    Byte 4
U+0000 – U+007F       0xxxxxxx
U+0080 – U+07FF       110xxxxx  10xxxxxx
U+0800 – U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
U+10000 – U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
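To make the scheme concrete, here is a minimal sketch of a hand-rolled UTF-8 encoder in C – just an illustration of the leading-bit rules above, not a production encoder (it skips, for example, the check that rejects surrogate code points). The utf8_encode helper is my own name, not a standard library function:

```c
/* A minimal sketch of encoding a single Unicode code point into UTF-8
 * by hand, following the leading-bit scheme described above.
 * Returns the number of bytes written (0 for an out-of-range code point). */
#include <stdio.h>
#include <stdint.h>

static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                        /* 0xxxxxxx - plain ASCII     */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                /* 110xxxxx 10xxxxxx          */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {             /* 11110xxx 10.. 10.. 10..    */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);        /* U+20AC, the euro sign */
    printf("U+20AC ->");
    for (int i = 0; i < n; i++)
        printf(" %02X", buf[i]);             /* expected: E2 82 AC */
    printf("\n");
    return 0;
}
```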
Ok, but what about UTF-16?
Yeah, what about it? I do Java for a living, and it is common knowledge that Java uses UTF-16 as its internal encoding. The same applies to JavaScript, the Windows OS, and some other big guys. The answer to the question – why are they using UTF-16? – is simple: historical reasons. Before UTF-16 was born, the dominant encoding was the fixed-width UCS-2 encoding, which was a product of ISO. What is more, for some time this encoding was called Unicode itself! UCS-2 was fixed-width, as I've already mentioned, using 2 bytes to store a character's numeric point. However, along the way it became obvious that it wouldn't be able to contain all the new-shiny-non-Latin-characters-in-the-whole-wide-world (and Klingon as a bonus) 😉 In order to do that, but also not break backward compatibility for existing code/APIs, UTF-16 was chosen as the new standard. For a more thorough explanation I suggest reading the Wikipedia article about the history of UTF-16 and the concept of a single character unit for the BMP (Basic Multilingual Plane).
All right, let's see how it works. In UTF-8 the basic code unit has 8 bits and therefore takes one byte to store. In UTF-16 the basic unit has 16 bits and can be used on its own or paired with a colleague, hence we get up to 32 bits per character. That is a lot of space! However, if You remember the part about UTF-8 storing data in more than one byte, You could wonder how UTF-16 handles this. The answer is simple and similar to the solution used in UTF-8. When two units are paired (a so-called surrogate pair), each of them sacrifices its first 6 bits as an indicator of whether it is the first (high) or the second (low) unit of the pair. With the full scope of a pair we would have 32 bits at our disposal, but when we subtract the 12 bits used for indicators we end up with 20 bits – 2 to the power of 20, i.e. 1,048,576 code points on top of those a single unit can express, which gives 1,112,064 possible numeric codes in total. Unless we discover new civilizations in a galaxy far, far away, we are covered.
What is worth mentioning is the Unicode code chart – the ultimate mapping list containing numeric points and their corresponding characters. In order to avoid collisions with the 16-bit values used as indicators in UTF-16, a part of the range is reserved and does not denote any character: in hex it is the range U+D800 to U+DFFF (the so-called surrogates). The bit layout of a surrogate pair (shown as a table in the Wiki article on UTF-16) is sketched in the example below.
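Here is a minimal C sketch of splitting a code point above U+FFFF into such a pair – the bit layout in the comments follows the description above, and the chosen code point is just an example:

```c
/* A sketch of how a code point above U+FFFF is split into a UTF-16
 * surrogate pair:
 *
 *   U' = code point - 0x10000           -> 20 bits: yyyyyyyyyy xxxxxxxxxx
 *   high surrogate = 110110 yyyyyyyyyy  -> range 0xD800..0xDBFF
 *   low  surrogate = 110111 xxxxxxxxxx  -> range 0xDC00..0xDFFF
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t cp = 0x1F600;                            /* an emoji outside the BMP */
    uint32_t u  = cp - 0x10000;                       /* 20 bits spread over 2 units */
    uint16_t high = 0xD800 | (uint16_t)(u >> 10);     /* top 10 bits    */
    uint16_t low  = 0xDC00 | (uint16_t)(u & 0x3FF);   /* bottom 10 bits */

    printf("U+%04X -> %04X %04X\n",
           (unsigned)cp, (unsigned)high, (unsigned)low); /* expected: D83D DE00 */
    return 0;
}
```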
At the end of this subchapter the most important thing must be said: UTF-16 is not ASCII compatible! Since it uses 16 bits for a single code unit, even a plain ASCII character takes two bytes, so an existing ASCII byte stream is not valid UTF-16. That's why UTF-16 did not win the internet, while its 'smaller' cousin UTF-8 rules the world.
But I’ve heard about UTF-32 too
Yep, I'm sure You did. With the above knowledge it is sufficient to write here that UTF-32 is a fixed-width encoding that stores every code point in a single 32-bit (4-byte) unit. The biggest advantage of this encoding is constant-time random access – the n-th character is simply the n-th unit. On the other side – hey – using 4 bytes to store simple ASCII characters? C'mon!
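A tiny sketch of that trade-off, with a hand-written (hypothetical) UTF-32 buffer as the input:

```c
/* A tiny sketch of the UTF-32 trade-off: each code point occupies one
 * fixed 32-bit unit, so the n-th character is a plain array access,
 * at the cost of 4 bytes even for ASCII. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* "Hi" plus a smiley, stored as raw UTF-32 code points */
    uint32_t text[] = { 0x48, 0x69, 0x263A };

    printf("third character: U+%04X\n", (unsigned)text[2]);            /* O(1) indexing */
    printf("memory used: %zu bytes for 3 characters\n", sizeof text);  /* 12 bytes */
    return 0;
}
```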
Enough Unicode, tell me about endianness and BOM
From the title of this subchapter You may think that endianness is not related to Unicode at all. The truth is – it is heavily related, but let's take one step at a time.
Endianness is the way a machine stores/reads portions of data that require more than 1 byte. Just as people use left-to-right or right-to-left writing systems, the same situation applies to endianness. We have two types – big endian and little endian. In big endian the most significant byte is stored first, and the rest follows. This is how people in the West write numbers: in 1,567,321 the most significant digit, the 1 standing for one million, comes first and the rest follows. Big endian, however, is not the most popular – it is used mostly in legacy or mainframe architectures (e.g. IBM System/360, SPARC, Motorola 68000). Modern CPUs like x86 use little endian, which, contrary to big endian, stores the least significant byte first. There are also architectures that support bi-endianness and can switch the endianness in use (such as MIPS, ARM from v3 up, IA-64).
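If You'd like to check which kind of machine You are sitting at, a minimal C sketch like this one does the trick – it stores a two-byte value and looks at which byte landed first in memory:

```c
/* A minimal sketch of detecting the endianness of the machine it runs on:
 * store a multi-byte value and inspect which byte ends up first in memory. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t value = 0x0102;
    unsigned char *first_byte = (unsigned char *)&value;

    if (*first_byte == 0x01)
        printf("big endian (most significant byte first)\n");
    else
        printf("little endian (least significant byte first)\n");
    return 0;
}
```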
Ok, but why does it bother us? Because if a code unit takes 2 bytes, as in UTF-16, we have to know in which order those bytes should be read. We might get a document that was written using big endian while our machine uses little endian. How to solve this puzzle? Here comes the BOM (byte order mark) to the rescue. It is nothing more than a simple indicator at the beginning of a document that says which endianness was used to create it. The BOM is the character U+FEFF: if the first two bytes read as FE FF, the document is big endian; if they read as FF FE (which would be U+FFFE, a value guaranteed never to be a real character), it is little endian. It is recommended to send a BOM with every UTF-16 document – if there is none, big endian is assumed as the default.
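A small sketch of reading that mark, using a hand-made (hypothetical) UTF-16 buffer as the input:

```c
/* A sketch of inspecting the first two bytes of a UTF-16 buffer
 * for a byte order mark, as described above. */
#include <stdio.h>

int main(void)
{
    /* hypothetical little-endian UTF-16 data: BOM (FF FE) followed by "Hi" */
    unsigned char doc[] = { 0xFF, 0xFE, 'H', 0x00, 'i', 0x00 };

    if (doc[0] == 0xFE && doc[1] == 0xFF)
        printf("BOM says: big endian\n");
    else if (doc[0] == 0xFF && doc[1] == 0xFE)
        printf("BOM says: little endian\n");
    else
        printf("no BOM - assume big endian by default\n");
    return 0;
}
```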
To sum up
As You can see, it's not rocket science, and encoding in modern computers is not something to be scared of. I hope the above article was useful for You. It's also a foundation for a more specific post about how C handles characters. You can find it here.