UTF handling in C
As You probably know, I’m a Java developer, and the concept of a String is my daily bread. In C, however, there is no string type. Of course, You can still assign a string literal to a variable. But what kind of variable? This is where the character type comes in. The funny thing is that it accepts numbers too, because under the hood C treats letters as numbers. Consider the following code:
#include <stdio.h>

int main() {
    char a = 65;
    printf("%d is a number", a);
    printf("\n");
    printf("%c is a character", a);
    return 0;
}
The output is:
65 is a number
A is a character
To understand how it is possible to have a cookie and eat it too, we must go back a little in time.
Back to the beginning – how does C handle characters?
Actually, it doesn’t! I asked a question about this on Stack Overflow and was given a simple answer. As far as a C program is concerned, the value of a char variable is just one byte. Every expression like
sizeof(variable_of_type_char)
will return 1.
The C program itself treats a char as a number. Putting something on the screen is still not a problem, because it is not plain C that renders the output. All C does is pass a stream of bytes to the standard output. Something else is usually sitting on the other end (a terminal on Linux, a console window on Windows, or whatever else), and it is up to that program to interpret and display the bytes coming from the app properly. That’s all folks!
As long as You stay within ASCII, that’s fine. But what happens when we want to store a character whose code point lies beyond that range? There is a simple solution to this problem: just use char []. An array is safe because no matter how many bytes the encoding needs for a given character, they all fit in the array. Want to use a character from classical Chinese? No problem, array to the rescue.
However, there were attempts to make it better…
Yep. There were. And to be honest, when I was reading about character variables in two separate books, I found two different approaches. Stephen Prata in his ‘C Primer Plus, 6th Edition’ (2014) does not mention the name UTF at all! Characters are characters, we use basic ASCII signs and we are covered. Which was actually strange to see. My other book, ‘C in a Nutshell’ by Peter Prinz and Tony Crawford, briefly mentions (maybe it took half a page?) the types wchar_t, char16_t and char32_t, but it just points to the header files these types are defined in, and that’s all. Ok, I’ve learned quite a lot.
Joking aside, the Internet came to my help, and here is what I’ve learned. The wchar_t type was introduced by the C89 standard, and its goal was to represent every possible character. The ‘w’ in the name stands for wide, so this type is meant to store wide characters. The problem is that the assumption turned out to be wrong. wchar_t is very often stored in 2 bytes (it can also be defined as a 32-bit integer, and in implementations that do not support internationalisation it can be as short as a plain char), so it definitely cannot store the all-the-time-growing list of emojis. Of course, that is understandable: who was thinking about emojis back in the 1990s? As Wikipedia says:
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.
The people responsible for the C standard saw this problem, and in C11 they added two new types: char16_t and char32_t. Their goal is to store a character in a specific encoding (UTF-16 or UTF-32), with a size appropriate to hold its code units. However, as Robert Seacord, the author of the third book in my possession, ‘Effective C’, states:
C11 doesn’t include library functions for the new data types, except for one set of character conversion functions, to allow library developers to implement them. C defines two environment macros that indicate how characters represented in these types are encoded. If the environment macro __STDC_UTF_16__ has the value 1, values of type char16_t are UTF-16 encoded. If the environment macro __STDC_UTF_32__ has the value 1, values of type char32_t are UTF-32 encoded. If the macro isn’t defined, another implementation-defined encoding is used.
Whoa! It seems that this subject is way more complicated than I expected. However, I assume I will become more familiar with it when I eventually dig into some real-world project. For the time being, I think I know enough to satisfy my interest.
There were a couple of online resources I’ve used and recommend for further reading.