Disclaimer

Any opinions expressed here are my own and not necessarily those of my employer (I'm self-employed).

Jan 8, 2012

Introduction to character encoding

"FACE WITH TEARS OF JOY" (U+1F602)
Text encoding is a persistent source of pain and problems, especially when you need to communicate textual information across different systems. Every time you read or create an XML file, a text file, a web page, or an e-mail, the text is encoded in some way. If the encoding is messed up along the way, the receiver will be looking at strange characters instead of the ori�inal t□xt. (ba-da-bing :)

I've been fighting with character sets on several occasions throughout the years. Just recently, I had a bug in TransformTool related to character encoding and how errors are handled in the .NET framework. While writing about the bug I needed a reference to a basic introduction to character encoding — only to discover that most are very technically focused and dive right into the characters' hex codes. Here, I'll try to fill that gap and explain only the basics. I'll include pointers to more detailed resources in case you decide to dig deeper into the dark world of character encodings.

How encodings work
The Unicode Consortium has a great explanation of how it really works:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.
The number assigned to a character is called a codepoint. An encoding defines how many codepoints there are, and which abstract letters they represent, e.g. "Latin Capital Letter A". Furthermore, an encoding defines how a codepoint is represented as one or more bytes. We'll use one of the most prominent encodings as our first example: ASCII.
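
If you want to see this mapping for yourself, here's a quick sketch in Python (not from the original post; Python's built-in ord() and encode() make it easy to poke at):

```python
# A character's codepoint is just a number; an encoding turns it into bytes.
codepoint = ord("A")
print(codepoint)  # 65 -- the codepoint for "Latin Capital Letter A"

# ASCII stores that codepoint as a single byte with the same value.
ascii_bytes = "A".encode("ascii")
print(ascii_bytes[0])  # 65
```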

Capital A in the ASCII encoding
Note that an encoding does not determine what a character should look like on your screen, that's taken care of by fonts. The figure shows how two different fonts give two different graphical representations of the A, though it's still the same character.

There, that was the big picture in a few paragraphs! That's how it works! Now we'll go into more detail on how characters are encoded, because that's usually where things go wrong. We'll leave the fonts behind; if you want to dig further into this, see Understanding characters, keystrokes, codepoints and glyphs.

We've seen that ASCII assigns the number 65 to a capital A. But what about the other characters? Here are the uppercase characters in ASCII along with their (decimal) codepoints:

A  B  C  D  E  F  G  H  I  J  K  L  M
65 66 67 68 69 70 71 72 73 74 75 76 77
N  O  P  Q  R  S  T  U  V  W  X  Y  Z
78 79 80 81 82 83 84 85 86 87 88 89 90

And here are the lowercase characters and their codepoints:

a   b   c   d   e   f   g   h   i   j   k   l   m
97  98  99  100 101 102 103 104 105 106 107 108 109
n   o   p   q   r   s   t   u   v   w   x   y   z
110 111 112 113 114 115 116 117 118 119 120 121 122

There you go, that's the English alphabet in both lower- and uppercase. You can have a look at the complete table of printable ASCII characters at Wikipedia, where you'll also find numbers, punctuation marks, etc. Character encodings are often referred to as code pages or character sets as well.
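
Since the codepoints are consecutive, you can regenerate both tables with chr(), the inverse of ord() (a Python sketch, not from the post):

```python
# Rebuild the ASCII letter tables: chr() maps a codepoint to its character.
upper = {chr(cp): cp for cp in range(65, 91)}   # A..Z -> 65..90
lower = {chr(cp): cp for cp in range(97, 123)}  # a..z -> 97..122

print(upper["A"], upper["Z"])  # 65 90
print(lower["a"], lower["z"])  # 97 122
```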

There are (too) many encodings in common use around the world, each defining their own set of characters with corresponding numbers. Wikipedia lists over 50 common character encodings. The sheer number of encodings is one of the main reasons that things get messy.

How encodings differ
The Unicode Consortium summarizes the problems that arise due to all these different character encodings:
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
To show some of the conflicts, we'll discuss two more common encodings, in addition to ASCII: the Latin-1 (ISO-8859-1) and Latin-2 (ISO-8859-2) character sets. Here's how they line up with ASCII.

  • ASCII is a seven bit encoding. Seven bits lets you count from 0 to 127. Consequently, you can represent 128 different characters.
  • Latin-1 is an eight bit encoding. Eight bits (a byte) lets you count from 0 to 255. You could therefore theoretically represent 256 different characters, but 32 are unused, leaving 224 assigned. Latin-1 was defined to handle western European languages.
  • Latin-2 is also an eight bit encoding, and also has 224 assigned characters. Latin-2 copes with Eastern European languages.
  • Although Latin-1 and Latin-2 contain more characters than ASCII, they are identical to ASCII for the first 128 codepoints, and are consequently backwards compatible for those characters.
  • Check out the links to have a look at what the tables of characters look like!

The first obvious problem here is that the two Latin encodings define more characters than ASCII does, so they have characters that do not exist in the ASCII encoding. It's for example impossible for me to represent my name (André) using the ASCII encoding, but it's not a problem with either Latin-1 or Latin-2. The offending character is é, if you haven't already guessed it.
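
You can demonstrate this in a few lines of Python (a sketch, not from the post; "latin-1" and "latin2" are Python's codec names for the two ISO-8859 encodings). ASCII refuses to encode the é, while both Latin encodings accept it:

```python
name = "André"

# ASCII has no codepoint for é, so encoding fails.
try:
    name.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII failed on:", error.object[error.start])  # é

# Latin-1 and Latin-2 both assign é the codepoint 233 (0xE9).
print(name.encode("latin-1"))  # b'Andr\xe9'
print(name.encode("latin2"))   # b'Andr\xe9'
```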

Moving on, the Latin-1 and Latin-2 encodings illustrate the problem of using the same number for two different characters. Here's a comparison for codepoints 192 through 199 for Latin-1 and Latin-2:

Codepoint  192 193 194 195 196 197 198 199
Latin-1    À   Á   Â   Ã   Ä   Å   Æ   Ç
Latin-2    Ŕ   Á   Â   Ă   Ä   Ĺ   Ć   Ç

Some characters match, but if you look at e.g. character number 197, you see that the same number maps to different characters in the two charsets. Mistakenly reading an e-mail with the Latin-2 encoding instead of the Latin-1 encoding would, for example, change the Norwegian word FÅRIKÅL to FĹRIKĹL. (Never heard of fårikål? :) As the Ĺ character does not exist in the Norwegian alphabet, this would be gibberish to an average Norwegian.

To summarize, if you write the word FÅRIKÅL to a text file using the Latin-1 encoding, here's how things can go wrong depending on your choice of encoding when reading the file:
  • If you read the file using the ASCII encoding, the byte "11000101" cannot be decoded to a valid codepoint. You might get an error, or a replacement character such as � or □. Or even worse, you might get a ?. More on that in an upcoming blog post on how .NET handles errors.
  • If you read the file using the Latin-2 encoding, "11000101" will be decoded to a valid codepoint, which is assigned to the letter Ĺ. FÅRIKÅL then becomes FĹRIKĹL.
These examples show why you have to be careful about what encoding you're using to read and write text, to avoid any loss of data.
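
Here's the whole failure scenario played out in Python (a sketch, not from the post; errors="replace" asks the decoder for replacement characters instead of an exception):

```python
# Write FÅRIKÅL with Latin-1: each Å becomes the byte 0xC5 (197, 11000101).
data = "FÅRIKÅL".encode("latin-1")

# Reading with Latin-2 silently maps 0xC5 to the wrong character, Ĺ.
print(data.decode("latin2"))  # FĹRIKĹL

# Reading as ASCII can't decode 0xC5 at all; with errors="replace" we get
# the replacement character instead of an error.
print(data.decode("ascii", errors="replace"))  # F�RIK�L
```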

To further complicate things, there are encodings that use multiple bytes to store a character. I bet you can imagine that this opens yet another world of problems, since you need to keep track of several bytes. You're right, but it's also the only way to move past the one-byte encodings, which limit a character set to at most 256 characters.

There must be some kind of way out of here
Unicode comes to the rescue. Quoting the consortium again:
Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally.
The Unicode standard defines more than 100 000 characters and their codepoints at the time of writing, but can potentially define more than one million. That means there's no need for several character sets anymore; Unicode can include all characters. The big players in the IT industry (Microsoft, Apple, Google, and more) work together to develop the standard further, ensuring support across platforms.

There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. All of these can represent all Unicode characters. The most common encoding on the web is UTF-8, which you've probably come across. The text you're reading now is, for example, served as UTF-8. UTF-16 is also in widespread use, for example in the .NET framework and the Java runtime environment to represent strings in memory.

UTF-8 uses one, two, three, or four bytes to encode a character. It's backwards compatible with ASCII, which means that all the one byte characters are identical to ASCII. Other characters are stored using two, three or four bytes.
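
A quick way to see the variable-width behaviour (a Python sketch, not from the post):

```python
# UTF-8 uses 1-4 bytes per character; ASCII characters keep their old bytes.
for ch in ["A", "é", "€", "😂"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# A -> 1 byte (identical to ASCII), é -> 2 bytes, € -> 3 bytes, 😂 -> 4 bytes
```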

UTF-16 uses two or four bytes to encode a character, while UTF-32 uses four bytes per character. The figure shows how a capital A would be encoded.
Latin Capital Letter A encoded forms
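
The encoded forms from the figure can be sketched in Python (the -be suffix pins the byte order, so no byte-order mark is added):

```python
# The same capital A in the three Unicode encoding forms:
print("A".encode("utf-8"))      # b'A'               -- 1 byte
print("A".encode("utf-16-be"))  # b'\x00A'           -- 2 bytes
print("A".encode("utf-32-be"))  # b'\x00\x00\x00A'   -- 4 bytes
```
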
If you want to learn more about Unicode and the different encodings, you should spend some time on Understanding Unicode. There's a lot to wrap your head around if you start digging into the details.

Since you've tagged along this far in this post, here's a fun fact. Unicode defines not just characters but also lots of symbols. The emoji depicted at the beginning of this post is actually a Unicode character. It's called "Face with tears of joy." You'll find it here, along with many others.
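
You can look it up by codepoint, too (a Python sketch):

```python
# U+1F602 is "Face with tears of joy".
print(chr(0x1F602))    # 😂
print(hex(ord("😂")))  # 0x1f602
```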

I hope this post helped you grasp the overarching logic behind characters and their encoding in computers. If you really want to inflict more pain on your brain, I suggest you spend some time reading the references. You can also play with text encoding in TransformTool, which supports several encodings and can show you the bytes as decimal/hex/binary.

I've highlighted some common problems related to character encoding. When you're building new systems the advice is almost always: "Stick to UTF-8." It's also safe when communicating with legacy systems that use ASCII.

Note, however, that UTF-8 is NOT compatible with systems that use anything other than UTF-8 or ASCII, such as the Latin-(1,2..X) encodings. Then you either have to change the system to use UTF-8, or use the same encoding as the system when reading the data on your side. Knowing just that might help you figure things out a lot faster when things start to break.

Good luck. ☺

PS! If you're a .NET head, stay tuned for an upcoming post on some .NET encoding subtleties. You don't want to miss those.

29 comments:

  1. Nice article, but I think it misses a link to the "classic" blogpost about character encodings by Joel Spolsky (from 2003): http://www.joelonsoftware.com/articles/Unicode.html

  2. You're right, for those who are ready for a more technical article that's a classic blogpost. Thanks for adding it!

  3. Since you can tell what type of byte you're looking at from its first few bits, even if something gets mangled somewhere you don't drop the whole series; UTF-8 resynchronizes at the next character.


Copyright notice

© André N. Klingsheim and www.dotnetnoob.com, 2009-2018. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to André N. Klingsheim and www.dotnetnoob.com with appropriate and specific direction to the original content.
