[Figure: "FACE WITH TEARS OF JOY" (U+1F602)]
I've been fighting with character sets on several occasions throughout the years. Just recently, I had a bug in TransformTool related to character encoding and how errors are handled in the .NET framework. While writing about the bug I needed a reference to a basic introduction to character encoding, only to discover that most are very technically focused and dive right into the characters' hex codes. Here, I'll try to fill that gap and explain only the basics. I'll include pointers to more detailed resources in case you decide to dig deeper into the dark world of character encodings.
How encodings work
The Unicode Consortium has a great explanation of how it really works:
"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one."
The number assigned to a character is called a codepoint. An encoding defines how many codepoints there are, and which abstract letters they represent, e.g. "Latin Capital Letter A". Furthermore, an encoding defines how the codepoint can be represented as one or more bytes. We'll use one of the most prominent encodings as our first example: ASCII.
[Figure: Capital A in the ASCII encoding]
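To make the idea concrete, here is a minimal sketch in Python (my choice of language for illustration, the post itself doesn't prescribe one). It uses the built-ins `ord` and `str.encode` to show the codepoint of a capital A and the single byte ASCII stores for it:

```python
# Look up the codepoint for a character and the byte ASCII stores for it.
letter = "A"

codepoint = ord(letter)            # the number assigned to the character
encoded = letter.encode("ascii")   # the bytes that represent it in ASCII
bits = format(encoded[0], "08b")   # the same byte written as eight bits

print(codepoint)  # 65
print(encoded)    # b'A'
print(bits)       # 01000001
```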
There, that was the big picture in a few paragraphs! That's how it works! Now we'll go into more detail on how characters are encoded, because that's usually where things go wrong. We'll leave fonts aside; if you want to dig further into them, see Understanding characters, keystrokes, codepoints and glyphs.
We've seen that ASCII assigns the number 65 to a capital A. But what about the other characters? Here are the uppercase characters in ASCII along with their (decimal) codepoints:
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
And here are the lowercase characters and their codepoints:
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 |
There you go, that's the English alphabet in both lower- and uppercase. You can have a look at the complete table of printable ASCII characters on Wikipedia, where you'll also find numbers, punctuation marks etc. Character encodings are often referred to as code pages or character sets as well.
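If you'd rather generate the two tables than read them, a short Python sketch along these lines should do it. The variable names are mine and purely illustrative:

```python
import string

# Map each ASCII letter to its decimal codepoint.
uppercase = {ch: ord(ch) for ch in string.ascii_uppercase}
lowercase = {ch: ord(ch) for ch in string.ascii_lowercase}

print(uppercase["A"], uppercase["Z"])  # 65 90
print(lowercase["a"], lowercase["z"])  # 97 122

# Every lowercase letter sits exactly 32 positions after its uppercase twin.
offsets = {
    ord(lo) - ord(up)
    for up, lo in zip(string.ascii_uppercase, string.ascii_lowercase)
}
print(offsets)  # {32}
```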
There are (too) many encodings in common use around the world, each defining its own set of characters with corresponding numbers. Wikipedia lists over 50 common character encodings. The sheer number of encodings is one of the main reasons that things get messy.
How encodings differ
The Unicode Consortium summarizes the problems that arise due to all these different character encodings:
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
To show some of the conflicts, we'll discuss two more common encodings, in addition to ASCII: the Latin-1 (ISO-8859-1) and Latin-2 (ISO-8859-2) character sets. Here's how they line up with ASCII:
- ASCII is a seven-bit encoding. Seven bits let you count from 0 to 127. Consequently, you can represent 128 different characters.
- Latin-1 is an eight-bit encoding. Eight bits (a byte) let you count from 0 to 255. You could therefore theoretically represent 256 different characters, but 32 are unused, leaving 224 assigned. Latin-1 was defined to handle Western European languages.
- Latin-2 is also an eight-bit encoding, and also has 224 assigned characters. Latin-2 copes with Eastern European languages.
- Although Latin-1 and Latin-2 contain more characters than ASCII, they are identical to ASCII for the first 128 codepoints, and are consequently backwards compatible for those characters.
- Check out the links to have a look at what the tables of characters look like!
The first obvious problem here is that the two Latin encodings define more characters than ASCII does, so they have characters that do not exist in the ASCII encoding. For example, it's impossible for me to represent my name (André) using the ASCII encoding, but it's not a problem with either Latin-1 or Latin-2. The offending character is é, if you haven't already guessed it.
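Here's a small Python sketch of that limitation. The helper `try_encode` is a hypothetical name of my own, and `"iso-8859-1"` and `"iso-8859-2"` are the names Python's standard codecs use for Latin-1 and Latin-2:

```python
name = "Andr\u00e9"  # André, with é written as an escape for clarity

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

print(try_encode(name, "ascii"))       # None
print(try_encode(name, "iso-8859-1"))  # b'Andr\xe9'
print(try_encode(name, "iso-8859-2"))  # b'Andr\xe9'
```

In this particular case both Latin encodings happen to agree: é is stored as the byte 233 (hex E9) in each.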
Moving on, the Latin-1 and Latin-2 encodings illustrate the problem of using the same number for two different characters. Here's a comparison for codepoints 192 through 199 for Latin-1 and Latin-2:
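One way to reproduce the comparison yourself is to decode the same eight byte values with both encodings. This is a sketch using Python's standard codecs, with variable names that are mine:

```python
# Decode the same eight byte values (192 through 199) with both encodings.
raw = bytes(range(192, 200))

latin1 = raw.decode("iso-8859-1")
latin2 = raw.decode("iso-8859-2")

# Print one row per codepoint and flag the rows where the encodings disagree.
for value, a, b in zip(raw, latin1, latin2):
    marker = "" if a == b else "  <- differs"
    print(value, a, b, marker)

differing = [value for value, a, b in zip(raw, latin1, latin2) if a != b]
print(differing)  # [192, 195, 197, 198]
```

Half of the eight codepoints map to different letters, including 197, which is Å in Latin-1 and Ĺ in Latin-2.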
To summarize, if you write the word FÅRIKÅL to a text file using the Latin-1 encoding, here's how things can go wrong depending on your choice of encoding when reading the file:
- If you read the file using the ASCII encoding, the byte "11000101" cannot be decoded to a valid codepoint. You might get an error, or a replacement character such as � or □. Or even worse, you might get a ?. More on that in an upcoming blog post on how .NET handles errors.
- If you read the file using the Latin-2 encoding, "11000101" will be decoded to a valid codepoint, which is assigned to the letter Ĺ. FÅRIKÅL then becomes FĹRIKĹL.
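The two scenarios above can be sketched in Python as follows. This illustrates Python's decoding behaviour only (it says nothing about how .NET handles the same errors), and it skips the file and works directly on the bytes:

```python
word = "F\u00c5RIK\u00c5L"          # FÅRIKÅL
data = word.encode("iso-8859-1")    # the bytes that would be written to the file

# Reading with Latin-2 succeeds but silently yields the wrong letters.
as_latin2 = data.decode("iso-8859-2")

# Reading with ASCII fails; what you see depends on the error handling.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
as_ascii_replaced = data.decode("ascii", errors="replace")

print(format(data[1], "08b"))  # 11000101
print(as_latin2)               # FĹRIKĹL
print(strict_failed)           # True
print(as_ascii_replaced)       # F�RIK�L
```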
To further complicate things, there are encodings that use multiple bytes to store a character. I bet you can imagine that this can open yet another world of problems, since you need to keep track of several bytes. You're right, but it's also the only way to replace all the one-byte encodings, which limit a character set to 256 characters.
There must be some kind of way out of here
Unicode comes to the rescue. Quoting the consortium again:
"Unicode provides a consistent way of encoding multilingual plain text and brings order to a chaotic state of affairs that has made it difficult to exchange text files internationally."
The Unicode standard defines more than 100 000 characters and their codepoints at the time of writing, but can potentially define more than one million characters. That means there's no need for several character sets anymore: Unicode can include all characters. The big players in the IT industry (Microsoft, Apple, Google and more) work together to develop the standard further, ensuring support across platforms.
There are three Unicode encoding forms: UTF-8, UTF-16 and UTF-32. All of these can represent all Unicode characters. The most common encoding on the web is UTF-8, which you've probably come across. The text you're reading now is for example served as UTF-8. UTF-16 is also in widespread use, for example in the .NET Framework and the Java runtime environment to represent strings in memory.
UTF-8 uses one, two, three, or four bytes to encode a character. It's backwards compatible with ASCII, which means that all the one byte characters are identical to ASCII. Other characters are stored using two, three or four bytes.
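To see the variable length in practice, this Python sketch counts the bytes UTF-8 needs for four sample characters. The samples are my own picks, one for each possible length:

```python
# One sample character for each possible UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F602"]  # A, é, €, 😂

# Number of bytes UTF-8 needs for each character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# The one-byte case is byte-for-byte identical to ASCII.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
print(same_as_ascii)  # True
```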
UTF-16 uses two or four bytes to encode a character, while UTF-32 uses four bytes per character. The figure shows how a capital A would be encoded.
[Figure: Latin Capital Letter A encoded forms]
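The encoded forms can also be printed with a short Python sketch. Two assumptions of mine: I use the big-endian codec variants (`utf-16-be`, `utf-32-be`) so that no byte order mark is added to the output, and `encoded_hex` is a hypothetical helper name:

```python
def encoded_hex(text, encoding):
    """Encode text and return the bytes as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))

smiley = "\U0001F602"  # FACE WITH TEARS OF JOY

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, "|", encoded_hex("A", encoding), "|", encoded_hex(smiley, encoding))

# utf-8     | 41          | f0 9f 98 82
# utf-16-be | 00 41       | d8 3d de 02
# utf-32-be | 00 00 00 41 | 00 01 f6 02
```

The second column shows a character outside the first 65 536 codepoints, which is where UTF-8 and UTF-16 both need four bytes.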
Since you've tagged along this far in this post, here's a fun fact. Unicode defines not just characters but also lots of symbols. The crying smiley depicted at the beginning of this post is actually a Unicode character. It's called "Face with tears of joy." You'll find it here, along with many others.
I hope this post helped you grasp the overarching logic behind characters and their encoding in computers. If you really want to inflict more pain on your brain, I suggest you spend some time reading the references. You can also play with text encoding in TransformTool; it supports several encodings and can show you the bytes as decimal/hex/binary.
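If you have Python at hand, the standard `unicodedata` module can confirm the name. This is a sketch, and it assumes your Python ships a Unicode database recent enough to include the character:

```python
import unicodedata

smiley = "\U0001F602"

codepoint_hex = hex(ord(smiley))
official_name = unicodedata.name(smiley)

print(codepoint_hex)  # 0x1f602
print(official_name)  # FACE WITH TEARS OF JOY
```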
I've highlighted some common problems related to character encoding. When you're building new systems the advice is almost always: "Stick to UTF-8." It's also safe when communicating with legacy systems that use ASCII.
Note, however, that UTF-8 is NOT compatible with systems that use anything other than UTF-8 or ASCII, such as the Latin-(1,2..X) encodings. Then you either have to change the system to use UTF-8, or use the same encoding as the system when reading the data on your side. Knowing just that might help you figure out things a lot faster when things start to break.
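Here's a Python sketch of what such a mismatch looks like in both directions, using my name as the test data again. The variable names are illustrative:

```python
name = "Andr\u00e9"  # André

# UTF-8 bytes read as Latin-1: no error, but é turns into two wrong characters.
garbled = name.encode("utf-8").decode("iso-8859-1")
print(garbled)  # AndrÃ©

# Latin-1 bytes read as UTF-8: the lone byte 0xE9 is not valid UTF-8.
try:
    name.encode("iso-8859-1").decode("utf-8")
    utf8_read_failed = False
except UnicodeDecodeError:
    utf8_read_failed = True
print(utf8_read_failed)  # True

# Using the sender's encoding on both sides round-trips cleanly.
round_trip_ok = name.encode("iso-8859-1").decode("iso-8859-1") == name
print(round_trip_ok)  # True
```

If you ever see something like "Ã©" where you expected a single accented letter, UTF-8 data being read as Latin-1 is a likely suspect.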
Good luck. ☺
PS! If you're a .NET head, stay tuned for an upcoming post on some .NET encoding subtleties. You don't want to miss those.
Nice article, but I think it misses a link to the "classic" blogpost about character encodings by Joel Spolsky (from 2003): http://www.joelonsoftware.com/articles/Unicode.html
You're right, for those who are ready for a more technical article that's a classic blogpost. Thanks for adding it!
Since you can tell what type of byte you're looking at from the first few parts, then even if something gets mangled somewhere, you don't drop the whole series.