What is Character Encoding? Unicode?

Character encoding is one of the most important and least understood aspects of programming, and of computers in general. In its most basic sense, an encoding is the rule a computer uses to read the contents of a file and display them as text that humans can understand.


Every text file, no matter what language it is written in, is really just a big collection of 0's and 1's. This is called binary, and it's kind of important. But since most normal humans can't read in binary, computers take those binary numbers and convert them into characters—letters and numbers and punctuation—that make sense to people. How the computer does this for any given file is called the encoding. As an analogy, think back to what you learned in algebra. The binary numbers are x, and the characters that need to be displayed are y. Encoding is the equation that connects them.

Because binary numbers don't mean any specific character on their own, an encoding also defines which set of characters the 0's and 1's will be converted to.
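
To make that concrete, here is a rough sketch in Python (the idea is the same in any language, and the word "Hi" is just a convenient example): the text is stored as numbers, and the encoding is the rule that turns those numbers back into letters.

data = "Hi".encode("ascii")                  # the file's actual contents: b'Hi'
print([format(b, "08b") for b in data])      # ['01001000', '01101001'] -- the 0's and 1's
print(data.decode("ascii"))                  # Hi -- the characters you see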

Since computers were first developed in the United States, the first character sets only included the characters used in English: a-z, A-Z, 0-9, and some common punctuation. All of these characters fit into 128 slots, which conveniently takes up only 7 bits. This character set is called the American Standard Code for Information Interchange, or ASCII for short. ASCII only needs 7 bits, but the bytes on computers of the time had 8, so groups all over the world used that extra 8th bit to add another 128 characters from their own languages, for a total of 256. The result was that every language had a different way of displaying the same file.
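
You can see the 7-bit limit for yourself; here is a small Python sketch (the characters are picked arbitrarily):

for ch in "Az09!":
    # every ASCII character has a value below 128, so 7 bits are enough
    print(ch, ord(ch), format(ord(ch), "07b"))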

For example, the same byte that displays Ć in Windows-1250, an encoding for Central European languages, displays Æ in Windows-1252, a Western European encoding.
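
If you want to watch it happen, here is a small Python sketch using byte 0xC6, one byte where the two code pages disagree:

b = bytes([0xC6])
print(b.decode("windows-1250"))   # Ć  (Central European)
print(b.decode("windows-1252"))   # Æ  (Western European)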

Even between encodings meant for the same languages, differences exist. Two different standards, Windows-1250 and ISO 8859-2, both cover Central European languages, yet they place many of the same characters at different byte values.
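
For instance (byte 0xA9, chosen only as an illustration), the two Central European encodings read the very same byte differently in this Python sketch:

b = bytes([0xA9])
print(b.decode("iso-8859-2"))     # Š
print(b.decode("windows-1250"))   # ©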

This went on for a while, right up until transferring files internationally became commonplace, and every language had its own way of reading those 0s and 1s. The first 128 characters (0-127) were the same everywhere and consisted of non-accented English characters, while the rest varied wildly depending on the language.

Sending files to another country, or even just to someone whose computer didn't use exactly the same encoding as yours, resulted in all of your nice, language-specific characters getting transformed into other characters that made no sense in your file. And then there were the characters that the other encoding didn't support at all, so you'd just get rows of boxes or question marks. Don't even get started on Asian languages, which employ thousands of characters that can't possibly be represented in 256 places. (Asian languages used something called the Double Byte Character Set, in which some characters take one byte and some take two.)
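
Here is a rough Python sketch of that kind of garbling, using a bit of Polish as the sample text: encode it under one code page, decode it under another, and the accented letters turn to nonsense.

text = "Zażółć"                                        # Polish sample text
garbled = text.encode("windows-1250").decode("windows-1252")
print(garbled)                                         # Za¿ó³æ -- not what the author typed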

Thankfully, the nice people at the Unicode Consortium came along to sort it all out by the time the Internet showed up.

Unicode is an effort to pair every single character in every human language with a Unicode number, or code point. So far, there are over 100,000 of them. That's a lot more than 256! In Unicode, an A is always the same character as any other A, but it's a different character from a lowercase a. Every A is identified by its Unicode code point (U+0041).
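
In Python you can poke at code points directly; a quick sketch:

import unicodedata

print(hex(ord("A")))            # 0x41 -> U+0041
print(unicodedata.name("A"))    # LATIN CAPITAL LETTER A
print(hex(ord("a")))            # 0x61 -> U+0061, a different character entirely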

At first, people came up with the brilliant idea of representing every Unicode character with two bytes. This system was called UCS-2. It worked decently well, but since switching to UCS-2 required converting all of those old ASCII, Windows, and IBM documents into UCS-2, it didn't catch on right away.
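
You can still see the cost of that idea today, since UTF-16 behaves like UCS-2 for ordinary letters; a small Python sketch:

print("A".encode("utf-16-be"))           # b'\x00A' -- two bytes for one letter
print(len("Hello".encode("utf-16-be")))  # 10 bytes for 5 characters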

And then UTF-8 was born. In UTF-8, every code point from 0-127 is stored in a single byte, while those above 127 are stored using two to four bytes (the original design allowed up to six). This has the added advantage of being identical to ASCII for English text, which means that anything written in English looks exactly the same in UTF-8 as it did in older encoding methods. Handy, right? Right.
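
A quick Python sketch of those byte counts (the specific characters are just examples):

print("A".encode("utf-8"))       # b'A' -- 1 byte, identical to ASCII
print("é".encode("utf-8"))       # b'\xc3\xa9' -- 2 bytes
print("中".encode("utf-8"))      # b'\xe4\xb8\xad' -- 3 bytes
print("😀".encode("utf-8"))      # b'\xf0\x9f\x98\x80' -- 4 bytes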

UCS-4, UTF-16, UTF-32, and UTF-7 are all different ways of encoding the same Unicode code points, each using a different number of bytes per code point.
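
Here is a rough comparison in Python of how many bytes the same short text takes in a few of them (the little-endian variants are used just to skip the byte-order mark):

text = "Añ中"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)))   # 6, 6, and 12 bytes respectively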

Remember those Unicode code points? The ones that cover over 100,000 characters? Well, if you don't want to use UTF-8, you can still encode any of those code points in an older encoding system, as long as that system supports the character you're trying to use. You still can't display Russian letters in the Western European Windows-1252; you'll just get question marks and boxes. With the Unicode encodings (UTF-8 and friends), any character can be properly displayed.
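
A small Python sketch of that failure mode, with a Russian word as the example:

text = "Привет"                                        # Russian sample text
print(text.encode("utf-8"))                            # works fine
print(text.encode("windows-1252", errors="replace"))   # b'??????' -- the dreaded question marks
# without errors="replace", the encode call simply raises UnicodeEncodeError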

No matter what encoding system you're using, it's important to let people know. The top of any HTML document should contain a meta tag that tells web browsers which encoding to use to read it. It should look something like this: 

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 

This meta tag needs to be the very first thing in your file after the <html> and <head> tags; otherwise the browser will get confused and display your page in whatever encoding it thinks is right.


Finally, in order to use any particular encoding when creating or editing documents, you need a text editor that supports it. Many text editors (other than Notepad, of course) allow you to change the encoding used for a file, and some support more encodings than others. Figure out what you need and find a text editor that fully supports your choice of encoding.
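
The same advice applies when your own code reads or writes files; a small Python sketch (the filename is made up):

with open("example.txt", "w", encoding="utf-8") as f:   # be explicit instead of relying on the platform default
    f.write("Zażółć, Привет, 中文")

with open("example.txt", "r", encoding="utf-8") as f:
    print(f.read())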