Learn about Unicode

Actually, I had been confused about character encodings for a long time, and really could not figure out the relation between Unicode and UTF-8. After a whole afternoon of effort, googling around and reading lots of posts, I finally got a preliminary understanding of it. I will try my best to explain it clearly.

Before Unicode

It would be a long story to explain how Unicode came about. Before it existed, a character encoding named ASCII was created in America to map English characters to binary, with one byte corresponding to one character. For more detail, see the Wikipedia article on ASCII.

However, ASCII encodes only 128 characters, which is not enough for other languages. Hence, many charsets based on ASCII appeared, like ISO 8859, extending the encoding with more characters to express more languages.

Unicode

Many encodings exist all over the world, and it would be handy if there were a single one that included all characters, with every character mapped to one unique value. Unicode exists to do exactly that. Unicode is a charset that can include more than one million symbols, and every character's encoding is unique. Unicode identifies characters by a name and an integer number called its code point. For example, © is named "copyright sign" and has U+00A9 - 0xA9 can be written as 169 in decimal - as its code point.
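In Python 3, for example, the built-in `ord` and `chr` functions convert between a character and its code point, and the standard `unicodedata` module exposes the character's Unicode name - a minimal sketch of the © example above:

```python
import unicodedata

# A character and its code point are two views of the same thing.
cp = ord("©")                  # code point as an integer
print(cp)                      # 169
print(hex(cp))                 # 0xa9
print(chr(0xA9))               # ©
print(unicodedata.name("©"))   # COPYRIGHT SIGN
```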

On the Unicode wiki, we can learn that the Unicode code space is divided into seventeen planes of 2^16 code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to. The first plane (xy is 00) is used the most; it is called the Basic Multilingual Plane, or BMP, and contains the code points from U+0000 to U+FFFF. Note that even within the BMP, a single code point may need more than one byte to represent.
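Since each plane holds 2^16 code points, the plane a character belongs to is just its code point shifted right by 16 bits. A small illustration in Python (the helper name `plane` is my own):

```python
def plane(char: str) -> int:
    """Return the Unicode plane number (0-16) of a character."""
    return ord(char) >> 16

print(plane("A"))    # 0 -> Basic Multilingual Plane
print(plane("😀"))   # 1 -> U+1F600 lives in a supplementary plane
```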

Problem

However, Unicode is just a symbol set; it does not specify how to store the binary code. For instance, U+4E25 expresses the Chinese character 严 - but when a computer reads a sequence of bytes, how can it know whether to interpret them as ASCII or as Unicode?

As we know, an English letter needs only one byte; if we stored every character with a uniform multi-byte width, it would cause a lot of waste.
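The waste is easy to see by comparing a fixed four-byte encoding (UTF-32) against one byte per character for plain English text - a rough sketch:

```python
text = "hello"

# One byte per character is enough for English.
print(len(text.encode("ascii")))      # 5 bytes

# A fixed-width encoding spends four bytes on every character.
print(len(text.encode("utf-32-be")))  # 20 bytes
```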

Implementations

Hence, to solve this problem, many implementations have appeared, like UTF-8, UTF-16, UCS-2, and more.

UTF-16 && UCS-2

UTF-16 is an encoding that expresses a character using two or four bytes, while UCS-2 always uses exactly two bytes - which means UCS-2 can only represent characters in the BMP.
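In Python this is easy to observe: BMP characters take two bytes in UTF-16, while characters outside the BMP take four (a surrogate pair). A minimal sketch, using the big-endian codec so no byte-order mark is added:

```python
# BMP characters: two bytes each in UTF-16.
print(len("A".encode("utf-16-be")))    # 2
print(len("严".encode("utf-16-be")))   # 2

# Outside the BMP: four bytes (a surrogate pair).
print(len("😀".encode("utf-16-be")))   # 4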

UTF-8

Actually, UTF-8 is the most popular encoding on the internet. It is a variable-length encoding that expresses a character using one to four bytes, changing the number of bytes according to the character. Below is the rule of the UTF-8 encoding.

/*
Unicode             | UTF-8
hexadecimal         | binary
0000 0000-0000 007F | 0xxxxxxx     One byte
0000 0080-0000 07FF | 110xxxxx 10xxxxxx    Two bytes
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx    Three bytes
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    Four bytes
*/
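We can check the table against a real character. U+4E25 (严) falls in the range 0800-FFFF, so UTF-8 uses the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx - a quick sketch in Python:

```python
# Encode 严 (U+4E25) and look at the raw bits.
encoded = "严".encode("utf-8")
print(encoded)                          # b'\xe4\xb8\xa5'
print([f"{b:08b}" for b in encoded])    # ['11100100', '10111000', '10100101']
```

The leading bits 1110, 10, and 10 match the three-byte row of the table, and the remaining x bits concatenate back to 0100111000100101, which is 0x4E25.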

End

This is just a little of what I have learned about Unicode; if you want more details, visit the wiki and have fun.

Reference