Actually, I had been confused about character encoding for a long time, and I really could not figure out the relation between Unicode and UTF-8. After a whole afternoon of effort, googling a lot and reading many posts, I finally got a preliminary understanding of it. I will try my best to explain it clearly.
Before Unicode
It would be a long story to explain how Unicode came about. Before it existed, a character encoding named ASCII was created in America to define the relation between English characters and binary, with one byte corresponding to one character. For more detail, see here.
However, ASCII encodes only 128 characters, which is not enough for other languages. Hence, many charsets based on ASCII appeared, like ISO 8859, extending the encoding to express more languages.
Unicode
There are many encodings in use all over the world, and it would be handy if there were one charset that included all characters, with every character corresponding to one unique code. Hence, Unicode exists to make that happen. Unicode is a charset that can include more than one million symbols, each encoded uniquely. Unicode identifies a character by a name and an integer number called its code point. For example, © is named “copyright sign” and has U+00A9 (0xA9, or 169 in decimal) as its code point.
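To see the idea concretely, here is a minimal sketch in Python 3 (assuming an interpreter whose source and terminal use UTF-8; ord and chr convert between a character and its code point):

import unicodedata

c = "©"
print(unicodedata.name(c))  # COPYRIGHT SIGN
print(ord(c))               # 169, the code point in decimal
print(hex(ord(c)))          # 0xa9
print(chr(0x00A9))          # ©, recovered from the code point U+00A9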
On the Unicode wiki, we can learn that the Unicode code space is divided into seventeen planes of 2^16 code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to. We use the first plane (where xy is 00) the most; it is called the Basic Multilingual Plane, or BMP, and contains the code points from U+0000 to U+FFFF. Note that even within the BMP, a character may need more than one byte to express.
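As a small sketch of the plane arithmetic (Python 3; since each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits):

def plane(ch: str) -> int:
    # Each plane holds 2**16 code points, so the bits above the low
    # sixteen identify the plane (0x00 through 0x10).
    return ord(ch) >> 16

print(plane("A"))   # 0 -> Basic Multilingual Plane
print(plane("严"))  # 0 -> also in the BMP
print(plane("😀"))  # 1 -> Supplementary Multilingual Plane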
Problem
However, Unicode is just a symbol set; it does not specify how the binary code should be stored. A code point like U+4E25 can express a Chinese character like 严, but given a sequence of bytes, how can the computer tell whether to read it as ASCII or as Unicode?
Moreover, as we know, an English letter needs just one byte, so if we stored every character with a uniform fixed width, it would waste a lot of space.
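To make the waste concrete, here is a rough Python 3 comparison; UTF-32, a real fixed-width encoding that spends four bytes on every character, stands in for storing code points uniformly ("utf-32-be" avoids the byte-order mark):

text = "Hello"
print(len(text.encode("utf-8")))      # 5 bytes, one per English letter
print(len(text.encode("utf-32-be")))  # 20 bytes, four per character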
Implementations
Hence, to solve these problems, many implementations have appeared, like UTF-8, UTF-16, UCS-2, and more.
UTF-16 && UCS-2
UTF-16 is an implementation that expresses a character using two or four bytes, while UCS-2 always uses exactly two bytes, so it can only cover the BMP.
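A short Python 3 sketch shows the difference ("utf-16-be" is chosen to skip the byte-order mark; 𝄞 is U+1D11E, a character outside the BMP):

# A BMP character fits in two bytes under UTF-16 (and under UCS-2).
print("严".encode("utf-16-be").hex())  # 4e25 -> two bytes

# A character beyond the BMP needs a surrogate pair: four bytes.
# UCS-2 has no way to represent it at all.
print("𝄞".encode("utf-16-be").hex())   # d834dd1e -> four bytes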
UTF-8
UTF-8 is actually the most popular one on the internet. It is flexible, expressing a character using one to four bytes and varying the byte length according to the character. The following is the rule of UTF-8 encoding.
/*
Unicode (hexadecimal) | UTF-8 (binary)
0000 0000 - 0000 007F | 0xxxxxxx                            One byte
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx                   Two bytes
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx          Three bytes
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx Four bytes
*/
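For instance, we can apply the three-byte row to 严 (U+4E25) by hand. A minimal Python 3 sketch, checked against the built-in encoder:

cp = ord("严")  # 0x4E25 falls in the range 0800-FFFF: three bytes

# Split the sixteen code-point bits into 4 + 6 + 6 and add the markers.
b1 = 0b11100000 | (cp >> 12)          # 1110xxxx
b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # 10xxxxxx
b3 = 0b10000000 | (cp & 0x3F)         # 10xxxxxx

print(bytes([b1, b2, b3]))   # b'\xe4\xb8\xa5'
print("严".encode("utf-8"))  # b'\xe4\xb8\xa5', the same bytes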
End
This is just a little of what I have learned about Unicode; if you want more details, visit the wiki and have fun.