Convert Unicode to UTF-8
2012-09-21
I am converting 16-bit Unicode code points to UTF-8 so that Chinese characters display correctly. These are the sources for the conversion:
1. UTF-8 -> http://en.wikipedia.org/wiki/Utf8
2. CJK Unified Ideographs
1. Unicode CJK
Chinese characters fall within the CJK (Chinese, Japanese, Korean) ranges of Unicode.
The basic block, named CJK Unified Ideographs (U+4E00–U+9FFF), contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC (as of Unicode 6.1). The code charts are split into four ranges:
4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.
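A range check is enough to tell whether a character belongs to this basic block. The following is a minimal sketch; the helper name `isCjkUnifiedIdeograph` is made up for illustration, and it tests only the basic block U+4E00 through U+9FFF, not the CJK extension blocks.

```javascript
// Hypothetical helper (not from any library): true if the code point
// lies in the basic CJK Unified Ideographs block, U+4E00 through U+9FFF.
function isCjkUnifiedIdeograph(codePoint) {
  return codePoint >= 0x4e00 && codePoint <= 0x9fff;
}

console.log(isCjkUnifiedIdeograph('大'.codePointAt(0))); // true  (U+5927)
console.log(isCjkUnifiedIdeograph('A'.codePointAt(0)));  // false (U+0041)
```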
2. UTF-8 Unicode table
The goal is to translate the Unicode value on the right into the 3-byte UTF-8 character on the left.
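| UTF-8 first byte (3-byte sequences, E_) | First byte (decimal) | Unicode start (16-bit, hex) | Script / block |
|---|---|---|---|
| E0 | 224 | 0800* | Indic |
| E1 | 225 | 1000 | Misc. |
| E2 | 226 | 2000 | Symbol |
| E3 | 227 | 3000 | Kana, CJK |
| E4 | 228 | 4000 | CJK |
| E5 | 229 | 5000 | CJK |
| E6 | 230 | 6000 | CJK |
| E7 | 231 | 7000 | CJK |
| E8 | 232 | 8000 | CJK |
| E9 | 233 | 9000 | CJK |
| EA | 234 | A000 | Asian |
| EB | 235 | B000 | Hangul |
| EC | 236 | C000 | Hangul |
| ED | 237 | D000 | Hangul, Surrogates |
| EE | 238 | E000 | Private Use |
| EF | 239 | F000 | Forms |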
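The first byte of a 3-byte sequence is the prefix `1110` followed by the top four bits of the 16-bit code point, which is why each 0x1000-wide range of Unicode maps to one value from 224 to 239. A small sketch of that relationship follows; `utf8LeadByte3` is a hypothetical name used only for this illustration.

```javascript
// Hypothetical helper: first byte of the 3-byte UTF-8 sequence for a
// code point in U+0800..U+FFFF. The byte is 1110xxxx, where xxxx is the
// top 4 bits of the 16-bit code point.
function utf8LeadByte3(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError('code point is outside the 3-byte range');
  }
  return 0xe0 | (codePoint >> 12);
}

console.log(utf8LeadByte3(0x5000));              // 229
console.log(utf8LeadByte3(0x5927).toString(16)); // e5
```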
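3. Unicode -> UTF-8 conversion formula
For the CJK set, each 16-bit Unicode character becomes a 3-byte UTF-8 sequence (the U+FFFF row in the table below). The 5- and 6-byte rows come from the original UTF-8 design; RFC 3629 later limited UTF-8 to U+10FFFF, so only the 1- to 4-byte forms are valid today.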
| Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|---|---|---|---|---|---|---|---|
| 7 | U+007F | 0xxxxxxx | | | | | |
| 11 | U+07FF | 110xxxxx | 10xxxxxx | | | | |
| 16 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 21 | U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 26 | U+3FFFFFF | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 31 | U+7FFFFFFF | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
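The table translates fairly directly into shifts and masks. The sketch below is one possible encoder, not a reference implementation: `encodeUtf8` is a hypothetical name, it handles only the 1- to 4-byte rows (up to U+10FFFF), and it rejects surrogate code points, which cannot be encoded on their own.

```javascript
// Hypothetical encoder: returns the UTF-8 bytes of one code point as an
// array of numbers. Only the 1- to 4-byte rows of the table are handled.
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff ||
      (codePoint >= 0xd800 && codePoint <= 0xdfff)) {
    throw new RangeError('not an encodable code point');
  }
  if (codePoint <= 0x7f) {
    return [codePoint];                        // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    return [0xc0 | (codePoint >> 6),           // 110xxxxx
            0x80 | (codePoint & 0x3f)];        // 10xxxxxx
  }
  if (codePoint <= 0xffff) {
    return [0xe0 | (codePoint >> 12),          // 1110xxxx
            0x80 | ((codePoint >> 6) & 0x3f),  // 10xxxxxx
            0x80 | (codePoint & 0x3f)];        // 10xxxxxx
  }
  return [0xf0 | (codePoint >> 18),            // 11110xxx
          0x80 | ((codePoint >> 12) & 0x3f),   // 10xxxxxx
          0x80 | ((codePoint >> 6) & 0x3f),    // 10xxxxxx
          0x80 | (codePoint & 0x3f)];          // 10xxxxxx
}

const hexBytes = encodeUtf8(0x5927).map(b => b.toString(16).toUpperCase());
console.log(hexBytes.join(' ')); // E5 A4 A7
```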
4. Example
For the Chinese character ‘大’ (Unicode 0x5927), the conversion from Unicode to UTF-8 goes as follows:
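(1) Under the Unicode-to-UTF-8 encoding rules, Chinese characters use a 3-byte sequence,
so apply the three-byte conversion formula:
0800 - FFFF
1110xxxx 10xxxxxx 10xxxxxx
where the 16 positions marked x are filled with the corresponding bits of the Unicode value.
(2) 0x5927 in binary is 0101 1001 0010 0111.
Filling these bits into the x positions of the formula above gives
11100101 10100100 10100111
which in hexadecimal is E5 A4 A7.
(3) To verify:
type javascript:alert(encodeURI('大').replace(/%/g,'')) into the browser's address bar and press Enter. The alert should show E5A4A7.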
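The same steps can be reproduced in script form. This is only an illustrative sketch: `fillThreeByteTemplate` and `binaryBytesToHex` are hypothetical helper names, and the code assumes a 16-bit code point in the 3-byte range (U+0800 through U+FFFF). It fills the template bit by bit as in step (2), then compares the result with `encodeURI` as in step (3).

```javascript
// Hypothetical helper: fill 1110xxxx 10xxxxxx 10xxxxxx with the 16 bits
// of the code point, split 4 + 6 + 6. Returns three binary strings.
function fillThreeByteTemplate(codePoint) {
  const bits = codePoint.toString(2).padStart(16, '0');
  return ['1110' + bits.slice(0, 4),
          '10' + bits.slice(4, 10),
          '10' + bits.slice(10, 16)];
}

// Hypothetical helper: turn the binary strings into one uppercase hex string.
function binaryBytesToHex(binaryBytes) {
  return binaryBytes
    .map(b => parseInt(b, 2).toString(16).toUpperCase())
    .join('');
}

const filled = fillThreeByteTemplate(0x5927);
console.log(filled.join(' '));          // 11100101 10100100 10100111
console.log(binaryBytesToHex(filled));  // E5A4A7

// Same check as the address-bar snippet, without the alert.
const builtin = encodeURI('大').replace(/%/g, '');
console.log(binaryBytesToHex(filled) === builtin); // true
```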