  • Convert Unicode to UTF-8

    2012-09-21


    I am converting 16-bit Unicode code points to UTF-8 so that Chinese characters display correctly. Here are the sources for the conversion:

    1. UTF-8 -> http://en.wikipedia.org/wiki/Utf8
    2. CJK Unified Ideographs

    1. Unicode CJK

    Chinese characters fall within the CJK ranges of Unicode.

    The basic block, named CJK Unified Ideographs (4E00–9FFF), contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC. The code charts are split into four ranges:

    4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.
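The block bounds quoted above can be turned into a simple range check. This is a minimal sketch: the helper name `isCjkUnifiedIdeograph` and the constants are made up for illustration, and it only looks at the basic block (4E00–9FFF), not at any CJK extension blocks.

```javascript
// Bounds of the basic CJK Unified Ideographs block, as quoted in the text.
const CJK_BLOCK_START = 0x4e00;
const CJK_BLOCK_END = 0x9fff;

// Number of characters in U+4E00 through U+9FCC (both ends included).
const assignedCount = 0x9fcc - 0x4e00 + 1;

// Hypothetical helper: true when the first code point of `ch`
// lies inside the basic block.
function isCjkUnifiedIdeograph(ch) {
  const cp = ch.codePointAt(0);
  return cp >= CJK_BLOCK_START && cp <= CJK_BLOCK_END;
}

console.log(assignedCount); // 20941
console.log(isCjkUnifiedIdeograph("\u5927")); // true ("大", U+5927)
console.log(isCjkUnifiedIdeograph("A")); // false (U+0041)
```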

    2. UTF-8 / Unicode table

    What I am going to do is translate each 16-bit Unicode code point (right-hand side) into its 3-byte UTF-8 sequence (left-hand side).

    UTF-8 (3 bytes), lead byte E_      Unicode (16-bit, in hexadecimal)

    Lead byte (hex)  Lead byte (decimal)  Range start (hex)  Block
    E0               224                  0800*              Indic
    E1               225                  1000               Misc.
    E2               226                  2000               Symbol
    E3               227                  3000               Kana, CJK
    E4               228                  4000               CJK
    E5               229                  5000               CJK
    E6               230                  6000               CJK
    E7               231                  7000               CJK
    E8               232                  8000               CJK
    E9               233                  9000               CJK
    EA               234                  A000               Asian
    EB               235                  B000               Hangul
    EC               236                  C000               Hangul
    ED               237                  D000               Hangul, Surr
    EE               238                  E000               Priv Use
    EF               239                  F000               Forms
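The relationship in the table (each lead byte E0 through EF covers one 0x1000-wide slice of code points) can be computed directly. This is a small sketch: `leadByteFor` is a hypothetical helper name, and it assumes the input is already in the 3-byte range U+0800..U+FFFF.

```javascript
// Hypothetical helper: lead byte of the 3-byte UTF-8 form of a code point.
// The top 4 bits of the 16-bit code point fill the xxxx in 1110xxxx.
function leadByteFor(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError("expected a code point in U+0800..U+FFFF");
  }
  return 0xe0 | (codePoint >> 12);
}

console.log(leadByteFor(0x0800)); // 224 (0xE0)
console.log(leadByteFor(0x5927)); // 229 (0xE5), the row starting at 5000
console.log(leadByteFor(0xf000)); // 239 (0xEF)
```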

    3. Unicode -> UTF-8 conversion formula

    For the CJK set, each 16-bit Unicode character is encoded as 3 bytes in UTF-8.

    Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
     7    U+007F           0xxxxxxx
    11    U+07FF           110xxxxx  10xxxxxx
    16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
    21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
    26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
    31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

    Note: the 5- and 6-byte forms come from the original UTF-8 design. The current standard (RFC 3629) limits UTF-8 to 4 bytes and to code points up to U+10FFFF.
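The bit patterns in the table translate almost line for line into shifts and masks. The following is a sketch rather than a reference implementation: `encodeUtf8` and `toHex` are hypothetical names, and by assumption it covers only the first four rows (up to U+10FFFF) and rejects the surrogate range D800–DFFF.

```javascript
// Hypothetical helper: encode one Unicode code point as an array of UTF-8 bytes.
function encodeUtf8(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError("not an encodable Unicode code point");
  }
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx (the case used for CJK)
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Format bytes as two-digit uppercase hex, separated by spaces.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeUtf8(0x41))); // 41
console.log(toHex(encodeUtf8(0x5927))); // E5 A4 A7
```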

    4. Example

    For the Chinese character ‘大’ (Unicode 0x5927), the conversion from Unicode to UTF-8 goes as follows:

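    (1) Under the Unicode-to-UTF-8 encoding rules, Chinese characters use a 3-byte sequence,
    so we apply the three-byte conversion formula:
    0800 - FFFF
    1110xxxx 10xxxxxx 10xxxxxx
    The 16 positions marked x are filled with the corresponding bits of the Unicode code point.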
    
    (2) 0x5927 in binary is 0101 1001 0010 0111.
    Filling these bits into the x positions of the formula above gives
    11100101 10100100 10100111
    which in hexadecimal is E5 A4 A7.
    
    (3) To verify:
    type javascript:alert(encodeURI('大').replace(/%/g,'')) into the browser address bar and press Enter. The alert should show E5A4A7.
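Steps (1) to (3) can also be replayed outside the address bar, for example in a browser console or in Node.js. This sketch fills the bit template using strings, exactly as done by hand, and compares the result with two built-in encoders. `fillThreeByteTemplate` is a hypothetical helper, and the code assumes a runtime where `TextEncoder` is available as a global.

```javascript
// Hypothetical helper: fill the 16 bits of a code point into
// 1110xxxx 10xxxxxx 10xxxxxx and return the three bytes as hex.
function fillThreeByteTemplate(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError("expected a code point in U+0800..U+FFFF");
  }
  const bits = codePoint.toString(2).padStart(16, "0"); // "0101100100100111" for 0x5927
  const groups = [
    "1110" + bits.slice(0, 4), // first 4 bits
    "10" + bits.slice(4, 10), // next 6 bits
    "10" + bits.slice(10, 16), // last 6 bits
  ];
  return groups.map((g) => parseInt(g, 2).toString(16).toUpperCase()).join("");
}

const ch = "\u5927"; // "大"
const byHand = fillThreeByteTemplate(ch.codePointAt(0));
const byUri = encodeURI(ch).replace(/%/g, "");
const byEncoder = Array.from(new TextEncoder().encode(ch), (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
).join("");

console.log(byHand, byUri, byEncoder); // E5A4A7 E5A4A7 E5A4A7
```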