    2012-09-01


    utf8_general_ci vs. utf8_unicode_ci

    When setting up a MySQL database, what is the difference between utf8_general_ci and utf8_unicode_ci?
    This MySQL forum post explains it well (http://forums.mysql.com/read.php?103,187048,188748#msg-188748):

    utf8_general_ci is a very simple collation. It just
    - removes all accents,
    - then converts to upper case,
    and compares strings using the code of the resulting “base letter”.
    For example, these Latin letters: ÀÁÅåāă (and all other Latin letters “a” with any accents and in any cases) are all compared as equal to “A”.
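
The two steps above (strip accents, then upper-case) can be sketched in a few lines of Python. This is only a rough model of the idea, not MySQL's actual weight table; the function name `approx_general_ci_key` is made up for illustration.

```python
import unicodedata

def approx_general_ci_key(text):
    """Rough model of the rule described above: drop accents, then upper-case.

    Hypothetical helper for illustration only; MySQL uses its own lookup table.
    """
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.upper()

# Every accented "a" from the example collapses to the same key.
keys = {approx_general_ci_key(ch) for ch in "ÀÁÅåāă"}
print(keys)
```

Two strings compare as equal under this model whenever their keys match, which is why "café" and "CAFE" would be treated as the same value.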

    utf8_unicode_ci uses the Default Unicode Collation Element Table (DUCET).

    The main differences are:
    1. utf8_unicode_ci supports so-called expansions and ligatures, for example:
    the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near “ss”, and
    the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near “OE”.
    utf8_general_ci does not support expansions or ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.

    2. utf8_unicode_ci is *generally* more accurate for all scripts. For example, in the Cyrillic block,
    utf8_unicode_ci is fine for all of these languages:
    Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian,
    while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
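
The expansion behaviour from point 1 can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the two tiny tables and the function names are hypothetical, and MySQL's real weight tables are far larger. The one-weight entry for ß reflects the behaviour where utf8_general_ci compares ß as a single “s”.

```python
import unicodedata

# Hypothetical, deliberately tiny tables (not MySQL's real data).
ONE_WEIGHT = {"ß": "S"}                          # one character, one weight
EXPANSIONS = {"ß": "SS", "Œ": "OE", "œ": "OE"}   # one character, several weights

def base_letter(ch):
    """Drop accents from a single character and upper-case it."""
    base = unicodedata.normalize("NFD", ch)[0]
    upper = base.upper()
    # Keep the key at one character even if upper-casing would expand it.
    return upper if len(upper) == 1 else base

def single_weight_key(text):
    """Model of a collation without expansions: one weight per character."""
    return "".join(ONE_WEIGHT.get(ch) or base_letter(ch) for ch in text)

def expanding_key(text):
    """Model of a collation with expansions: ß and Œ become two weights."""
    return "".join(EXPANSIONS.get(ch) or base_letter(ch) for ch in text)

words = ["Stratus", "Strasse", "Straße"]
by_single = sorted(words, key=single_weight_key)
by_expanding = sorted(words, key=expanding_key)
print(by_single)
print(by_expanding)
```

With the expanding key, “Straße” and “Strasse” get the same key and so sort together; with the single-weight key, “Straße” is treated like “Strase” and lands in a different position.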

    The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.

    So when you need a better sorting order, use utf8_unicode_ci; when performance matters above all else, use utf8_general_ci.