Hello,

At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james@xxxxxxxxxx> wrote in <CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg@xxxxxxxxxxxxxx>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041'); -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters

These shouldn't be a problem.

> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo character, https://unicodelookup.com/#0x1f478/1

https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html

> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.

So this is parsed as U+1F47 followed by '8', as you have seen. It should
be written as follows; a '+' is needed just after the backslash.

insert into unicode(id, string) values(100, U&'\+01F478');

The six-digit form accepts code points up to U+10FFFF, so the entire
Unicode code space is usable.

> Observations
>
>    - BMP characters (id <= 10)
>       - they are stored and fetched correctly.
>       - their lengths in char are correct, although some of them take 3
>       bytes (id = 4, 6, 7)
>       - *But their sorting order seems to be undefined. Can anyone comment
>       the sorting rules?*
>    - Non-BMP characters (id >= 100)
>       - they take 2 - 4 bytes.
>       - Their lengths in character are not correct
>       - they are not retrieved correctly, judged by the their fetched ascii
>       value (column 5 in the table above)
>       - substring is not correct
>
> Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
> causing a problem in my application.

Using the '+' form would resolve the problem.

> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP. Is
> there any documents defining which subset is fully supported?

A PostgreSQL database with encoding=UTF8 simply accepts the whole range
of Unicode, regardless of whether a character is assigned to the code
point or not.

> Are any configuration I can change so that more unicode characters are
> supported?

As for sorting, that is discussed in Tom's mail.

--
Kyotaro Horiguchi
NTT Open Source Software Center
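
P.S. To make the escape rule from the documentation concrete, here is a small Python sketch of how a U&'...' body is split into characters. It is only a simplified model of the documented rule, not PostgreSQL's actual lexer: the helper name decode_uescape is made up for this example, only the default backslash escape character is handled, and malformed escapes are passed through instead of raising an error as the server would.

```python
import re

# Simplified model of the U&'...' escape rule from the PostgreSQL docs:
#   \XXXX      backslash + four hex digits
#   \+XXXXXX   backslash + plus sign + six hex digits
#   \\         a literal backslash
# decode_uescape is a hypothetical helper, not part of PostgreSQL.
_ESCAPE = re.compile(r"\\(?:\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4})|(\\))")


def decode_uescape(body):
    """Decode the body of a U&'...' literal (default escape character only)."""
    def repl(m):
        if m.group(3):
            return "\\"
        # Exactly one of the two hex groups matched.
        return chr(int(m.group(1) or m.group(2), 16))
    # Anything that is not a well-formed escape is left untouched here;
    # the real server would reject a malformed escape instead.
    return _ESCAPE.sub(repl, body)


wrong = decode_uescape(r"\1F478")    # four-digit form consumes only 1F47
right = decode_uescape(r"\+01F478")  # six-digit form yields one code point

print([hex(ord(c)) for c in wrong])
print([hex(ord(c)) for c in right])
print(len(right), len(right.encode("utf-8")))
```

The first literal decodes to two characters (U+1F47 followed by the digit '8'), while the second decodes to the single code point U+1F478, which occupies 4 bytes in UTF-8.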