Re: How well does PostgreSQL 9.6.1 support unicode?

Kyotaro HORIGUCHI <horiguchi.kyotaro@xxxxxxxxxxxxx> · Wed, 21 Dec 2016 16:56:37 +0900 (Tokyo Standard Time)

Hello,

At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james@xxxxxxxxxx> wrote in <CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg@xxxxxxxxxxxxxx>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041');  -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters

These shouldn't be a problem.

> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- an emoji character, https://unicodelookup.com/#0x1f478/1

https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html

> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.

So U&'\1F478' is parsed as U+1F47 followed by the character '8',
as you have seen. It should be written as follows; a '+' is needed
just after the backslash.

insert into unicode(id, string) values(100, U&'\+01F478');

The six-digit form accepts code points up to U+10FFFF, so the
whole Unicode code space is usable.
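
To make the parsing rule concrete, here is a small Python sketch that
models the two escape forms quoted from the documentation. It is only a
simplified illustration, not PostgreSQL's actual lexer, and the helper
name decode_unicode_escapes is hypothetical.

```python
import re

# Simplified model of the U&'...' escape rule quoted above; an
# illustrative sketch, not PostgreSQL's actual lexer.
# Alternatives are tried in order: \+XXXXXX, \XXXX, then \\ (literal backslash).
_ESCAPE = re.compile(r"\\\+([0-9A-Fa-f]{6})|\\([0-9A-Fa-f]{4})|\\(\\)")


def decode_unicode_escapes(body):
    """Decode the body of a U&'...' literal (hypothetical helper)."""
    def replace(match):
        six, four, backslash = match.groups()
        if backslash:
            return "\\"
        return chr(int(six or four, 16))
    return _ESCAPE.sub(replace, body)


wrong = decode_unicode_escapes(r"\1F478")    # four digits consumed, '8' left over
right = decode_unicode_escapes(r"\+01F478")  # six-digit form, one code point

print([hex(ord(c)) for c in wrong])  # ['0x1f47', '0x38']
print([hex(ord(c)) for c in right])  # ['0x1f478']
```

The first call yields two characters (U+1F47 and '8'), which matches
the wrong lengths and substrings reported for the non-BMP rows.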

> Observations
> 
>    - BMP characters (id <= 10)
>       -  they are stored and fetched correctly.
>       - their lengths in char are correct, although some of them take 3
>       bytes (id = 4, 6, 7)
>       - *But their sorting order seems to be undefined. Can anyone comment
>       on the sorting rules?*
>    - Non-BMP characters (id >= 100)
>       - they take 2 - 4 bytes.
>       - Their lengths in character are not correct
>       - they are not retrieved correctly, judged by their fetched ascii
>       value (column 5 in the table above)
>       - substring is not correct

> 
> Specifically, the lack of support for emoji characters 0x1F478, 0x1F479 is
> causing a problem in my application.

Writing the escapes in the six-digit form, with '+' just after the
backslash, would resolve the problem.

> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP.  Is
> there any documents defining which subset is fully supported?

A PostgreSQL database with encoding=UTF8 simply accepts the whole
range of Unicode, regardless of whether a character is defined for
the code point or not.
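
As a rough illustration outside the database, the Python sketch below
shows what UTF-8 storage implies for the code points used in this
thread: each is a single character, but takes one to four bytes. In
PostgreSQL the corresponding checks would be length() and
octet_length(); the helper name utf8_size is hypothetical.

```python
# Code points taken from the examples in this thread.
samples = {
    "U+0041": "\u0041",       # 'A', BMP
    "U+6211": "\u6211",       # a Chinese character, BMP
    "U+1F478": "\U0001F478",  # emoji, beyond the BMP
}


def utf8_size(ch):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    print(name, "characters:", len(ch), "bytes:", utf8_size(ch))
```

So a non-BMP character that was entered correctly should report a
length of 1 character and 4 bytes.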

> Is there any configuration I can change so that more unicode characters are
> supported?

As for sorting and categorization, those are described in Tom's
mail.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
