Vick Khera <vivek@xxxxxxxxx> writes: > On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI < > horiguchi.kyotaro@xxxxxxxxxxxxx> wrote: >> A PostgreSQL database with encoding=UTF8 just accepts the whole >> range of Unicode, regardless that a character is defined for the >> code or not. > Interesting... when I converted my application and database to utf8 > encoding, I discovered that Postgres is picky about UTF-8. Specifically the > UTF-8 code point 0xed 0xa0 0x8d which maps to UNICODE code point 0xd80d. > This looks like a proper character but in fact is not a defined character > code point. Well, we're picky to the extent that RFC 3629 tells us to be picky: http://www.faqs.org/rfcs/rfc3629.html The case you mention is rejected because it would be half of a UTF16 "surrogate pair", which should not be used in any Unicode representation other than UTF16; if we allowed it then there would be more than one way to represent the same Unicode code point, which is undesirable for a lot of reasons. > So I think when you present an actual string of UTF8 encoded characters, > Postgres does refuse characters unknown. However, as you observe, inserting > the unicode code point directly does not produce an error: > insert into unicode(id, string) values(1, U&'\d80d'); > INSERT 0 1 Hm. I think that's a bug. The lexer does know that \d80d is half of a surrogate pair, and it expects the second half to come next. If you supply something that isn't the second half of a surrogate pair, you get an error as expected: u8=# insert into unicode(id, string) values(1, U&'\d80dfoo'); ERROR: invalid Unicode surrogate pair at or near "foo'" LINE 1: insert into unicode(id, string) values(1, U&'\d80dfoo'); ^ But it looks like if you just end the string after the first half of a surrogate, it just drops the character without complaint. Notice that what got inserted was a zero-length string, not U+D08D: u8=# select *, length(string) from unicode; id | string | length ----+--------+-------- 1 | | 0 (1 row) I'd have expected a syntax error along the line of "incomplete Unicode surrogate pair". Peter, I think this was your code to begin with --- was it intentional to not raise error here, or is that an oversight? regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general