Re: Upgrading locale issues

"Daniel Verite" <daniel@xxxxxxxxxxxxxxxx> · Fri, 03 May 2019 15:35:47 +0200

	rihad wrote:

> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two 
> things: for speed and for skipping the ORDER BY step (since btree 
> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index 
> still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference.  But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

  If you happen to need the particular sorting behavior that
  collation-aware sorting and comparisons provide, then you may find
  this price worth paying, but I suspect there are a lot of people out
  there who are paying it more or less accidentally and don't really
  care very much about the underlying sorting behavior.  If, for
  example, all of your queries are based on equality, and you don't
  use greater-than or less-than tests, then it doesn't matter what
  collation is in use.	You might as well use "C" instead of whatever
  your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
 upper | upper 
-------+-------
 éTé   | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

> And can the existing en_US.UTF-8 fields' definition be altered in
> place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).

> en_US.UTF-8 is the default encoding+locale+collation, it 
> isn't set explicitly for any of our string columns. I assume there's 
> some "catch-all" ordering taking place even for the C locale, so there 
> won't be any bizarre things like b coming before a, or generally for any 
> language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite