Search Postgresql Archives

Re: Upgrading locale issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 05/03/2019 05:35 PM, Daniel Verite wrote:
	rihad wrote:

Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups?
If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference.  But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

   If you happen to need the particular sorting behavior that
   collation-aware sorting and comparisons provide, then you may find
   this price worth paying, but I suspect there are a lot of people out
   there who are paying it more or less accidentally and don't really
   care very much about the underlying sorting behavior.  If, for
   example, all of your queries are based on equality, and you don't
   use greater-than or less-than tests, then it doesn't matter what
   collation is in use.	You might as well use "C" instead of whatever
   your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
  upper | upper
-------+-------
  éTé   | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

And can the existing en_US.UTF-8 fields' definition be altered in
place, without a dump+restore?
Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).


en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?
'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.


Best regards,


Thanks a lot for sharing your insights!






[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux