rihad wrote: > Thanks, I'm a bit confused here. AFAIK indexes are used for at least two > things: for speed and for skipping the ORDER BY step (since btree > indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index > still work correctly for table lookups? If the lookup is based on a equality test or a pattern match with LIKE or a regexp, it makes no difference. But the locale makes a difference with inequality tests, such as < > or BETWEEN. Around version 9.1 and in the pre-ICU days, Robert Haas wrote this post that explained it pretty well, I think: http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html Quote: If you happen to need the particular sorting behavior that collation-aware sorting and comparisons provide, then you may find this price worth paying, but I suspect there are a lot of people out there who are paying it more or less accidentally and don't really care very much about the underlying sorting behavior. If, for example, all of your queries are based on equality, and you don't use greater-than or less-than tests, then it doesn't matter what collation is in use. You might as well use "C" instead of whatever your local default may be, because it's faster. For non-English text, I would recommend C.UTF-8 over "C" because of its better handling of Unicode characters. For instance: =# select upper('été' collate "C"), upper('été' collate "C.UTF-8"); upper | upper -------+------- éTé | ÉTÉ The "price" of linguistic comparisons that Robert mentioned was about performance, but the troubles we have with the lack of long-term immutability of these locales are worth being added to that. > And can the existing en_US.UTF-8 fields' definition be altered in > place, without a dump+restore? Changing the COLLATE clause of a column with ALTER TABLE does implicitly rebuild an index on this column if there is one, A dump+restore is not needed, nor an explicit REINDEX. The dump+restore is needed in another scenario, where you would decide to change the LC_COLLATE and LC_CTYPE of the database, instead of doing it only for some columns. This scenario makes perfect sense if the locale of the database has been set implicitly and it uses linguistic sorts without really needing them ("accidentally" as said in the post). > en_US.UTF-8 is the default encoding+locale+collation, it > isn't set explicitly for any of our string columns. I assume there's > some "catch-all" ordering taking place even for the C locale, so there > won't be any bizarre things like b coming before a, or generally for any > language, the second letter of its alphabet coming before the first? 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is true for some locales such as C or C.UTF-8. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite