On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce@xxxxxxxxxx> wrote:
> No surprise; I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely. We really have no way of testing
> for breakage either. :-(

I guess that Trip Advisor were using some particular collation that
had a chance of changing. Sorting rules for English text (so, say,
en_US.UTF-8) are highly unlikely to change. That might be much less
true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame. :-(

LC_IDENTIFICATION serves this purpose on glibc. Here is what en_US
looks like on my machine:

"""
escape_char /
comment_char %

% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper@xxxxxxxxxx>, 2000

LC_IDENTIFICATION
title "English locale for the USA"
source "Free Software Foundation, Inc."
address "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact ""
email "bug-glibc-locales@xxxxxxx"
tel ""
fax ""
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_IDENTIFICATION
category "en_US:2000";LC_CTYPE
category "en_US:2000";LC_COLLATE
category "en_US:2000";LC_TIME
category "en_US:2000";LC_NUMERIC
category "en_US:2000";LC_MONETARY
category "en_US:2000";LC_MESSAGES
category "en_US:2000";LC_PAPER
category "en_US:2000";LC_NAME
category "en_US:2000";LC_ADDRESS
category "en_US:2000";LC_TELEPHONE

*** SNIP ***
"""

This is a GNU extension [1].
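
For what it's worth, the revision and date fields shown above can also be read at run time. What follows is only a rough sketch, assuming glibc: it relies on the GNU-specific nl_langinfo items _NL_IDENTIFICATION_REVISION and _NL_IDENTIFICATION_DATE, which are not portable. The helper name locale_revision() is made up for illustration and is not anything in PostgreSQL. Note also that the revision string describes the locale definition as a whole, so it is at best a coarse hint that collation rules may have changed.

```c
#define _GNU_SOURCE             /* exposes the _NL_IDENTIFICATION_* items */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper (glibc only): write "<revision> (<date>)" from the
 * named locale's LC_IDENTIFICATION category into buf.
 *
 * Returns 0 on success, or -1 if the locale cannot be loaded or the
 * result does not fit in buf.
 */
int
locale_revision(const char *locale_name, char *buf, size_t buflen)
{
    /* Load the locale privately rather than changing the process locale. */
    locale_t    loc = newlocale(LC_ALL_MASK, locale_name, (locale_t) 0);
    int         n;

    if (loc == (locale_t) 0)
        return -1;

    n = snprintf(buf, buflen, "%s (%s)",
                 nl_langinfo_l(_NL_IDENTIFICATION_REVISION, loc),
                 nl_langinfo_l(_NL_IDENTIFICATION_DATE, loc));

    freelocale(loc);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

A caller could store the string returned for, say, "en_US.UTF-8" alongside an index and compare it after an OS upgrade. A mismatch would only be a signal to re-check or rebuild; a match would not prove that the sort order is unchanged.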

If the OS adds a new version of a collation, that probably
accidentally works a lot of the time, because the collation rule added
or removed was fairly esoteric anyway, such is the nature of these
things. If it was something that came up a lot, it would surely have
been settled by standardization years ago.

If OS vendors are not going to give us a standard API for versioning,
we're hosed. I thought about suggesting that we hash a strxfrm() blob
for about 2 minutes, before realizing that that's a stupid idea. Glibc
would be a good start.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html

--
Regards,
Peter Geoghegan