Re: The dangers of streaming across versions of glibc: A cautionary tale

On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce@xxxxxxxxxx> wrote:
> No surprise;  I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely.  We really have no way of testing
> for breakage either.  :-(

I guess that TripAdvisor was using a collation whose rules had a real
chance of changing. Sorting rules for English text (say,
en_US.UTF-8) are highly unlikely to change. That may be much less
true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame.  :-(
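
To make the failure mode concrete, here is a minimal sketch of what
happens when a sorted structure built under one collation is searched
under another. Both key functions are invented for illustration and
are not real glibc collations, and the sorted list only stands in for
an index; nothing here is PostgreSQL code.

```python
from bisect import bisect_left


def old_key(s):
    # Hypothetical "old" collation: plain code point order.
    return s


def new_key(s):
    # Hypothetical "new" collation: case-insensitive first,
    # code point order only as a tie-breaker.
    return (s.lower(), s)


def index_lookup(sorted_items, target, key):
    # Binary search that trusts sorted_items to be ordered by `key`.
    keys = [key(item) for item in sorted_items]
    pos = bisect_left(keys, key(target))
    return pos < len(sorted_items) and sorted_items[pos] == target


words = ["apple", "Banana", "cherry", "Date", "elder"]

# The "index" is built once, under the old rules.
index = sorted(words, key=old_key)

found_with_old_rules = index_lookup(index, "Banana", old_key)
found_with_new_rules = index_lookup(index, "Banana", new_key)
```

The entry is still physically present in the list, but the search that
assumes the new ordering walks to the wrong position and misses it.
Nothing errors out, which is why a version marker is needed to notice
the change at all.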

LC_IDENTIFICATION serves this versioning purpose on glibc. Here is
what the en_US locale definition looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper@xxxxxxxxxx>, 2000

LC_IDENTIFICATION
title      "English locale for the USA"
source     "Free Software Foundation, Inc."
address    "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact    ""
email      "bug-glibc-locales@xxxxxxx"
tel        ""
fax        ""
language   "English"
territory  "USA"
revision   "1.0"
date       "2000-06-24"
%
category  "en_US:2000";LC_IDENTIFICATION
category  "en_US:2000";LC_CTYPE
category  "en_US:2000";LC_COLLATE
category  "en_US:2000";LC_TIME
category  "en_US:2000";LC_NUMERIC
category  "en_US:2000";LC_MONETARY
category  "en_US:2000";LC_MESSAGES
category  "en_US:2000";LC_PAPER
category  "en_US:2000";LC_NAME
category  "en_US:2000";LC_ADDRESS
category  "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""

This is a GNU extension [1]. If the OS ships a new version of a
collation, things probably keep working by accident a lot of the time,
because the collation rule that was added or removed was fairly
esoteric anyway; such is the nature of these things. If the rule were
something that came up a lot, it would surely have been settled by
standardization years ago.

If OS vendors are not going to give us a standard API for versioning,
we're hosed. I spent about 2 minutes thinking about suggesting that we
hash a strxfrm() blob, before realizing that it's a stupid idea.
Supporting glibc's versioning would be a good start.
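
As a rough sketch of that glibc starting point, the revision, date
and LC_COLLATE category tag can be pulled out of locale definition
text like the en_US excerpt above. The function names and the idea of
combining the three fields into one tuple are my own assumptions, not
an existing glibc or PostgreSQL interface. The parser also assumes
"%" is the comment character, as declared in that file.

```python
import re

# Excerpt of the en_US locale definition quoted earlier in this message.
SAMPLE = '''\
LC_IDENTIFICATION
title      "English locale for the USA"
source     "Free Software Foundation, Inc."
language   "English"
territory  "USA"
revision   "1.0"
date       "2000-06-24"
%
category  "en_US:2000";LC_IDENTIFICATION
category  "en_US:2000";LC_CTYPE
category  "en_US:2000";LC_COLLATE
'''


def parse_identification(text):
    """Return (fields, categories) from an LC_IDENTIFICATION section."""
    fields, categories = {}, {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "LC_IDENTIFICATION":
            in_section = True
            continue
        if line.startswith("END LC_IDENTIFICATION"):
            break
        # Skip anything outside the section, blank lines and comments.
        if not in_section or not line or line.startswith("%"):
            continue
        m = re.match(r'category\s+"([^"]*)";(\w+)$', line)
        if m:
            categories[m.group(2)] = m.group(1)
            continue
        m = re.match(r'(\w+)\s+"([^"]*)"$', line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields, categories


def collation_version(text):
    # Hypothetical version fingerprint: (revision, date, LC_COLLATE tag).
    fields, categories = parse_identification(text)
    return (fields.get("revision"), fields.get("date"),
            categories.get("LC_COLLATE"))


version = collation_version(SAMPLE)
```

A stored fingerprint could then be compared against the one computed
at startup. This is only as good as the metadata: nothing in the
sketch checks that the fields are actually bumped when the collation
rules change.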

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
-- 
Regards,
Peter Geoghegan


-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



