Re: Collate order on Mac OS X, text with diacritics in UTF-8

On 13/01/2010 11:15 PM, Martin Flahault wrote:

> It seems there is a problem with the collating order on BSD systems with
> diacritics using UTF8.
> If you put this text :
> a
> A
> à
> é
> e
> E
>
> in a UTF8 text file and use the "sort" command on it, you will have the
> same wrong output as with PostgreSQL :
> A
> E
> a
> e
> à
> é

First: PostgreSQL expects the OS to behave correctly and sort according to the locale. It relies on the C library for this. If the C library doesn't do it right, PostgreSQL won't do it right either. So you need to get Mac OS X to do the right thing.

Your results match what I get on a Linux system without a properly generated fr_FR.UTF-8 locale. Libc falls back on the "C" locale, which sorts that way.
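
To see why the "C" locale gives that order: it compares raw byte values. In UTF-8 the uppercase ASCII letters (0x41-0x5A) come before the lowercase ones (0x61-0x7A), and both come before the two-byte sequences for à (0xC3 0xA0) and é (0xC3 0xA9). A rough sketch that shows this; the helper names sample_lines and c_sorted are made up for illustration:

```shell
# Emit the six sample lines as explicit UTF-8 bytes (octal escapes), so the
# result does not depend on how this snippet itself is encoded:
# \303\240 is "à" and \303\251 is "é".
sample_lines() {
    printf 'a\nA\n\303\240\n\303\251\ne\nE\n'
}

# Sort by raw byte value; LC_ALL=C overrides LANG and every other LC_* setting.
c_sorted() {
    sample_lines | LC_ALL=C sort
}

# Print each sorted line next to the hex value of its first byte.
c_sorted | while IFS= read -r line; do
    first=$(printf '%s' "$line" | od -An -tx1 | awk '{print $1}')
    printf '%s %s\n' "$first" "$line"
done
```

Any locale that silently falls back to "C" will sort the same way, which is why this order is a useful fingerprint for a missing or broken locale.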

If I generate the fr_FR.UTF-8 locale and run the sort (on the file "x"), I get the desired result:

LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
a
A
à
e
E
é

I don't know Mac OS X well, but this is making me wonder if maybe you're just missing the required information for the locale, so libc is falling back on the "C" locale.
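
One way to check whether a locale is installed at all is to look for it in the output of "locale -a". A minimal sketch, assuming a POSIX shell; normalize_locale and has_locale are hypothetical helper names, and the normalization is only there because some systems list the name as fr_FR.utf8 rather than fr_FR.UTF-8:

```shell
# Lowercase a locale name and drop hyphens, so that "fr_FR.UTF-8" and
# "fr_FR.utf8" compare equal.
normalize_locale() {
    printf '%s\n' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Succeeds (exit 0) if the locale named in $1 appears in "locale -a".
has_locale() {
    wanted=$(normalize_locale "$1")
    locale -a 2>/dev/null \
        | LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | sed 's/-//g' \
        | LC_ALL=C grep -qxF -- "$wanted"
}

if has_locale fr_FR.UTF-8; then
    echo "fr_FR.UTF-8 is listed"
else
    echo "fr_FR.UTF-8 is not listed"
fi
```

Being listed only means the locale's files exist; it says nothing about whether the collation data inside them is meaningful.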

(Of course, being Mac OS X, there are probably at least three out-of-date or simply false "man" pages describing the behaviour, none of which reflect the reality of a magic config key buried somewhere in NetInfo, for which the documentation is also completely out of date. Bitter? Me? Yeah, I admin a bunch of OS X machines on a business network.)

Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least *thinks* it supports the fr_FR.UTF-8 locale:

osx104$ LANG=xxx LC_ALL=xxx locale
LANG="xxx"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL="fr_FR.UTF-8"

osx104$ locale -a  | grep fr_FR
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8

... yet it clearly doesn't:

osx104$ LANG=C LC_ALL=C sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x
A
E
a
e
à
é
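
The comparison above can be scripted so it is easy to repeat on other machines or OS versions. This is only a sketch: collates_differently_from_c is a made-up helper name, and the sample is the same six lines written as explicit UTF-8 bytes.

```shell
# Succeeds (exit 0) if sorting the sample under the locale named in $1 gives
# a different order than the byte-wise sort under "C".
collates_differently_from_c() {
    # \303\240 is "à" and \303\251 is "é" in UTF-8.
    sample=$(printf 'a\nA\n\303\240\n\303\251\ne\nE\n')
    in_c=$(printf '%s\n' "$sample" | LC_ALL=C sort)
    in_loc=$(printf '%s\n' "$sample" | LC_ALL="$1" sort 2>/dev/null)
    [ "$in_c" != "$in_loc" ]
}

if collates_differently_from_c fr_FR.UTF-8; then
    echo "fr_FR.UTF-8 has its own collation order"
else
    echo "fr_FR.UTF-8 sorts exactly like C"
fi
```

A result of "sorts exactly like C" cannot tell a missing locale apart from one whose collation data is empty, so it is worth pairing with a look at "locale -a".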

Mac OS X seems to keep its locale config in /usr/share/locale. Looking there, there are clearly LC_COLLATE files for fr_FR.UTF-8. However, they're identical to those for en_US.UTF-8:

osx104$ cd /usr/share/locale
osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE

... so your OS's localized collation support is broken/missing, at least if the same is true for more modern versions of OS X.
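
To see how widespread this is, the same comparison can be run against every locale directory at once. A sketch only: same_collate_as is a hypothetical helper, and /usr/share/locale is simply the path seen on the 10.4 machine above, so it may differ elsewhere.

```shell
# Print the name of every locale under directory $1 whose LC_COLLATE file is
# byte-identical to the LC_COLLATE file of the reference locale $2.
same_collate_as() {
    dir=$1
    ref=$2
    for f in "$dir"/*/LC_COLLATE; do
        [ -f "$f" ] || continue                 # skip if the glob matched nothing
        name=$(basename "$(dirname "$f")")
        [ "$name" = "$ref" ] && continue        # do not compare the reference with itself
        if cmp -s "$f" "$dir/$ref/LC_COLLATE"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example: which locales share en_US.UTF-8's collation data?
same_collate_as /usr/share/locale en_US.UTF-8
```

If fr_FR.UTF-8 shows up in that list, its collation data adds nothing over en_US.UTF-8.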

--
Craig Ringer
