Locale collation issues

Jim Gates <James.Gates@xxxxxxx> · Fri, 26 Oct 2007 17:25:14 -0400

I'm investigating a problem reported by a customer where it seems the 
PostgreSQL server collation sequence for certain locales don't match the 
operating system's collation sequence (for the same locale set during 
initdb).

I can reproduce this on both 8.1.9 & 8.2.5. I'm using Solaris Nevada, 
but I also see the same issue on Solaris 10.

1) First I have a text file containing some LATIN1 (ISO8859-1) 
characters in random order (followed by their decimal byte code):

$ cat barf
Ö       214
A       65
a       97
Å       197
ä       228
Ä       196

2) Perform ascending byte order & dictionary sorts (on 1st field only) 
with the "C" locale (Note that the results are the same when using a 
POSIX locale):

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

$ sort +1 -0 barf
A       65
a       97
Ä       196
Å       197
Ö       214
ä       228

$ sort +0 -1 -d barf
Ä       196
Å       197
Ö       214
ä       228
A       65
a       97

3) Perform ascending byte order & dictionary sorts (on 1st field only) 
with the customer's locale (Note that the results are the same when 
using any ISO8859 locale):

$ locale
LANG=sv_SE.ISO8859-1
LC_CTYPE="sv_SE.ISO8859-1"
LC_NUMERIC="sv_SE.ISO8859-1"
LC_TIME="sv_SE.ISO8859-1"
LC_COLLATE="sv_SE.ISO8859-1"
LC_MONETARY="sv_SE.ISO8859-1"
LC_MESSAGES="sv_SE.ISO8859-1"
LC_ALL=

$ sort +1 -0 barf
A       65
a       97
Ä       196
Å       197
Ö       214
ä       228

$ sort +0 -1 -d barf
a       97
A       65
Å       197
ä       228
Ä       196
Ö       214

(Note that in the CLDR definition for 8859-1 locales, lower case letters 
appear before the equivalent upper case letter in a dictionary sort)

4) Now create a db with the appropriate locale:

$ locale
LANG=sv_SE.ISO8859-1
LC_CTYPE="sv_SE.ISO8859-1"
LC_NUMERIC="sv_SE.ISO8859-1"
LC_TIME="sv_SE.ISO8859-1"
LC_COLLATE="sv_SE.ISO8859-1"
LC_MONETARY="sv_SE.ISO8859-1"
LC_MESSAGES="sv_SE.ISO8859-1"
LC_ALL=

$ initdb
The files belonging to this database system will be owned by user 
"postgres".
This user must also own the server process.

The database cluster will be initialized with locale sv_SE.ISO8859-1.
The default database encoding has accordingly been set to LATIN1.

fixing permissions on existing directory 
/var/tmp/postgres/8.2.5/sv_SE.ISO8859-1 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers/max_fsm_pages ... 32MB/204800
creating configuration files ... ok
creating template1 database in 
/var/tmp/postgres/8.2.5/sv_SE.ISO8859-1/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
.... etc ....

5) Check the correct collation sequence & char set in psql:

postgres=# \l
        List of databases
   Name    |  Owner   | Encoding
-----------+----------+----------
 postgres  | postgres | LATIN1
 template0 | postgres | LATIN1
 template1 | postgres | LATIN1
(3 rows)

postgres=# show lc_collate;
   lc_collate
-----------------
 sv_SE.ISO8859-1
(1 row)

6) And try the same sort on a table containing the same characters:

postgres=# select barf,ascii(barf) from jim order by barf asc;
 barf | ascii
------+-------
 Å    |   197
 ä    |   228
 Ä    |   196
 Ö    |   214
 a    |    97
 A    |    65
(6 rows)

Notice that the results don't match the operating system's (either byte 
code or dictionary) sort order for the same locale, or even the C or 
POSIX locales.

What is most unusual, is that 'Ö' appears before 'a' or 'A', which you 
wouldn't expect for a dictionary style sort, but this clearly isn't a 
byte code order either.

In fact, I can't tell where this order is derived from. I thought that 
calling initdb with a particular locale meant that the database used all 
aspects of the operating systems locale, including collation sequence. 
This is implied in the docs at:

http://www.postgresql.org/docs/8.2/interactive/sql-select.html#SQL-ORDERBY

> Character-string data is sorted according to the locale-specific
> collation order that was established when the database cluster was
> initialized.

I'm probably missing (or misunderstanding) something, since I don't 
fully understand PostgreSQL's localization. Can someone please explain?

Many thanks.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly