Bill Moseley <moseley@xxxxxxxx> writes:

> $ LC_ALL=en_US.UTF-8 locale charmap
> UTF-8
>
> $ LC_ALL=en_US locale charmap
> ISO-8859-1
>
> $ LC_ALL=C locale charmap
> ANSI_X3.4-1968

Unfortunately Postgres only supports a single collation cluster-wide. So
depending on which of the locales above you use, you would really have to
select either UTF-8, ISO-8859-1, or SQL_ASCII (ie ANSI_X3.4-1968) to match.
Anything else and the collation just won't work properly: it will be
expecting UTF-8 but be fed ISO-8859-1 strings, resulting in weird and
sometimes inconsistent sort orders.

There's a certain amount of feeling that using any locale other than C is
probably never the right thing given the current functionality. Just about
any database has some strings in it that are really just ASCII, like
char(1) primary keys and other internal database strings. You may not want
them subject to the locale's collation for comparison purposes, and you may
not want the overhead of variable-width character encodings.

Those of us in this camp define all our databases using the C locale and
then use the pg_strxfrm() function that's been floating around the list for
a while to handle sorting strings that need to be sorted in various
locales. This performs acceptably (but not spectacularly) under glibc, but
it's not clear which other libc implementations it works well under. It
also doesn't solve the whole problem, since functions like substr() and
LIKE are locale-sensitive too.

If you need an encoding like UTF-8, then you're stuck either pushing all
your string manipulations into the client or going ahead with a non-C
locale and UTF-8 even for the strings that are really just ASCII.

-- 
greg