Re: Encoding, Unicode, locales, etc.

Carlos Moreno <moreno_pg@xxxxxxxxxxx> · Wed, 01 Nov 2006 09:50:01 -0500

Thanks Tom, for your reply.

Tom Lane wrote:

Carlos Moreno <moreno_pg@xxxxxxxxxxx> writes:

Why is it that the database cluster is restricted to a single locale
(or single set of locales) instead of being configurable on a
per-database basis?

Because we depend on libc's locale support, which (on many platforms)
isn't designed to switch between locales cheaply  [...]

This stuff is certainly far from ideal, but the amount of work involved
to fix it is daunting; see many past pg-hackers discussions.

Fair enough --- and good to know.

2)  By the same token (more or less), I have a test database, for which
I ran initdb without specifying encoding or locale;  then I created a
database with UTF8 encoding.

There's no such thing as "you didn't specify a locale".  If you didn't
specify one on the initdb command line, then it was taken from the
environment.  Try "show lc_collate" and "show lc_ctype" to see what
got used.

Yes, that's what I meant --- I meant that I did not use the --locale or
-E command-line switches for the initdb command.  Both lc_ctype and
lc_collate show en_US.UTF-8.

I tried lower() on a string that
contains characters with accents  (e.g., Spanish or French characters),
and it works as it should according to Spanish or French rules --- it
returns a string with the same characters in lowercase, with the same
accent.  Why did that work?  My Linux machine has all en_US.UTF-8
locales, and en_US is not even aware of characters with accents,

You sure?  I'd sort of expect a UTF8 locale to know this stuff anyway.
In any case, Postgres doesn't know anything about case conversion
beyond what toupper/tolower tell it, so your experimental result is
sufficient proof that that locale includes these conversions.

Are you sure it has nothing to do with the way PostgreSQL interacts
with the C conversion functions?  I ask because, as part of a "sanity
check", I repeated the tests --- now with two machines;  one has
PG 8.1.4 and the other has 7.4.14, and they behave differently.

The one that does the case conversion "correctly" (read:  as I expect
it as per Spanish or French rules) is 8.1.4 with en_US locale (LC_CTYPE
and LC_COLLATE both showing en_US.UTF-8).  PG 7.4.14, *even with
locale es_ES*, does not do the case conversion  (characters with an
accent or tilde are left untouched).
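
To make the symptom concrete, here is a minimal Python sketch of one
way the two results could come about.  It assumes (this is only a
guess, not something confirmed in this thread) that one code path
lowercases a UTF-8 string one byte at a time, while the other
lowercases whole characters.  The function names are hypothetical and
this is not PostgreSQL's code.

```python
def lower_bytewise(text: str) -> str:
    """Lowercase one UTF-8 byte at a time, changing only ASCII A-Z.

    Hypothetical model of a single-byte conversion applied to
    multibyte data: the bytes of an accented letter are all >= 0x80,
    so they pass through unchanged.
    """
    raw = text.encode("utf-8")
    lowered = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in raw)
    return lowered.decode("utf-8")


def lower_by_character(text: str) -> str:
    """Lowercase whole characters (Python's Unicode-aware str.lower)."""
    return text.lower()


sample = "ÁRBOL ÑANDÚ"
print(lower_bytewise(sample))      # prints "Árbol ÑandÚ"
print(lower_by_character(sample))  # prints "árbol ñandú"
```

The byte-wise version leaves the accented and tilde characters
untouched, which matches what I see on 7.4.14; the character-wise
version matches what I see on 8.1.4.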

I wonder if someone could shed some light on this little mystery?
Perhaps to add more confusion to my experimental/informal tests, PG 8.1.4
is running on a FC4 AMD64 X2 box  (the command "locale" at the shell
prompt shows all en_US.utf8), and PG 7.4.14 is running on a laptop with
FC5 on an Intel Celeron M  (the command locale shows exactly the same
in that case).   Does this perhaps account for the difference?

Thanks,

Carlos