Mixing different LC_COLLATE and database encodings

Bill Moseley <moseley@xxxxxxxx> · Sat, 18 Feb 2006 08:10:19 -0800

I've been going through the docs and list archives trying to get
clear on encoding issues, but still have a few question.

Do I have these statements correct?

- LC_COLLATE is set on the cluster at initdb time.  From that point
on all database text is sorted based on that *regardless* of the
encoding set on an individual database.

- So for lc_collate="C" sorting is just based on the byte values, and
if lc_collate="en_US" then sorting is based on the us_US order, and
the bytes are assumed to be 8859-1 (if that matters).

- To clarify the first point, if the database is encoded utf-8 and
lc_collate is en_US then Postgresql does NOT try to convert utf-8 to
8859-1 before sorting.

- If the "client encoding" and the database encoding differ then
Postgresql will convert between the two encodings during I/O.

- The exception is if *either* the client or the server's encoding is
"SQL_ASCII" then no client<=>server conversion is done.

Sound about right?

1) What else is the database's encoding used for besides to determine
how to convert text in input and output based on the client encoding?

2) What client encoding is used if the client does not specify one?
For example, I'm looking through Perl's DBD::Pg and I don't see any
place where it calls PQsetClientEncoding().

  http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/
  http://www.postgresql.org/docs/7.4/interactive/multibyte.html#MULTIBYTE-TRANSLATION-TABLE

3) The vast majority of my utf-8 encoded text that I need to display
sorted probably maps to 8859-1 characters.

I think I already answered this above, but:

Am I correct that Postgresql is *not* converting text from the
database encoding to the cluster encoding before sorting?  That is
with "C" it's just sorting in byte order, and with en_US it's just
assuming that the bytes are 8859-1 and ignoring that it's really
utf-8?

That is, if I have text that's in utf-8 but includes characters that
would map to 8859-1 (say accented chars), that sorting will not be
correct because it's not converted to 8859-1 when sorting?

4) If the above is true, then if I wanted my utf-8 encoded text to be
sorted correctly then I'd need to re-initdb using --encoding=en_US.UTF-8,
correct?

5) I suppose there's not way to answer this, short of running
benchmarks, but any ideas what using a lc_collate with utf-8 would do
to performance?  Is it a big hit?

Not related to Postgresql, but testing some of this is confusing
due to my environment.  How do I get my xterm to work with utf8?
Does ssh do something with encoding?

If I have a utf8 xterm window open on my machine, then ssh to
the server running postgresql where the default locale is "POSIX"
Then running:

    LANG=en_US.utf8 psql utf8test
    utf8test=> \encoding
    UNICODE

    utf8test=> select first_name from person where last_name = 'Anderson';

Then I see:

    Zo<C3><AB>

But, if on that same remote machine I run a unicode xterm (uxterm in
Debian) then in that xterm window I do:

    utf8test=> \encoding
    UNICODE
    utf8test=> select first_name from person where last_name = 'Anderson';
    Zoë  (correct)

It's must slower running xterm remotely than using my local xterm and
ssh, so it would be nice to be able to display the utf8.

-- 
Bill Moseley
moseley@xxxxxxxx