I've been going through the docs and list archives trying to get clear on encoding issues, but still have a few question. Do I have these statements correct? - LC_COLLATE is set on the cluster at initdb time. From that point on all database text is sorted based on that *regardless* of the encoding set on an individual database. - So for lc_collate="C" sorting is just based on the byte values, and if lc_collate="en_US" then sorting is based on the us_US order, and the bytes are assumed to be 8859-1 (if that matters). - To clarify the first point, if the database is encoded utf-8 and lc_collate is en_US then Postgresql does NOT try to convert utf-8 to 8859-1 before sorting. - If the "client encoding" and the database encoding differ then Postgresql will convert between the two encodings during I/O. - The exception is if *either* the client or the server's encoding is "SQL_ASCII" then no client<=>server conversion is done. Sound about right? 1) What else is the database's encoding used for besides to determine how to convert text in input and output based on the client encoding? 2) What client encoding is used if the client does not specify one? For example, I'm looking through Perl's DBD::Pg and I don't see any place where it calls PQsetClientEncoding(). http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/ http://www.postgresql.org/docs/7.4/interactive/multibyte.html#MULTIBYTE-TRANSLATION-TABLE 3) The vast majority of my utf-8 encoded text that I need to display sorted probably maps to 8859-1 characters. I think I already answered this above, but: Am I correct that Postgresql is *not* converting text from the database encoding to the cluster encoding before sorting? That is with "C" it's just sorting in byte order, and with en_US it's just assuming that the bytes are 8859-1 and ignoring that it's really utf-8? That is, if I have text that's in utf-8 but includes characters that would map to 8859-1 (say accented chars), that sorting will not be correct because it's not converted to 8859-1 when sorting? 4) If the above is true, then if I wanted my utf-8 encoded text to be sorted correctly then I'd need to re-initdb using --encoding=en_US.UTF-8, correct? 5) I suppose there's not way to answer this, short of running benchmarks, but any ideas what using a lc_collate with utf-8 would do to performance? Is it a big hit? Not related to Postgresql, but testing some of this is confusing due to my environment. How do I get my xterm to work with utf8? Does ssh do something with encoding? If I have a utf8 xterm window open on my machine, then ssh to the server running postgresql where the default locale is "POSIX" Then running: LANG=en_US.utf8 psql utf8test utf8test=> \encoding UNICODE utf8test=> select first_name from person where last_name = 'Anderson'; Then I see: Zo<C3><AB> But, if on that same remote machine I run a unicode xterm (uxterm in Debian) then in that xterm window I do: utf8test=> \encoding UNICODE utf8test=> select first_name from person where last_name = 'Anderson'; Zoë (correct) It's must slower running xterm remotely than using my local xterm and ssh, so it would be nice to be able to display the utf8. -- Bill Moseley moseley@xxxxxxxx