Re: Unicode problem again

Michael Fuhr <mike@xxxxxxxx> · Thu, 26 Jun 2008 08:41:07 -0600

On Thu, Jun 26, 2008 at 03:31:01PM +0200, Albe Laurenz wrote:
> Michael Fuhr wrote:
> > Your input data seems to have a mix of encodings: sometimes you're
> > getting pound signs in a non-UTF-8 encoding, but if characters like
> > <U+2019 RIGHT SINGLE QUOTATION MARK> got into the database when
> > client_encoding was set to UTF8 then at least some data must have
> > been in UTF-8.
> 
> Sorry, but that's not true.
> That character is 0x9s in WINDOWS-1252.

I think you mean 0x92.

> So it could have been that client_encoding was (correctly) set to WIN1252
> and the quotation mark was entered as a single byte character.

Yes, *if* client_encoding was set to win1252.  However, in the
following thread Garry said that he was getting encoding errors
when entering the pound sign that were resolved by changing
client_encoding (I suggested latin1, latin9, or win1252; he doesn't
say which he used):

http://archives.postgresql.org/pgsql-general/2008-06/msg00526.php

If client_encoding had been set to win1252 then Garry wouldn't have
gotten encoding errors when entering the pound sign, because that
character is the single byte 0xa3 in win1252 (also in latin1 and
latin9); a lone 0xa3 is not valid UTF-8, where the pound sign is
0xc2 0xa3.  So either
applications are setting client_encoding to different values,
sometimes correctly and sometimes incorrectly (Garry, do you know
if that could be happening?), or the data is sometimes in different
encodings.  If the data is being entered via a web application then
the latter seems more likely, at least in my experience (I've had
to deal with exactly this problem recently).
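
To make the byte values above concrete, here is a minimal Python sketch
(not from the original thread).  The helper names is_valid_utf8 and
decode_mixed are hypothetical, and the "try UTF-8, fall back to
win1252" rule is only a heuristic I'm assuming for illustration, not
something PostgreSQL does for you:

```python
# The two characters discussed in this thread.
QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK
POUND = "\u00a3"  # POUND SIGN

# How each character is encoded (Python calls win1252 "cp1252").
print(QUOTE.encode("cp1252"))      # single byte 0x92
print(QUOTE.encode("utf-8"))       # three bytes 0xe2 0x80 0x99
print(POUND.encode("cp1252"))      # single byte 0xa3
print(POUND.encode("latin-1"))     # single byte 0xa3
print(POUND.encode("iso8859-15"))  # latin9: single byte 0xa3
print(POUND.encode("utf-8"))       # two bytes 0xc2 0xa3


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_mixed(data: bytes) -> str:
    """Hypothetical cleanup helper: assume UTF-8 if the bytes are valid
    UTF-8, otherwise assume win1252.  This is a guess per value, not a
    guarantee, and it cannot fix one value that mixes both encodings."""
    if is_valid_utf8(data):
        return data.decode("utf-8")
    return data.decode("cp1252")


# A lone 0xa3 is what a win1252/latin1 client sends for the pound sign;
# it is not valid UTF-8, which is why it is rejected when
# client_encoding is UTF8.
print(is_valid_utf8(b"\xa3"))      # False
print(is_valid_utf8(b"\xc2\xa3"))  # True
```

Running something like decode_mixed over incoming form data before it
reaches the database is one way to cope with a web application that
receives both encodings, but setting client_encoding to match what the
client really sends is still the proper fix.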

-- 
Michael Fuhr