Re: client_encoding / postgresql strange behaviour

Michael Fuhr <mike@xxxxxxxx> · Tue, 16 Sep 2008 08:39:20 -0600

On Tue, Sep 16, 2008 at 05:15:41AM -0600, Enrico Sabbadin wrote:
> Hi, as far as I've understood client_encoding tells postgresql 
> how data "will arrive / must be send back" to the client application. 
> Postgresql will do the conversion for you if the database encoding is different. 

Correct.

> I've a unicode database and the line 
> client_encoding=latin1 in the postgresql configuration file 
> (this is the same as calling set client_encoding='latin1' on any connection right?).

Correct.

> i insert one row containing the Hex 92 characther using the npgsql net driver : 
> I sniff the wire and i see that that only the hex 92 bvbyte is sent.
> 
> i insert one row containing the Hex 92 characther using the npgsql net driver 
> setting Encoding=UNICODE in the cnstring: 
> I sniff the wire and i see that that 3 bytes are sent : the coorect equivalent
> in unicode (i see that a set client_ecoeding=UNICODE is sent).

That's wrong: if the original encoding is latin1 then 0x92 should
have a 2-byte UTF-8 sequence of 0xc2 0x92, which represents <U+0092
PRIVATE USE TWO>, a C1 control character.  I'd guess that the
original encoding is really win1252 (aka Windows Latin 1), where
0x92 would have a 3-byte UTF-8 sequence of 0xe2 0x80 0x99, representing
<U+2019 RIGHT SINGLE QUOTATION MARK>.  Is that the character you're
expecting?

> Now i want to read the data using : 
> 1) if i read the first row keeping client_encoding latin1 everything is
>    OK (i get hex 92 back)

The database converted 0x92 to 0xc2 0x92 on the way in, then converted
that character back to 0x92 on the way out.  Your display is apparently
interpreting this character the way you expect (presumably as <U+2019
RIGHT SINGLE QUOTATION MARK>) even though the database didn't interpret
it that way.

> 2) if i read the 2nd row setting client_encoding unicode everything is
>    OK (i get the 3 unicode bytes)

Since you sent a 3-byte UTF-8 sequence the database is apparently
storing the character you expect, presumably <U+2019 RIGHT SINGLE
QUOTATION MARK>.

> 3) if i read the first row setting client_encoding unicode IT DOES NOT
>    WORK I get a 2 byte sequence c2 92 ?? so i see garbage.

The database stored the first row as 0xc2 0x92 <U+0092 PRIVATE USE TWO>
because client_encoding was set to latin1.  When you retrieved that
character with client_encoding set to UNICODE (UTF8 in recent
versions of Postgres) you got 0xc2 0x92 back.

> WHY this does not work (can't i write in one encoding and read in another?) 
> Looks like postgresql did not correctly converted from latin1 to unicode
> and has gargabage inside (but why do i read correctly the first line when
> i have client_encoding latin1) ? 

Postgres probably did perform the correct conversion but your display
is really something other than latin1, probably win1252 or another
Windows encoding.  Try setting client_encoding to win1252, which is
supported in 8.1 and later.  What version are you running?  Since you
refer to UNICODE (8.0 and earlier) instead of to UTF8 (8.1 and later)
it's possible that you're running a version that doesn't support
win1252.

-- 
Michael Fuhr