new FAQ entry (was: Re: UTF8 problem)

Tim Allen <tim@xxxxxxxxxxxxxxxx> · Thu, 08 Jun 2006 17:00:22 +1000

Matthew T. O'Connor wrote:
> Well, to answer my own question, I hacked the source code of DBMail and
> had it set the client encoding to LATIN1 immediately after database
> connect; this seems to have fixed the problem.
>
> Sorry for the noise,
>
> Matt

I've seen this sort of problem asked about in the mailing lists often 
enough to think it merits a FAQ entry, so how about this text:

<entry>
Q. Why do I have problems inserting text into my database, with error 
messages like

ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?

A. Almost certainly that byte sequence really is invalid in that
encoding. The most likely reason for the error is that you are
providing text in some other encoding. You and the database need to
agree on which encoding is in use. PostgreSQL is fairly good at working
with you, converting to and from whatever encoding you want to use, but
you need to tell it what that encoding is, and then stick to that
encoding consistently.
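
To see why the server rejects those bytes, it can help to inspect them
outside the database. The following is a minimal sketch in Python,
chosen purely for illustration; the helper name is_valid_utf8 is
hypothetical and nothing in it is PostgreSQL-specific. It checks the
three bytes from the error message against UTF-8, then reads them as
Latin-1, on the assumption that this is what the client actually sent.

```python
# The three bytes quoted in the error message: 0xe1 0x20 0x2c.
raw = bytes.fromhex("e1202c")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xE1 announces a three-byte UTF-8 character, but 0x20 (a space) is not
# a continuation byte, so the sequence is not valid UTF-8.
utf8_ok = is_valid_utf8(raw)

# Assumption: the client was really sending Latin-1 (ISO-8859-1).
# Read that way, the same bytes are ordinary text.
as_latin1 = raw.decode("latin-1")

# Re-encoding that text as UTF-8 yields bytes that are valid UTF-8.
fixed = as_latin1.encode("utf-8")

print(utf8_ok, repr(as_latin1), fixed.hex())
```

If the bytes turn out to be sensible text in some other encoding, that
is a strong hint about which encoding the client is really using.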

If you don't set the client encoding, PostgreSQL assumes the client
uses the database's own encoding, which is set at database creation
time and these days is often UTF8 (aka UNICODE). However, many client
apps still use other encodings (e.g. Latin1, aka ISO-8859-1), so you
need either to educate the client app to use UTF8, or to get it to
inform PostgreSQL which other encoding it is using.

The way to tell PostgreSQL which encoding you are using is to set the
client_encoding configuration (GUC) variable, e.g.

set client_encoding to 'LATIN1';

One reason you may be seeing this problem now, after upgrading your 
version of PostgreSQL, is that recent versions have tighter validation 
of encoded text. Previously you may not have been conscious of what 
encoding you were actually using, especially if you're a speaker of a 
Western European language, and may have gotten away with writing 
incorrectly-encoded text without the database complaining. Now is the 
time to start getting it right.

One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
commonly and incorrectly believed that this represents either some
variant of Latin1, or pure 7-bit ASCII. It is neither of those, but a
completely unchecked encoding that really means whatever you want it to
mean. This makes it a poor encoding to use in practice, because it
allows a mixture of different encodings to end up in the same set of
data, which will cause you headaches when you try to convert the whole
lot to some consistent encoding in the future.
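
The headache can be shown with a small sketch, again in Python and
again purely illustrative: the list of byte strings merely stands in
for rows stored without any encoding check, and decode_all is a
hypothetical helper. Two clients store the same word, one as Latin-1
and one as UTF-8, and no single encoding then decodes both rows
correctly.

```python
# Stand-in for an unchecked column: the same word ("caf" + e-acute),
# stored by two clients that each used their own encoding.
word = "caf\u00e9"
rows = [
    word.encode("latin-1"),  # b'caf\xe9'
    word.encode("utf-8"),    # b'caf\xc3\xa9'
]


def decode_all(data, encoding):
    """Decode every row with one encoding; None marks a row that fails."""
    out = []
    for row in data:
        try:
            out.append(row.decode(encoding))
        except UnicodeDecodeError:
            out.append(None)
    return out


# Treating everything as UTF-8 fails outright on the Latin-1 row.
as_utf8 = decode_all(rows, "utf-8")

# Treating everything as Latin-1 never fails, but silently garbles the
# UTF-8 row, because every byte value is a valid Latin-1 character.
as_latin1 = decode_all(rows, "latin-1")

print(as_utf8)
print(as_latin1)
```

The second case is the more dangerous one: nothing reports an error,
yet one of the rows is now wrong.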

See section 21.2 of the documentation (Character Set Support) for more
complete information.
</entry>

Tim

--
-----------------------------------------------
Tim Allen          tim@xxxxxxxxxxxxxxxx
Proximity Pty Ltd  http://www.proximity.com.au/