Re: new FAQ entry (was:Re: UTF8 problem)

Bruce Momjian <bruce@xxxxxxxxxx> · Mon, 21 Aug 2006 23:37:31 -0400 (EDT)

Instead of adding an FAQ entry, which might not be found when the error
is generated, I added a HINT for 8.2 that will appear with the error
message:

    errmsg("invalid byte sequence for encoding \"%s\": 0x%s",
           pg_enc2name_tbl[encoding].name,
           buf),
    errhint("This failure can also happen if the byte sequence does not "
            "match the encoding expected by the server, which is controlled "
            "by \"client_encoding\".")));

Supplying information at the point of error is usually the best
solution, if possible.

Backpatched to 8.1.X as well.

---------------------------------------------------------------------------

Tim Allen wrote:
> Matthew T. O'Connor wrote:
> > Well, to answer my own question, I hacked the source code of DBMail and
> > had it set the client encoding to LATIN1 immediately after connecting
> > to the database; this seems to have fixed the problem.
> > 
> > Sorry for the noise,
> > 
> > Matt
> 
> I've seen this sort of problem asked about in the mailing lists often 
> enough to think it merits a FAQ entry, so how about this text:
> 
> <entry>
> Q. Why do I have problems inserting text into my database, with error 
> messages like
> 
> ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?
> 
> A. Almost certainly that byte sequence really is invalid for that
> encoding. The most likely reason you are seeing the error is that you
> are providing text in some other encoding. You and the database need to
> agree on which encoding you are using. PostgreSQL is fairly good at
> working with you, converting to and from whatever encoding you want to
> use, but you need to tell it what that encoding is, and then stick to
> that encoding consistently.
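
To make the point above concrete, here is a minimal sketch in Python. It
uses Python's own codecs as a stand-in for the server's validation (an
assumption; PostgreSQL's checker is separate code), and the helper name
is_valid is hypothetical. It shows that the bytes from the sample error
message are not valid UTF8 but are perfectly good Latin1:

```python
# The three bytes reported in the sample error message: 0xe1202c
raw = bytes.fromhex("e1202c")


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0xe1 opens a three-byte UTF-8 sequence, but 0x20 is not a continuation
# byte, so strict UTF-8 decoding rejects the input.
utf8_ok = is_valid(raw, "utf-8")

# In Latin1 every byte is a character: 0xe1 is a-acute, 0x20 is a space,
# 0x2c is a comma.
latin1_ok = is_valid(raw, "latin-1")
latin1_text = raw.decode("latin-1")

print(utf8_ok, latin1_ok, repr(latin1_text))
```

So the client in this case was most likely sending Latin1 text to a
server that expected UTF8.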
> 
> If you don't set the client encoding, PostgreSQL assumes the client uses
> the database's encoding, which is chosen at database creation time and
> is nowadays often UTF8 (aka UNICODE). However, many client apps still
> use other encodings (e.g. Latin1, aka ISO-8859-1), so you need to either
> make the client app use UTF8 or have it tell PostgreSQL which encoding
> it is using.
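
As a rough illustration of the first option (making the client send
UTF8), the sketch below transcodes Latin1 bytes to UTF8 before they
would be handed to the database. It assumes the input really is Latin1;
the function name latin1_to_utf8 is hypothetical and not part of any
PostgreSQL client library.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin1 bytes as UTF-8.

    Assumes the input really is Latin1. Latin1 decoding never fails, so
    wrongly labelled input is converted silently rather than rejected.
    """
    return data.decode("latin-1").encode("utf-8")


# "cafe" with an e-acute, as a Latin1 client would produce it: one byte
# (0xe9) for the accented letter.
legacy = "caf\u00e9".encode("latin-1")

# After conversion the accented letter takes two bytes (0xc3 0xa9).
converted = latin1_to_utf8(legacy)

print(legacy, converted)
```

The other option, leaving the data alone and setting client_encoding,
lets the server do the same conversion for you.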
> 
> To tell PostgreSQL which encoding you want to use, set the
> client_encoding GUC variable, e.g.
> 
> set client_encoding to 'LATIN1';
> 
> One reason you may be seeing this problem now, after upgrading your 
> version of PostgreSQL, is that recent versions have tighter validation 
> of encoded text. Previously you may not have been conscious of what 
> encoding you were actually using, especially if you're a speaker of a 
> Western European language, and may have gotten away with writing 
> incorrectly-encoded text without the database complaining. Now is the 
> time to start getting it right.
> 
> One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
> commonly and incorrectly believed that this represents either some
> variant of Latin1, or pure 7-bit ASCII. It is neither of those, but a
> completely unchecked encoding that really means whatever you want it to
> mean. This makes it a poor choice in practice, because it allows a
> mixture of different encodings to end up in the same set of data, which
> will cause you headaches when you try to convert the whole lot to some
> consistent encoding in the future.
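
The headache described above can be sketched in Python. The rows list
stands in for a column in an unchecked SQL_ASCII database that was
written by clients using different encodings; the data and the classify
helper are made up for illustration.

```python
# The same word stored by two different clients, plus a pure ASCII row.
rows = [
    "na\u00efve".encode("utf-8"),    # written by a UTF-8 client
    "na\u00efve".encode("latin-1"),  # written by a Latin1 client
    b"plain ascii",
]


def classify(data: bytes) -> str:
    """Guess how a stored byte string was encoded."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8"


labels = [classify(r) for r in rows]

# Decoding everything as Latin1 never raises an error, but it silently
# garbles the row that was really UTF-8.
garbled = rows[0].decode("latin-1")

print(labels, repr(garbled))
```

No single conversion handles all three rows correctly, which is why
cleaning up such data later means inspecting it row by row.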
> 
> See section 21.2 of the documentation for more complete information.
> </entry>
> 
> Tim
> 
> -- 
> -----------------------------------------------
> Tim Allen          tim@xxxxxxxxxxxxxxxx
> Proximity Pty Ltd  http://www.proximity.com.au/
> 

-- 
  Bruce Momjian   bruce@xxxxxxxxxx
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +