Search Postgresql Archives

new FAQ entry (was:Re: UTF8 problem)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Matthew T. O'Connor wrote:
Well, to answer my own question, I hacked the source code of DBMail and had it set the client encoding to LATIN1 immediately after database connect, this seems to have fixed the problem.

Sorry for the noise,

Matt

I've seen this sort of problem asked about in the mailing lists often enough to think it merits a FAQ entry, so how about this text:

<entry>
Q. Why do I have problems inserting text into my database, with error messages like

ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?

A. Almost certainly that byte sequence really is an invalid byte sequence for that encoding. The reason you are seeing the error is probably because you are providing text in some other encoding. You and the database need to agree between you what encoding you're using. PostgreSQL is fairly good at working with you, converting to and from whatever encoding you want to use, but you need to tell it what that encoding is, and then stick to that encoding consistently.

If you don't set the client encoding, then PostgreSQL will use the default encoding for the database, which in modern times is often UTF8 (aka UNICODE), and is set at database creation time. However, many client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so you need to either educate the client app to use UTF8, or get it to inform PostgreSQL what other encoding to use.

The way to tell PostgreSQL what encoding you want to use is by use of the client_encoding GUC variable, eg

set client_encoding to 'LATIN1';

One reason you may be seeing this problem now, after upgrading your version of PostgreSQL, is that recent versions have tighter validation of encoded text. Previously you may not have been conscious of what encoding you were actually using, especially if you're a speaker of a Western European language, and may have gotten away with writing incorrectly-encoded text without the database complaining. Now is the time to start getting it right.

One thing to be wary of is the "SQL_ASCII" encoding. It appears to be commonly and incorrectly believed that this represents either some variant on latin1, or pure 7-bit ASCII. It is neither of those, but a completely unchecked encoding that really means whatever you want it to mean. This makes it not a very good encoding to use in practice, as it becomes prone to allowing a mixture of different encodings to be present in the same set of data, which will cause you headaches when you try to convert the whole lot to some consistent encoding in the future.

See section 21.2 of the documentation for more complete information.
</entry>

Tim

--
-----------------------------------------------
Tim Allen          tim@xxxxxxxxxxxxxxxx
Proximity Pty Ltd  http://www.proximity.com.au/


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux