Re: encoding question

"Ben K." <bkim@xxxxxxxxxxxx> · Tue, 21 Mar 2006 10:39:28 -0600 (CST)

> ERROR:  invalid UTF-8 byte sequence detected near byte 0x85
> Looks to me like it might have been meant as LATIN1 or one of
> the other single-byte ASCII-extension encodings.

Thanks. Indeed, the data contains non-ASCII characters, which SQL_ASCII
wouldn't cover, I see now.
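
To see why byte 0x85 trips the UTF-8 check, here is a minimal Python sketch. Trying Windows-1252 as a candidate source encoding is only an assumption for illustration; nothing in this thread establishes where the byte came from.

```python
# 0x85 is 1000 0101 in binary: a UTF-8 continuation byte, which is
# never valid at the start of a character.
raw = b"\x85"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(raw))            # False
# Single-byte encodings accept this byte but disagree on its meaning:
print(repr(raw.decode("latin-1")))   # '\x85' (a C1 control character)
print(repr(raw.decode("cp1252")))    # '…' (horizontal ellipsis)
```

So the same byte is a control character under LATIN1 and a printable ellipsis under Windows-1252, which is why guessing the source encoding matters before converting.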

I never suspected there'd be non-ASCII characters in the data, since we
cleanse it before script-loading, but we use other input methods too, so I
am not sure where they came from.

I didn't specify an encoding when running initdb during the upgrade to 8.1.0.
I think that is where I could have prevented this problem, but I'm not sure.

I suspect so because of this article (at least for locale C; I did not
specify an encoding and got UTF-8 on Linux with en_US.UTF-8). Is it
valid for 8.1.0?

http://www.commandprompt.com/ppbook/x17149
"ENCODING = encoding
...
If the ENCODING keyword is unspecified, PostgreSQL will create a
database using its default encoding. This is usually SQL_ASCII, though
it may have been set to a different default during the initial
configuration of PostgreSQL (see Chapter 2 for more on default
encoding)."

And I'm getting this from pgAdmin III. I guess this is the reason why
you all say to avoid SQL_ASCII?

"Database encoding The database ... is created to store data using
the SQL_ASCII encoding. This encoding is defined for 7 bit characters
only; the meaning of characters with the 8th bit set (non-ASCII
characters 127-255) is not defined. Consequently, it is not possible for
the server to convert the data to other encodings. If you're storing
non-ASCII data in the database, you're strongly encouraged to use a
proper database encoding representing your locale character set to take
benefit from the automatic conversion to different client encodings when
needed. If you store non-ASCII data in an SQL_ASCII database, you may
encounter weird characters written to or read from the database, caused
by code conversion problems. This may cause you a lot of headache when
accessing the database using different client programs and drivers. For
most installations, Unicode (UTF8) encoding will provide the most
flexible capabilities."
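
To make the "weird characters" concrete, here is a small Python sketch of what happens when bytes written by a client using one encoding are read back by a client assuming another, with no conversion in between. The sample string is made up, and this only imitates the effect outside the database.

```python
text = "café"  # made-up sample value

# One client writes LATIN1 bytes; they are stored untouched.
stored = text.encode("latin-1")          # b'caf\xe9'

# A second client assumes UTF-8 and cannot decode them.
try:
    stored.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False

# The reverse case: UTF-8 bytes read as LATIN1 decode without error,
# but the result is garbled.
garbled = text.encode("utf-8").decode("latin-1")   # 'cafÃ©'

print(stored, readable, garbled)
```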

Could anyone comment on whether the method at this URL is valid and
reasonably safe? (At this time the problem seems almost harmless, apart from
a few records not being loaded, but it will need to be fixed.)

http://archives.postgresql.org/pgsql-general/2004-02/msg01192.php

The method: dump the database, recode the dump, drop the database, and
restore from the recoded dump.
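
The "recode the dump" step could be sketched in Python roughly as follows. It assumes the whole dump is in a single known source encoding (LATIN1 here, which is only a guess for this data); the function name and the file names are hypothetical.

```python
def recode_dump(src_path: str, dst_path: str, src_encoding: str = "latin-1") -> int:
    """Rewrite a text dump as UTF-8 and return the number of lines written.

    Decoding is strict: for encodings with unassigned bytes (e.g. cp1252),
    invalid input raises UnicodeDecodeError instead of being replaced.
    """
    count = 0
    # newline="" keeps the original line endings unchanged
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
            count += 1
    return count

# Hypothetical usage:
# recode_dump("dump_sql_ascii.sql", "dump_utf8.sql", "latin-1")
```

Note that LATIN1 assigns a character to every byte value, so with that choice the conversion never fails; a wrong guess about the source encoding simply goes unnoticed.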

In particular, does anyone have experience with recode vs. manual inspection?
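
As a middle ground between a blind recode and inspecting everything by hand, one could first list only the lines that are not valid UTF-8 and review those. A minimal Python sketch (the function name and file name are hypothetical):

```python
def find_invalid_utf8_lines(path: str):
    """Yield (line_number, raw_bytes) for each line that is not valid UTF-8."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield lineno, raw

# Hypothetical usage: print suspect lines for manual review.
# for lineno, raw in find_invalid_utf8_lines("dump.sql"):
#     print(lineno, raw)
```

Pure ASCII lines are always valid UTF-8, so only lines containing problematic bytes are reported.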

I'm just reasoning from pieces of information. I'd appreciate any advice
or experience.

Regards,

Ben K.
Developer
http://benix.tamu.edu