Sim Zacks wrote:
> We originally tested it on MySQL and now we are migrating it
> to PostgreSQL.
>
> The messages are stored in a longblob field on MySQL and a bytea field
> in PostgreSQL.
>
> I set the database up as UTF-8, even though we get emails that are not
> UTF encoded, mostly because I didn't know what else to try that would
> incorporate all the possible encodings. Examples of 3 encodings we
> regularly receive are: UTF-8, Windows-1255, ISO-8859-8-I.
[...]
> It would not transfer through the dbi-link, so I wrote a Python script
> (see below) to read a row from MySQL and write a row to PostgreSQL
> (using PyGreSQL and MySQLdb).
> When I used PyGreSQL's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function, it died because the data it was
> moving was not UTF-8.
>
> I finally got it to work by defining a database as SQL_ASCII and then
> using escape_string worked. After the data was all in place, I pg_dumped
> and pg_restored into a UTF-8 database, and it surprisingly works now.

It's very difficult to know what exactly happened unless you have some
examples of a byte sequence that illustrates what you describe:
how it looked in MySQL, how it looked in your Python script, and what
you fed to escape_bytea.

What client encoding did you use in your Python script?

Yours,
Laurenz Albe
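
To make the distinction concrete, here is a minimal, illustrative re-implementation of PostgreSQL's classic octal "escape" format for bytea. It is not PyGreSQL's actual escape_bytea, just a sketch of the idea: a bytea escape works byte by byte and never decodes the data, while a text-oriented escape (like escape_string under a UTF-8 client encoding) must treat the input as valid text, which fails on Windows-1255 bytes.

```python
def escape_bytea(data: bytes) -> str:
    """Escape raw bytes into PostgreSQL's classic octal bytea input form.

    Illustrative sketch only -- not PyGreSQL's real implementation.
    """
    out = []
    for b in data:
        if b == 0x5C:                              # backslash -> \\
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E and b != 0x27:      # printable ASCII passes through
            out.append(chr(b))
        else:                                      # quote and non-printables -> \nnn
            out.append("\\%03o" % b)
    return "".join(out)


# A Windows-1255 byte sequence (Hebrew text) that is NOT valid UTF-8:
raw = "שלום".encode("windows-1255")

# escape_bytea handles it byte by byte; no decoding is needed:
print(escape_bytea(raw))

# A text-oriented escape would first have to decode the bytes,
# which is exactly where a UTF-8 database rejects the data:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e.reason)
```

This is why copying the raw message bytes through escape_bytea "went smoothly" regardless of encoding, and why escape_string only worked once the database was SQL_ASCII (which performs no encoding validation at all).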