On 11/09/2010 02:31 AM, Carlos Henrique Reimer wrote:
Hi,
I'm currently in the process of moving the data from the Windows server
to the new Linux box but facing some problems with the encoding.
Additional configuration information: Windows is running PG 8.3 and the
new Linux box is PG 8.4.
Windows dump command:
pg_dump -U postgres -Fc -v -f "f:\backup
Linux restore command:
pg_restore -v -n brasil -d mapas /backup
pg_restore error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 3574; 0 40805 TABLE
DATA cidade
postgres
pg_restore: [archiver (db)] COPY failed: ERROR: character 0x81 of
encoding "WIN
1252" has no equivalent in "UTF8"
CONTEXT: COPY cidade, line 6
I also tried to dump using pg_dump -E UTF8 but then I got:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: character 0x81 of encoding
"WIN1252" has no equivalent in "UTF8"
pg_dump: The command was: COPY brasil.cidade (gid, "municpio",
"municpi0", uf, longitude, latitude, the_geom) TO stdout;
pg_dump: *** aborted because of error
How can I fix this error?
Well, that's pretty much evidence that the text you have on windows is
NOT win1252-encoded. Or some parts of it, at least. According to this page:
http://en.wikipedia.org/wiki/Windows-1252
0x81 is undefined in win1252.
Please note that if the old DB has SQL_ASCII encoding, you may have
mixed encoding text in the same database (either on purpose or by
mistake) and you have to either keep using SQL_ASCII in the new DB
(which means keeping potentially invalid data around), or do proper
conversion to UTF-8 (which possibly has do be done differently for
different tables, or even different rows in the same table) BEFORE
importing it in the new DB.
To convert from one encoding to another, you can use, surprisingly, the
'convert' function. The tricky part is that once you have text of
unknown encoding, you can only guess. That is, try to convert it from
some reasonable encodings to UTF-8, and look at the result. If it looks
right, that may be the right encoding. :)
BTW, 0x81 is a rather weird char, it's invalid in many common encodings.
It's valid in win-1251, where it represents U+0403 (CYRILLIC CAPITAL
LETTER GJE), Ð (if you don't have the font, you won't be able to see
it), but I think it's an unlikely candidate.
In both CP437 and CP850 (old DOS encodings for western languages) it
represents U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), Ã (again, I
hope you can see it). That's a better candidate, it's possible someone
accessed the DB either directly, or via a web application, from a
Windows ME or older system.
Unfortunately, AFAIK, PostgreSQL doesn't support the cp850 encoding.
Maybe you can dump the data (even a single table) in text format and use
a text editor (or shell utils like perl, sed, tr) to replace offending
charaters with their proper win-1252 counterparts (e.g., 0x81 can be
replaced with 0xfc, which is u with diaresis in win-1252).
I hope it helps.
.TM.
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general