Search Postgresql Archives

Re: Linux x Windows LOCALE/ENCODING compatibility

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 11/09/2010 02:31 AM, Carlos Henrique Reimer wrote:
Hi,

I'm currently in the process of moving the data from the Windows server
to the new Linux box but facing some problems with the encoding.

Additional configuration information: Windows is running PG 8.3 and the
new Linux box is PG 8.4.

Windows dump command:
pg_dump -U postgres -Fc -v -f "f:\backup

Linux restore command:
pg_restore -v -n brasil -d mapas /backup


pg_restore error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 3574; 0 40805 TABLE
DATA cidade
  postgres
pg_restore: [archiver (db)] COPY failed: ERROR:  character 0x81 of
encoding "WIN
1252" has no equivalent in "UTF8"
CONTEXT:  COPY cidade, line 6

I also tried to dump using pg_dump -E UTF8 but then I got:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  character 0x81 of encoding
"WIN1252" has no equivalent in "UTF8"
pg_dump: The command was: COPY brasil.cidade (gid, "municpio",
"municpi0", uf, longitude, latitude, the_geom) TO stdout;
pg_dump: *** aborted because of error

How can I fix this error?


Well, that's pretty much evidence that the text you have on windows is NOT win1252-encoded. Or some parts of it, at least. According to this page:

http://en.wikipedia.org/wiki/Windows-1252

0x81 is undefined in win1252.

Please note that if the old DB has SQL_ASCII encoding, you may have mixed encoding text in the same database (either on purpose or by mistake) and you have to either keep using SQL_ASCII in the new DB (which means keeping potentially invalid data around), or do proper conversion to UTF-8 (which possibly has do be done differently for different tables, or even different rows in the same table) BEFORE importing it in the new DB.

To convert from one encoding to another, you can use, surprisingly, the 'convert' function. The tricky part is that once you have text of unknown encoding, you can only guess. That is, try to convert it from some reasonable encodings to UTF-8, and look at the result. If it looks right, that may be the right encoding. :)

BTW, 0x81 is a rather weird char, it's invalid in many common encodings. It's valid in win-1251, where it represents U+0403 (CYRILLIC CAPITAL LETTER GJE), Ð (if you don't have the font, you won't be able to see it), but I think it's an unlikely candidate.

In both CP437 and CP850 (old DOS encodings for western languages) it represents U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), Ã (again, I hope you can see it). That's a better candidate, it's possible someone accessed the DB either directly, or via a web application, from a Windows ME or older system.

Unfortunately, AFAIK, PostgreSQL doesn't support the cp850 encoding. Maybe you can dump the data (even a single table) in text format and use a text editor (or shell utils like perl, sed, tr) to replace offending charaters with their proper win-1252 counterparts (e.g., 0x81 can be replaced with 0xfc, which is u with diaresis in win-1252).

I hope it helps.

.TM.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux