Re: unconvertible characters

Michael Fuhr <mike@xxxxxxxx> · Mon, 16 Jul 2007 08:19:10 -0600

On Mon, Jul 16, 2007 at 04:20:22PM +0300, Sim Zacks wrote:
> My 8.0.1 database is using ISO_8859_8 encoding. When I select specific 
> fields I get a warning:
> WARNING:  ignoring unconvertible ISO_8859_8 character 0x00c2

Did any of the data originate on Windows?  Might the data be in
Windows-1255 or some encoding other than ISO-8859-8?  In Windows-1255
0xc2 represents <U+05B2 HEBREW POINT HATAF PATAH> -- does that
character seem correct in the context of the data?

http://en.wikipedia.org/wiki/Windows-1255
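
One way to check that hypothesis outside the database is to decode the
offending byte under both encodings.  This is only a sketch using Python's
standard codecs ("iso8859_8" and "cp1255" are Python's codec names, not
PostgreSQL's, and the helper name describe is made up for illustration):

```python
import unicodedata

SUSPECT = bytes([0xC2])  # the byte PostgreSQL complained about


def describe(raw: bytes, encoding: str) -> str:
    """Decode a single byte in `encoding` and return its code point and
    Unicode name, or "undefined" if the codec has no mapping for it."""
    try:
        char = raw.decode(encoding)
    except UnicodeDecodeError:
        return "undefined"
    return f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"


# Compare how the two candidate encodings interpret the same byte.
for enc in ("iso8859_8", "cp1255"):
    print(enc, "->", describe(SUSPECT, enc))
```

If the byte is undefined in one encoding but decodes to a plausible Hebrew
character in the other, that suggests the data was never ISO-8859-8 to
begin with.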

> I now want to upgrade my database to 8.2.4 and change the encoding to UTF-8.
> When the restore is done, I get the following errors:
> pg_restore: restoring data for table "manufacturers_old"
> pg_restore: [archiver (db)] Error from TOC entry 4836; 0 9479397 TABLE DATA 
> manufacturers postgres
> pg_restore: [archiver (db)] COPY failed: ERROR:  character 0xc2 of encoding 
> "ISO_8859_8" has no equivalent in "UTF8"
> CONTEXT:  COPY manufacturers_old, line 331
> 
> And no data is put into the table.
> Is there a function I can use to replace the unconvertible characters with 
> blanks?

If the data is really in some encoding other than ISO-8859-8, you
could have pg_restore write its SQL output to a file (or pipe it
through a filter) instead of restoring directly, change the
"SET client_encoding" line to whatever the encoding really is, and
then feed the result to psql.  For example, if the data is
Windows-1255 then you'd use the following:

SET client_encoding TO win1255;
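
A filter for that could look roughly like the following.  It's a minimal
sketch, not a tested tool: the function name rewrite_client_encoding and
the sample dump lines (including the column name) are invented for
illustration, and it assumes the dump has a single "SET client_encoding"
statement on a line of its own.  It works on bytes so that the table data
passes through untouched.

```python
import re
from typing import Iterable, Iterator

# Matches lines such as "SET client_encoding = 'ISO_8859_8';"
# or "SET client_encoding TO iso_8859_8;" (case-insensitive).
_SET_ENCODING = re.compile(
    rb"^\s*SET\s+client_encoding(?:\s*=\s*|\s+TO\s+)[^;]+;\s*$",
    re.IGNORECASE,
)


def rewrite_client_encoding(lines: Iterable[bytes], encoding: str) -> Iterator[bytes]:
    """Yield the dump lines unchanged, except that the first
    SET client_encoding statement is replaced with `encoding`."""
    replaced = False
    for line in lines:
        if not replaced and _SET_ENCODING.match(line):
            yield b"SET client_encoding TO " + encoding.encode("ascii") + b";\n"
            replaced = True
        else:
            yield line


# Hypothetical fragment of a plain-text dump, for demonstration only.
sample = [
    b"SET client_encoding = 'ISO_8859_8';\n",
    b"COPY manufacturers_old (name) FROM stdin;\n",
    b"\xe0\xc2\xe1\n",
    b"\\.\n",
]
fixed = list(rewrite_client_encoding(sample, "win1255"))
print(fixed[0].decode("ascii"), end="")
```

To use it as a real filter you would pass sys.stdin.buffer as the input and
write the results to sys.stdout.buffer, then put the script between
pg_restore and psql in a pipeline.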

Another possibility would be to run the data through a command like
iconv to convert it to UTF-8 and strip unconvertible characters; on
many systems you could do that with "iconv -f iso8859-8 -t utf-8 -c"
(the -c option discards characters that can't be converted).  If you
convert to UTF-8 then you'd also need to change the client_encoding
setting to UTF8, so that it matches the converted data.
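
If iconv isn't available, roughly the same effect can be had with Python's
codecs.  This is a sketch under the assumption that silently dropping the
unconvertible bytes is acceptable; the function name and the sample bytes
are made up for illustration, and errors="ignore" plays the role of
iconv's -c here.

```python
def to_utf8_dropping_invalid(raw: bytes, source_encoding: str = "iso8859_8") -> bytes:
    """Convert `raw` from `source_encoding` to UTF-8, discarding any
    bytes that have no mapping in the source encoding."""
    return raw.decode(source_encoding, errors="ignore").encode("utf-8")


# Alef, the problematic 0xc2 byte, then bet (hypothetical sample data).
sample = b"\xe0\xc2\xe1"
converted = to_utf8_dropping_invalid(sample)
print(converted)
```

Keep in mind that this throws data away: if the bytes are really
Windows-1255 vowel points, changing client_encoding instead would
preserve them.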

-- 
Michael Fuhr