Re: error while trying to change the database encoding on a database

Adrian Klaver <adrian.klaver@xxxxxxxxx> · Mon, 24 Jan 2011 12:44:58 -0800

On 01/24/2011 10:57 AM, Geoffrey Myers wrote:
Adrian Klaver wrote:
On 01/24/2011 09:16 AM, Geoffrey Myers wrote:

We hope to identify the characters and fix them in the existing
database, then convert. It appears to be very limited, but it would help
if there was some way to identify these characters outside of simply
doing the reload of the data and finding the errors.

Hence the reason I asked about a resource that might identify the
characters.

The problem is that from the standpoint of the SQL_ASCII database
there is nothing wrong with the characters per se. AFAIK there is no
built in function to validate characters. The reason is that valid is
determined by the encoding and if you know the encoding then you
really don't need to determine validity. If you want to see one way
others have tackled this, search on iconv in the mailing list archive.
This requires working on an external copy of the data and knowing
something about the encodings involved. The nearest I could ever find
to an encoding detector is:

http://chardet.feedparser.org/

It is a Python program and the encodings it detects are limited but it
might work for you.

Given all the above, when I was faced with the problem you are facing
I found it easiest to make an educated guess as to the original
encoding and then do test restores with client_encoding set to my guess.

Understood. We had figured the problem to be small, and it appears it is
and thus felt we could address it a character at a time. Then get this
error:

pg_restore: [archiver (db)] Error from TOC entry 5258; 0 17549 TABLE
DATA fax postgres
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence
for encoding "UTF8": 0xe28053

That hex value doesn't translate to a single character. I've dumped the
data to a file as you suggested, but reviewing the identified line
brings no joy.

The only thing I can think of is to use iconv like:

iconv -c -t utf8 -f utf8 -o converted_txt.txt 'original.txt'

where original.txt is your plain text data dump. The -c switch causes 
iconv not to convert any illegal characters.

You could then run a diff against converted_txt.txt and 'original.txt' 
to see what characters in the original text are causing the problem.

--
Adrian Klaver
adrian.klaver@xxxxxxxxx

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general