Re: error while trying to change the database encoding on a database

Adrian Klaver <adrian.klaver@xxxxxxxxx> · Mon, 24 Jan 2011 08:20:16 -0800

On Monday 24 January 2011 8:06:38 am Geoffrey Myers wrote:
> Adrian Klaver wrote:
> > On Monday 24 January 2011 7:57:52 am Geoffrey Myers wrote:
> >> Adrian Klaver wrote:
> >>> On Monday 24 January 2011 6:38:55 am Geoffrey Myers wrote:
> >>>> We need to change the database encoding on our databases as they were
> >>>> created with the wrong encoding.  They were created as SQL_ASCII and
> >>>> we are changing them to UTF8.
> >>>>
> >>>> When testing this Friday, I received the following error:
> >>>>
> >>>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> >>>> pg_restore: [archiver (db)] Error from TOC entry 5225; 0 16990 TABLE
> >>>> DATA cust postgres
> >>>> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> >>>> for encoding "UTF8": 0xb0
> >>>> HINT:  This error can also happen if the byte sequence does not match
> >>>> the encoding expected by the server, which is controlled by
> >>>> "client_encoding".
> >>>> CONTEXT:  COPY cust, line 778
> >>>
> >>>                         ^^^^^^^ In the COPY command for that table.
> >>
> >> I picked up ont that, but the dump is binary, thus I can not view the
> >> actual code.
> >
> > Actually you can :) I should have mentioned it before. You can have
> > pg_restore restore to a file instead of a database by using the -f
> > switch. When you do that it creates plain text output. You could restore
> > the entire dump to the file or use the -t switch to get only the table
> > you need.
>
> Thanks for the suggestion.  As it stands, we are getting different
> errors for different hex characters, thus the solution we need is the
> ability to identify the characters that won't convert from SQL_ASCII to
> UTF8.  Is there a resource that would identify these characters?
>

Well the issue is that SQL_ASCII is not an encoding. From the docs:
http://www.postgresql.org/docs/9.0/interactive/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
"Thus, this setting is not so much a declaration that a specific encoding is in 
use, as a declaration of ignorance about the encoding. In most cases, if you 
are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting 
because PostgreSQL will be unable to help you by converting or validating 
non-ASCII characters. "

What you need to do is determine what applications where putting data into the 
database and what encoding they are using. I ran into this a couple of years 
back with an app that was using WIN1252 for data being inserted into a couple 
of tables in a SQL_ASCII database . Once I knew the encoding I dumped the table 
schema only for those tables into a new UTF8 database. Using psql I set the 
client_encoding to WIN1252 and then used \i to pull in a plain text data only 
dump for each table.

>
> --
> Until later, Geoffrey
>
> "I predict future happiness for America if they can prevent
> the government from wasting the labors of the people under
> the pretense of taking care of them."
> - Thomas Jefferson

-- 
Adrian Klaver
adrian.klaver@xxxxxxxxx

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general