Re: Best practices for moving UTF8 databases

"Albe Laurenz" <laurenz.albe@xxxxxxxxxx> · Mon, 20 Jul 2009 09:34:02 +0200

Phoenix Kiula wrote:
> Really, PG absolutely needs a way to upgrade the database without so
> much data related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

I'd call it data corruption, not a silly woe.

I know that Oracle for example would not make that much fuss about
your data: they would be imported without even a warning, and
depending on your encoding settings the bad bytes would either be
imported as-is or tacitly changed to inverted (or normal) question
marks.

It's basically a design choice that PostgreSQL made: we think that
an error is preferable to silently modifying the user's data
or accepting input that cannot possibly make any sense when it is
retrieved at a future time.

> Anyway this is the annoying error I see as always:
> 
>   ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> 
> I think my old DB is all utf8. If there are a few characters that are
> not, how can I work with this? I've done everything I can to take care
> of the encoding and such. This code was used to initdb:
> 
>  initdb --locale=en_US.UTF-8 --encoding=UTF8
> 
> Locale environment variables are all "en_US.UTF-8" too.

"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is the Euro
sign in the Windows-1252 code page. Somehow these bytes were imported
into PostgreSQL without the appropriate translation into UTF-8 (how,
I do not know).
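
To illustrate the point about 0x80, here is a minimal sketch in Python
(my choice for illustration, not something from your setup). It assumes
the stray byte came from a Windows-1252 client, which is only a guess
based on the byte value:

```python
# Assumption: the stray byte came from a Windows-1252 (cp1252) client.
raw = b"\x80"

# In Windows-1252, byte 0x80 is the Euro sign.
euro = raw.decode("cp1252")

# The same byte on its own is not valid UTF-8: 0x80 is a continuation
# byte and cannot start a character, so a strict decoder rejects it.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Properly translated, the Euro sign takes three bytes in UTF-8.
euro_utf8 = euro.encode("utf-8")

print(repr(euro), valid_utf8, euro_utf8)
```

If the guess is right, the bytes in your data should have been stored
as the three-byte UTF-8 sequence rather than as a bare 0x80.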

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

This does not incur any downtime (you can fix the data in the old
database before migrating), and it will definitely enhance the fun
your users have with your database (if they actually see Euros where
they should be).
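
As a rough sketch of what "locating and fixing" could look like, the
following Python works on raw byte lines, for example from a plain-text
dump. The function names find_bad_lines and repair are hypothetical, and
the Windows-1252 fallback is an assumption; check a sample of the
repaired values by eye before trusting it:

```python
def find_bad_lines(lines):
    """Return (line_number, raw_line) for every line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


def repair(raw, fallback="cp1252"):
    """Decode raw as UTF-8, reading only the invalid bytes as `fallback`.

    Assumption: the invalid bytes were written by a Windows-1252 client.
    Bytes that are undefined in the fallback encoding become U+FFFD.
    """
    parts = []
    pos = 0
    while pos < len(raw):
        try:
            parts.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into raw[pos:].
            parts.append(raw[pos:pos + err.start].decode("utf-8"))
            bad = raw[pos + err.start:pos + err.end]
            parts.append(bad.decode(fallback, errors="replace"))
            pos += err.end
    return "".join(parts)


sample = [b"price: 5 EUR", b"price: \x80 5"]
for number, line in find_bad_lines(sample):
    print(number, repr(repair(line)))
```

Valid UTF-8 passes through repair unchanged, so it can be run over all
lines and not just the ones flagged by find_bad_lines.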

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general