Re: [HACKERS] Invalid unicode in COPY problem

Tatsuo Ishii <t-ishii@xxxxxxxxx> · Sun, 08 May 2005 14:30:45 +0900 (JST)

> Tatsuo Ishii wrote:
> > Sent: Sunday, May 08, 2005 12:01 PM
> > To: linux@xxxxxxxxxxx
> > Cc: pgsql-general@xxxxxxxxxxxxxx; pgsql-hackers@xxxxxxxxxxxxxx
> > Subject: Re: [HACKERS]  Invalid unicode in COPY problem
> > 
> > We have developed patches which relax the character 
> > validation so that PostgreSQL accepts invalid characters. It 
> > works like this:
> 
> That is just plain 100% wrong!!
>
> Under no circumstances should there be invalid data in a database.
> And if you're trying to make a database of invalid data, then at 
> least encode it using a valid encoding.
> 
> In fact, I've proposed strengthening the validation routines for UTF-8.

Actually, I used to think the same way you do. Later I found that it
was not such a good idea. People already have invalidly encoded data in
their precious databases, and they have a very hard time migrating to a
newer version of PostgreSQL because of the encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to be able to reach each
member by address, phone or email. Unfortunately, some of the address
fields contain wrongly encoded data. In that case I will use phone or
email to reach those members.

Now I need to upgrade to a newer PostgreSQL within 1 day. I know I have
to fix the wrongly encoded fields, but that will take more than 1 day.
So I would like to import the data first and then fix the wrongly
encoded fields on the running database, since I can still reach members
by phone or email even with a wrongly encoded address field...
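
To illustrate the "fix it later" step, here is a minimal sketch (not part
of the patches) of how one might find the affected fields once the raw
data is available. It assumes the database encoding is UTF-8 and that the
rows have already been fetched as raw bytes; the helper names
is_valid_utf8 and find_bad_fields and the sample rows are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def find_bad_fields(rows, text_columns):
    """Return (member_id, column) pairs whose value is not valid UTF-8.

    rows is a list of dicts keyed by column name, holding raw bytes
    for the text columns (hypothetical layout, for illustration only).
    """
    bad = []
    for row in rows:
        for col in text_columns:
            value = row.get(col)
            if value is not None and not is_valid_utf8(value):
                bad.append((row["member_id"], col))
    return bad


# Made-up sample rows: member 2 has a broken address but a usable phone.
rows = [
    {"member_id": 1, "address": "Tokyo".encode("utf-8"), "phone": b"03-0000-0000"},
    {"member_id": 2, "address": b"\xff\xfeOsaka", "phone": b"06-0000-0000"},
]

print(find_bad_fields(rows, ["address", "phone"]))
```

The resulting list tells me which address fields to repair later, while
phone and email stay usable in the meantime.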

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii
