Search Postgresql Archives

Re: [HACKERS] Invalid unicode in COPY problem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Tatsuo Ishii wrote:
> Sent: Sunday, May 08, 2005 3:31 PM
> To: John Hansen
> Cc: linux@xxxxxxxxxxx; pgsql-general@xxxxxxxxxxxxxx; 
> pgsql-hackers@xxxxxxxxxxxxxx
> Subject: Re: [HACKERS]  Invalid unicode in COPY problem
> 
> > Tatsuo Ishii wrote:
> > > Sent: Sunday, May 08, 2005 12:01 PM
> > > To: linux@xxxxxxxxxxx
> > > Cc: pgsql-general@xxxxxxxxxxxxxx; pgsql-hackers@xxxxxxxxxxxxxx
> > > Subject: Re: [HACKERS]  Invalid unicode in COPY problem
> > > 
> > > We have developed patches which relaxes the character 
> validation so 
> > > that PostgreSQL accepts invalid characters. It works like this:
> > 
> > That is just plain 100% wrong!!
> >
> > Under no circumstances should there be invalid data in a database.
> > And if you're trying to make a database of invalid data, 
> then at least 
> > encode it using a valid encoding.
> > 
> > In fact, I've proposed strengthening the validation 
> routines for UTF-8.
> 
> Actually I myself thought as you are before. Later I found 
> that it was not so good idea. People already have invalid 
> encoded data in their precious database and have very hard 
> time to migrate to newer version of PostgreSQL because of 
> encoding validation.
> 
> Think about this kind of situation:
> 
> There is a table t1(member_id integer primary key, 
> member_name text, address text, phone text, email text). I 
> have to reach each member by either adress, phone or email. 
> Unfortunately some of address field have wrong encoded data. 
> In this case I will use phone or email to reach them. 
> 
> Now I need to upgrade to newer PostgreSQL within 1 day. I 
> know I have to fix wrong encoded field but it will take more 
> than 1 day. So I would like to import the data first then fix 
> wrong encoded field on running database since I can reach 
> members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

> I saw this kind of situation in the real world and that's why 
> we developed the patches.
> --
> Tatsuo Ishii
> 
> 

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux