Tatsuo Ishii wrote: > Sent: Sunday, May 08, 2005 3:31 PM > To: John Hansen > Cc: linux@xxxxxxxxxxx; pgsql-general@xxxxxxxxxxxxxx; > pgsql-hackers@xxxxxxxxxxxxxx > Subject: Re: [HACKERS] Invalid unicode in COPY problem > > > Tatsuo Ishii wrote: > > > Sent: Sunday, May 08, 2005 12:01 PM > > > To: linux@xxxxxxxxxxx > > > Cc: pgsql-general@xxxxxxxxxxxxxx; pgsql-hackers@xxxxxxxxxxxxxx > > > Subject: Re: [HACKERS] Invalid unicode in COPY problem > > > > > > We have developed patches which relaxes the character > validation so > > > that PostgreSQL accepts invalid characters. It works like this: > > > > That is just plain 100% wrong!! > > > > Under no circumstances should there be invalid data in a database. > > And if you're trying to make a database of invalid data, > then at least > > encode it using a valid encoding. > > > > In fact, I've proposed strengthening the validation > routines for UTF-8. > > Actually I myself thought as you are before. Later I found > that it was not so good idea. People already have invalid > encoded data in their precious database and have very hard > time to migrate to newer version of PostgreSQL because of > encoding validation. > > Think about this kind of situation: > > There is a table t1(member_id integer primary key, > member_name text, address text, phone text, email text). I > have to reach each member by either adress, phone or email. > Unfortunately some of address field have wrong encoded data. > In this case I will use phone or email to reach them. > > Now I need to upgrade to newer PostgreSQL within 1 day. I > know I have to fix wrong encoded field but it will take more > than 1 day. So I would like to import the data first then fix > wrong encoded field on running database since I can reach > members by phone or email even with wrong encoded address field... Actually would be very simple, create function isvalidutf8(text) in your preferred language. C source is available from unicode.org. Create function converttoutf8(text) using whatever code is required to transform the _wrong_ encoding (SQL_ASCII -> UTF8 for instance) to utf-8. Update table set field=converttoutf8(field) where !isvalidutf8(field); Now sit back and relax while your invalid data is converted to utf-8. When done, pg_dump the database, upgrade, and reload. This should take less than a day. > I saw this kind of situation in the real world and that's why > we developed the patches. > -- > Tatsuo Ishii > > ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings