On 2010-11-21, Sim Zacks <sim@xxxxxxxxxxxxxx> wrote:
> I am using PG 8.2.17 with UTF8 encoding.
> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
> (Gentoo 4.1.1)"
>
> One of my tables somehow has invalid characters in it:
>> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
>> HINT: This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
> I have already manually found a number of the bad rows by running
> queries with text functions (upper) between groups of IDs until I found
> the specific bad row.
>
> 1) Is there a quicker way to get a list of all rows with invalid characters?

Dump the table, run it through iconv, and diff against the original.

> 2) Shouldn't the database prevent these rows from being entered in the
> first place?

It should have, but that bug has since been fixed.

> 3) I have backups of this database (using -Fc) and I noticed that on
> restore, this table is not restored because of this error. Is there a
> way to fix the existing backups, or tell the restore to ignore bad rows
> instead of erroring out the whole table?

Translate them to SQL (use pg_restore with no database name), then you can
again use iconv to clean them.

Use iconv something like this:

iconv --from-code UTF8 --to-code UTF8 -c < input_file > output_file

This will translate surrogates and drop other invalid characters. If you
have any constraints that place lower bounds on string length, this has
the potential to break them.

--
100% natural

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
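A rough sketch of the steps above, assuming a database named "mydb", a table named "mytable", and a backup file "backup.dump" (all placeholder names, not from the original post). The final line is a small standalone demonstration of what iconv -c does to an invalid byte:

```shell
# Sketch only; "mydb", "mytable", and "backup.dump" are placeholders.

# 1) Find the bad rows: dump the table, strip invalid bytes with iconv,
#    then diff -- the differing lines are the rows containing bad bytes.
# pg_dump --table=mytable mydb > mytable.sql
# iconv --from-code UTF8 --to-code UTF8 -c < mytable.sql > mytable.clean.sql
# diff mytable.sql mytable.clean.sql

# 3) Clean an existing -Fc backup: pg_restore with no database name
#    writes plain SQL to stdout; clean that with iconv, restore via psql.
# pg_restore backup.dump > backup.sql
# iconv --from-code UTF8 --to-code UTF8 -c < backup.sql > backup.clean.sql
# psql mydb < backup.clean.sql

# Demonstration of iconv -c on a tiny input: 0xa9 (octal \251) is not a
# valid UTF8 sequence on its own, so -c drops it and keeps the rest.
# (|| true because some iconv builds exit nonzero when bytes are dropped.)
printf 'good\251text\n' | iconv --from-code UTF8 --to-code UTF8 -c || true
# prints: goodtext
```

Note that dropping bytes changes row contents, which is why the string-length-constraint caveat above matters: a value that barely satisfied a CHECK (length(col) >= n) constraint can fall below the bound after cleaning.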