After dumping a database (pg_dump -F c database > dump), trying to restore it (pg_restore dump) gives:
> pg_restore: [archiver (db)] Error from TOC entry 2463; 0 58451 TABLE DATA table user
> pg_restore: [archiver (db)] COPY failed: ERROR:Â invalid byte sequence for encoding "UTF8": 0xe3273a
> HINT:Â This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
> CONTEXT:Â COPY table, line 1
The surface reason for this is clear enough: invalid UTF-8 data crept into some tsvector columns. This is a much more serious problem, however: the backup tools for the database are, without warning, generating data that can't be restored.
When in a data recovery situation, a backup that won't restore is catastrophic. I can't restore the database to a state it was in at the time of the backup; I have to spend hours of downtime figuring out what to do to make something usable out of my backup; and then I have to hope I've corrected the backup correctly before bringing the server back online. (If I was in an actual backup recovery
situation--fortunately I'm not--I'd be more inclined to edit the Postgresql source to disable
this check while restoring the backup than to risk trying to manually fix the backup data directly, which is very easy to get wrong.)
Recommendations:
- So long as there's any possibility of this happening, it should be possible to force Postgresql to ignore this error, to guarantee that backups can be restored. If the database allowed this situation to happen in the first place, then it should allow it to be restored from a backup too.
- There was no warning of any problem when the backup was made. pg_dump should warn about potential recovery problems in the data it's outputting. This should be a fatal error unless explicitly overridden, so backups that can't be restored won't be generated accidentally.
I don't know how this data entered the database in the first place. The problematic data is (so far) entirely in tsvectors, generated from to_tsvector and/or tsvector_update_trigger, but I havn't reproduced it. Updating the rows in question on the server (to cause its tsvector_update_trigger to be fired) fixes the problematic tsvector columns. For what it's worth, 0xe3273a in the dump is within the string "'x':17" where x is \xe3, the first byte of the UTF-8 representation of U+30FC "ã". (If this sounds like a known or fixed problem I'd be interested to know, but this sort of problem in a minor subsystem like FTS shouldn't be able to silently break backups in the first place.)
--
Glenn Maynard