Hi,
We have run into some corruption in one of our production tables. We know the cause (a compute node was moved), but now we need to fix the data. We have backups, but at this point they are nearly a day old, so recovering from them is a last-resort and will incur significant downtime.
We are running 9.3.9
Following the steps at http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html
I get the following output for ctid, id, other_id, tstamp:
(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244
It appears 690652 is what would be dd'd if that's the route we take. Is that accurate?
Because the message indicates the corruption is in the table's page, not the page header, according to the docs zero_damaged_pages probably won't work.
What are my options?
1). Enable zero_damaged_pages and execute VACUUM FREEZE (and hope).
2). dd the block(s) using the output of the ctid query above.
It is multi-gigabyte table that is extremely heavily used (100's to 1000's of queries per second) so a VACUUM FULL or CLUSTER are options we'd really like to avoid if possible. The database is about 250GB, not huge, but big enough that slaves and backups are time consuming to redo, or recover from.
Will attempting zero_damaged_pages cause any harm as the first step (other than the obvious destruction of any bad pages)?
Is this the correct command if option #2 is chosen? Can it be executed against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc
Thanks,
Bricklen
We have run into some corruption in one of our production tables. We know the cause (a compute node was moved), but now we need to fix the data. We have backups, but at this point they are nearly a day old, so recovering from them is a last-resort and will incur significant downtime.
We are running 9.3.9
Following the steps at http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html
I get the following output for ctid, id, other_id, tstamp:
(690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
(690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
(690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
ERROR: invalid page in block 1226710 of relation base/16750/27244
It appears 690652 is what would be dd'd if that's the route we take. Is that accurate?
Because the message indicates the corruption is in the table's page, not the page header, according to the docs zero_damaged_pages probably won't work.
What are my options?
1). Enable zero_damaged_pages and execute VACUUM FREEZE (and hope).
2). dd the block(s) using the output of the ctid query above.
It is multi-gigabyte table that is extremely heavily used (100's to 1000's of queries per second) so a VACUUM FULL or CLUSTER are options we'd really like to avoid if possible. The database is about 250GB, not huge, but big enough that slaves and backups are time consuming to redo, or recover from.
Will attempting zero_damaged_pages cause any harm as the first step (other than the obvious destruction of any bad pages)?
Is this the correct command if option #2 is chosen? Can it be executed against a running cluster?
dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc
Thanks,
Bricklen