Hi Tom,
On Wed, Oct 21, 2015 at 11:46 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> bricklen <bricklen@xxxxxxxxx> writes:
>> I get the following output for ctid, id, other_id, tstamp:
>> (690651,42) | 318698967 | 347978007 | 2015-10-20 01:55:41.757+00
>> (690651,43) | 318698968 | 347978008 | 2015-10-20 01:55:41.663+00
>> (690651,44) | 318698969 | 347978009 | 2015-10-20 01:55:42.005+00
>> ERROR: invalid page in block 1226710 of relation base/16750/27244
>> It appears 690652 is what would be dd'd if that's the route we take. Is
>> that accurate?
> I'm confused by the block mentioned in the error message not having
> anything to do with the TID sequence. I wonder whether it refers to an
> index not the table proper. What query were you using to get this output,
> exactly? Have you confirmed which relation has relfilenode 27244?
Yes, it is definitely a table. There was originally an index on that table which threw the original error (about a sibling mismatch). I dropped the index and attempted to recreate it, which also failed. Further investigation led to the discovery of corruption in the table itself.
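For anyone following along, a catalog lookup along these lines should show which relation owns that filenode (the database name is just a placeholder):

psql -d mydb -c "SELECT oid::regclass AS relation, relkind FROM pg_class WHERE relfilenode = 27244;"

relkind = 'r' is an ordinary table, 'i' an index.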
>> Because the message indicates the corruption is in the table's page, not
>> the page header, according to the docs zero_damaged_pages probably won't
>> work.
> I should think that zero_damaged_pages would work fine, if the problem
> is indeed in the base table.
I will make note of that.
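If we do go that route, my rough understanding is that a table rewrite is needed to make the zeroing permanent, something along these lines (placeholder names; both statements in the same session):

psql -d mydb <<'SQL'
SET zero_damaged_pages = on;  -- superuser only; damaged pages are zeroed in memory with a warning
VACUUM FULL the_bad_table;    -- full rewrite, so the zeroed page never makes it into the new heap file
SQL

VACUUM FULL takes an ACCESS EXCLUSIVE lock and this is a fairly large table, so we would schedule a window for it. Please correct me if that sketch is off.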
> Another thing to keep in mind here, if you've got replication slaves,
> is that I'm not sure whether the effects of zero_damaged_pages would
> propagate to slaves. Have you investigated the possibility that some
> slave has an uncorrupt copy that you could dd into place in the master?
We do have one uncorrupted slave, and one corrupted. I have a 4-hour-delayed WAL-apply script that runs on the primary slaves in the disaster recovery data centres, and I stopped that process as soon as I saw the sibling-mismatch error on the master. That delayed slave is a viable candidate to fail over to if we can swing a 20+ hour window of data loss, but right now that is an undesirable option.
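One check I was planning before copying anything around: compare that block's page header on the uncorrupted slave against the master, assuming the pageinspect extension is already installed (table name is a placeholder):

psql -d mydb -c "SELECT * FROM page_header(get_raw_page('the_bad_table', 1226710));"

If that returns a sane-looking header on the slave but throws the same "invalid page" error on the master, that would give us some confidence the slave's copy of the block is intact.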
>> Is this the correct command if option #2 is chosen? Can it be executed
>> against a running cluster?
>> dd if=/dev/zero of=database/16750/27244 bs=8192 seek=690652 count=1
>> conv=notrunc
> Uh, no, you're not accounting for the fact that such an offset wouldn't be
> in the first segment file of the relation.
Hmm, I wasn't sure about that. Thanks for confirming that.
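To make sure I have the arithmetic straight for if and when we pull that trigger: with the default 8kB block size and 1GB segment files, each segment holds 131072 blocks. Whether we end up targeting the block from the error message or the one after the last readable ctid, the same arithmetic applies; I've used the error-message block below just as an example. Please shout if I've botched the numbers.

# run from the cluster's data directory
BLOCK=1226710                # block from the "invalid page" error
SEG=$((BLOCK / 131072))      # -> 9, i.e. segment file 27244.9
OFF=$((BLOCK % 131072))      # -> 47062, block offset within that segment
dd if=/dev/zero of=base/16750/27244.$SEG bs=8192 seek=$OFF count=1 conv=notrunc
# the same SEG/OFF math would apply if we instead dd the intact 8kB block
# across from the uncorrupted slave rather than zeroes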
As it stands, my next step is going to be a pg_dump of one of the up-to-date slaves (with corruption), excluding the bad table. Given that I know the affected PK id range, I can COPY out the table's contents before and after the affected data. That way we can at least recover from backup if things get entirely borked.
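Concretely, I'm picturing something like the following for that step (table and file names are placeholders, and the id boundaries need to be confirmed once we've narrowed down the damaged range):

# full dump, minus the damaged table
pg_dump -Fc --exclude-table=public.the_bad_table -f mydb_minus_bad_table.dump mydb

# rows on either side of the damaged range
LOW=318698969    # last id returned successfully before the error
HIGH=320000000   # placeholder for the first known-good id after the bad range
psql -d mydb -c "\copy (SELECT * FROM the_bad_table WHERE id <= $LOW) TO 'the_bad_table_before.copy'"
psql -d mydb -c "\copy (SELECT * FROM the_bad_table WHERE id >= $HIGH) TO 'the_bad_table_after.copy'"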
The next part of the plan is to create a temporary version of the table containing all of the data other than the corrupted range, then do some transaction-fu to swap the tables via renames.
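For that last part, what I have in mind is roughly this (placeholder names again, with the understanding that indexes, constraints, defaults, and any owned sequences have to be set up on the replacement table beforehand):

psql -d mydb <<'SQL'
BEGIN;
-- both RENAMEs take ACCESS EXCLUSIVE locks, so keep this window short
ALTER TABLE the_bad_table RENAME TO the_bad_table_damaged;
ALTER TABLE the_bad_table_replacement RENAME TO the_bad_table;
COMMIT;
SQL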
Thank you for your response, and any other insights are gratefully received.
Cheers,
Bricklen