Re: Do I have a corrupted database?

William Garrison <postgres@xxxxxxxxxxxx> · Wed, 27 Aug 2008 13:45:43 -0400

Craig Ringer wrote:
William Garrison wrote:

I fear I have a corrupted database, and I'm not sure what to do.

First, make sure you have a recent backup. If your backups rotate, stop
the rotation so that all currently available historical copies of the
database are preserved from now on - just in case you need them.

Since I made my post, we found that we can't do a pg_dump. :(  Every 
time this error appears in the logs, postgres forcibly closes any 
connections (including any running instances of pgadmin or pg_dump) when 
it runs this little recovery process.  We have backups from some days 
ago plus transaction logs.  We also have a snapshot of the file system, 
and I'm hoping to find a way to attach that onto another system.  I've 
had trouble with that in the past. 

As for the SAN and the Windows event log: Our IT guy says the SAN 
reported no failures at the time.  I don't know much about the SAN 
itself; I just know it uses dual Fibre Channel links and all the drives 
are in some RAID array.  I think it is also hardened against nuclear 
strikes and has a built-in laser defense system.  At the time of the problem, 
the Windows event log indicates no problems writing to the drives, or 
any other failures of any kind really.  No other apps crashed, no 
unusual memory usage, plenty of disk space.  So the cause is a complete 
mystery.  :(  So for now, I'm focused on repair.

We tried to REINDEX each table, and we are getting duplicate key errors 
so the reindex fails.  I can fix those records manually, but I was 
hoping to dump the database, find the duplicates using another system, 
then delete/repair the bad records and restore onto the production 
machine.  But since the backup/restore isn't working, that isn't looking 
like a viable option.
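For the "find the duplicates using another system" step, here is a minimal Python sketch. It assumes the rows of one table have already been exported by some means and loaded as tuples; the function name, the sample rows, and the choice of key column are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for every key that occurs more than once.

    rows        -- iterable of tuples, one per table row
    key_columns -- positions of the columns that are supposed to be unique
    """
    counts = Counter(tuple(row[i] for i in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data: (id, name) rows where id should be unique.
rows = [(1, "alice"), (2, "bob"), (2, "bob-again"), (3, "carol")]
duplicates = find_duplicate_keys(rows, key_columns=[0])
print(duplicates)  # {(2,): 2}
```

Each key it reports is a record that would have to be deleted or repaired before the unique index can be rebuilt.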

Is there any kind of repair tool for a postgres database?  Any sort of 
routine where I can take it offline, run something like pg_fsck --all, 
and have it come back with a report or a repair procedure?

Now, if possible, dump your database with pg_dump. Restore the dump to a
test database instance and make sure that it all goes OK.

Once that's done, you'll know you have a decent recovery point to work
from in case you make a mistake during your recovery efforts.
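The dump-and-test-restore step can be sketched roughly as follows in Python. This is only an illustration under assumptions: the database names and dump file name are hypothetical, connection options (host, port, user) are left out, and the standard pg_dump, createdb and psql client programs are assumed to be on the PATH.

```python
import subprocess


def dump_and_restore_commands(source_db, test_db, dump_file):
    """Build the command lines for a plain-text dump and a test restore."""
    return [
        # Dump the source database to a plain SQL file.
        ["pg_dump", "-f", dump_file, source_db],
        # Create an empty database to restore into.
        ["createdb", test_db],
        # Replay the dump, stopping at the first error so failures are visible.
        ["psql", "-v", "ON_ERROR_STOP=1", "-d", test_db, "-f", dump_file],
    ]


def run_all(commands):
    """Run each command in order; raises CalledProcessError on failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical names, shown for illustration only.
commands = dump_and_restore_commands("proddb", "restore_test", "proddb.sql")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run_all(commands) would execute the steps; if any of them exits with an error, the dump should not be treated as a usable recovery point.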

After that I don't have all that much to offer, especially as you're
running Pg on an operating system I don't have much experience with and
you're using an (unspecified) SAN.

Normally I'd ask if you'd verified your RAID array / tested your disks.
In this case, I'm wondering if there's any chance there was a service
interruption on the SAN that might've caused some sort of intermittent
or partial writes.

2008-08-23 20:00:27 ERROR:  xlog flush request E0/293CF278 is not
satisfied --- flushed only to E0/21B1B7F0
2008-08-23 20:00:27 CONTEXT:  writing block 94218 of relation
16712/16713/16725
2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
permanent.
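
To get a feel for how far apart the two positions in that error are, here is a small Python sketch. It assumes each position is a hexadecimal "log ID/offset" pair and that, within the same log ID, the difference between the offsets is a byte count; the function names are hypothetical.

```python
def parse_xlog_position(text):
    """Split a position such as 'E0/293CF278' into (log_id, offset) integers."""
    log_id, offset = text.split("/")
    return int(log_id, 16), int(offset, 16)


def flush_gap_bytes(requested, flushed):
    """Bytes between the requested flush point and what was actually flushed."""
    req_id, req_off = parse_xlog_position(requested)
    fl_id, fl_off = parse_xlog_position(flushed)
    if req_id != fl_id:
        # Crossing a log ID boundary is not handled in this sketch.
        raise ValueError("positions are in different log IDs")
    return req_off - fl_off


gap = flush_gap_bytes("E0/293CF278", "E0/21B1B7F0")
print(gap)  # 126565000
```

Under those assumptions, the block being written refers to a log position roughly 120 MiB beyond the point the log has actually been flushed to.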

Yeah, I'm really wondering about the SAN and SAN connection. What sort
of SAN is it? How is the host connected? Does it have any sort of
logging and monitoring that might let you see if there was a problem
around the time Pg was complaining?

Have you checked the Windows error logs?

--
Craig Ringer