Jigar Shah <jshah@xxxxxxxxxxx> writes:
> We had some disk issues on the primary, but RAID verification corrected
> those blocks. That may have caused the primary to be corrupt.

"corrected" for small values of "corrected", I'm guessing :-(

> I have identified the objects; they are both indexes:
>
>          relname         | relfilenode | relkind
> ------------------------+-------------+---------
>  feedback_packed_pkey   |      114846 | i
>  feedback_packed_id_idx |      115085 | i

Hm, well, the good news is you could reindex both of those (there's a
sketch in the postscript below); the bad news is that there are certainly
more problems than this.

> The secondary is the most recent copy. If we could just tell the
> secondary to go past that corrupt block and get the database started,
> we could then divert traffic to the secondary so our system can run
> read-only until we can isolate and fix our primary. But the secondary
> is stuck at this point and won't start. Is there a way to make the
> secondary do that? Is there a way to remove that block from the WAL
> file it's applying so it can go past that point?

No.  You could probably make use of the PITR functionality to let the
secondary replay up to just short of the WAL record where corruption
becomes apparent, then stop and come up normally (a rough sketch of the
settings is in the postscript below).  The problem here is that it seems
unlikely that the detected-inconsistent WAL record is the first bit of
corruption that's been passed to the secondary.  I don't have a lot of
faith in the idea that your troubles would be over if you could only
fire up the secondary.

It's particularly worrisome that you seem to be looking for ways to
avoid a dump/restore.  That should be your zeroth-order priority at
this point.

What I would do if I were you is to take a filesystem backup of the
secondary's entire current state (WAL and data directory) so that you
can get back to this point if you have to.  Then try the PITR
stop-at-this-point trick.  Be prepared to restore from the filesystem
backup and recover to some other stopping point, possibly a few times,
to get to the latest point that doesn't have clear corruption.

Meanwhile you could be trying to get the master into a better state.
It's not immediately obvious which path is going to lead to a better
outcome faster, but I wouldn't assume the secondary is in better shape
than the primary.

On the master, again it seems like a filesystem dump ought to be the
first priority, mainly so that you still have the data if the disks
continue the downward arc that it sounds like they're on.

In short: you're in for a long day, but your first priority ought to be
to make sure things can't get even worse.

			regards, tom lane
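
PS: once you do have a healthy copy to work with, rebuilding those two
indexes is mechanical.  A minimal sketch, to be run against the repaired
database rather than the stuck standby:

    -- rebuild the two indexes identified above
    REINDEX INDEX feedback_packed_pkey;
    REINDEX INDEX feedback_packed_id_idx;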
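
For the PITR stop-at-this-point trick, the recovery settings would look
something like the following.  The restore_command path and the target
timestamp are placeholders, and you should expect to adjust the target a
few times by trial; on releases that no longer read recovery.conf, the
same settings go in postgresql.conf together with a recovery.signal file.

    # recovery.conf on the standby
    restore_command = 'cp /path/to/wal_archive/%f %p'
    # stop replay just short of the first clearly-bad WAL record;
    # recovery_target_xid is an alternative if you know the transaction
    recovery_target_time = '2012-01-01 12:00:00'
    recovery_target_inclusive = false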
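
And the filesystem-level copies don't need anything fancy, so long as
the postmaster in question is shut down first (the paths here are
examples only):

    # snapshot the standby's entire data directory, WAL included,
    # before experimenting with recovery targets
    rsync -a /var/lib/pgsql/data/ /some/safe/place/standby-data/

    # on the master, as soon as it will run at all, a logical dump is
    # what really gets the data out of harm's way
    pg_dumpall > /some/safe/place/everything.sql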