Trying to recover a corrupted database

John Scalia <jayknowsunix@xxxxxxxxx> · Thu, 17 Jul 2014 15:10:16 -0400

Hi all,

You may have seen my post from yesterday about our production database getting corrupted. Well, this morning we brought the system down to single-user mode and ran an fsck, which did 
report some drive errors. We repeated the fsck until no additional errors were reported. Then we brought the system back to multi-user status and ran a successful pg_basebackup on the 
broken database. Since then we have restarted the database, and the ps -ef output looks like:

/usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c zero_damaged_pages=true -i -N 384 -p 5431

After the DB started up, we ran a VACUUM FULL ANALYZE, which ran for about 3 hours, but the database is still showing the same type of errors in its log: invalid page header in 
block 29718... etc. What disturbed me a little is that I don't think zero_damaged_pages got applied. Checking the pg_settings view, we got:

select name, setting, boot_val, reset_val from pg_settings where name = 'zero_damaged_pages';
        name        | setting | boot_val | reset_val
--------------------+---------+----------+-----------
 zero_damaged_pages | on      | off      | on

Now, my colleague ran this query after I told him how to set zero_damaged_pages again and he retried some operations. He swears that it was on when the first VACUUM 
FULL ANALYZE was run, but I'm not as sure. Plus, I don't understand why the boot_val shows as off. In any event, as we're still getting log errors like before, I don't really know 
what to try next other than rerunning the VACUUM FULL. Help?
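
One further check that might help settle where the current value came from: a wider version of the pg_settings query above. This is only a sketch. It assumes the source and context columns that the 9.2 documentation lists for pg_settings, and no output from it has been captured here.

```sql
-- Sketch: same lookup as above, plus where the active value was set from.
-- "source" is expected to distinguish e.g. a command-line -c option from a
-- per-session SET; "context" shows what is required to change the parameter.
SELECT name, setting, source, context, boot_val, reset_val
FROM pg_settings
WHERE name = 'zero_damaged_pages';
```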
--
Jay