Autovacuum and invalid page header

Ireneusz Pluta <ipluta@xxxxx> · Wed, 12 May 2010 13:11:12 +0200

I had a problem with neverending forced autovacuum process, running as 
preventing xid wraparound. As I finally (?) found, following some 
advices given here: ->  
http://forums.enterprisedb.com/posts/list/2028.page, that autovacuum in 
question was not just one autovacuum, but many different autovacuums on 
the same table cycling in start-fail scenario because of invalid page 
headers of some indexes of the table. Manual VACUUM VERBOSE said me 
that, now I can also see that in related ERROR-CONTEXT log message 
pairs. I dropped the damaged indexes and now that autovacuum seems to 
continue without repeating itself.

Anyway, another questions come from that situation and I'd like to 
discuss them here and get some advice.

First, it seems that an unnoticed damage of a relation, causing 
autovacuum failures, which might initially be harmless for the whole 
cluster (as far as only damaged relation and its disk space is 
concerned), may lead to a critical situation when a wraparound gets 
close, and autovacuum is in prevenitng mode can't freeze its xids. The 
forced autovacuum insists on vacuuming the table which can not be 
vacuumed and locks itself there. At the same time, autovacuum get 
sticked with a database of the damaged relation, and "forgets" about 
other databases which might need me vacuumed. From this point of view, 
just one damaged table might be a single point of failure of the whole 
cluster. Is there any way, other than tracing logs, to prevent such a 
situation?

Another issue, rather loosely related to the topic. I suspect that the 
cluster may have more invalid page headers like that. They might be 
caused by past bad sector failures of one of the drives of my raid 
array. So I think it might be a good idea to check the whole cluster, 
page by page, for invalid page headers. Is there any ready tool, which, 
for instance when given a path to database cluster, would traverse all 
cluster directories and files and check all page headers? I probably 
answered myself - manual VACUUM [VERBOSE] would do - but it fails when 
finds the first invalid header, so I would have. That's why I am looking 
for something only scanning and reporting invalid headers.

Yet another issue: how could that happen, the bad sectors of a RAID10 
member drive caused page damages? (as far as I am right in this 
assumption, of course, but I have a reason to think so). Should not the 
RAID array prevent such a file damage?

Thanks

Irek.

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin