Possible data corruption with Postgres 7.4.8

"Eric B. Ridge" <ebr@xxxxxxxx> · Mon, 13 Mar 2006 22:33:50 -0500

First off, let me make clear that I blame the strange data corruption  
problem we encountered today on our hardware raid controller -- some  
versions of its firmware are known-to-be-buggy and cause the raid set  
to "drop" off, and we've encountered this problem in the past on this  
particular server.  I also want to blame the corruption on pg7.4.8.   
Later versions seem to have fixes for (unspecific) "corruption"  
issues.  But hardware seems much more likely.

The environment for this server is:
   - pg7.4.8
   - 2x2.4GHz Xeons, HT
   - 4gig ram
   - Linux kernel version 2.6.15
   - Dell PowerEdge Expandable RAID Controller 3/Di (rev 01) which  
supposedly contains the latest firmware patches (tho I dunno how they  
relate to "rev 01").

Issue 1
-------

A single record quietly and silently disappeared in the middle of the  
day.  We have no daily vacuum processes (only vacuum at night), stats  
collection is turned off, and our application has no facility for  
actually DELETE-ing records (at least not from the containing  
table).  It's pure UPDATE, INSERT, and SELECT.  No foreign key  
references that might cause cascading deletes.  And no manual  
intervention via psql (that anyone will fess up to!).  From an  
application standpoint, there's simply no explanation for a record  
disappearing.

Anyways, we were playing around with pg_filedump on the affected  
table and discovered that the missing record in question had 212 old  
versions, all with exactly this infomask:

   infomask: 0x2513 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED| 
XMAX_COMMITTED|UPDATED)

I don't claim to fully understand what that mask actually means, but  
from what I can gather from looking at the masks of other known-to-be- 
good records, is that each of the 212 versions are "dead" in terms of  
MVCC rules, and that the expiration occurred as the result of an  
update to that version while it was "alive".  Is this correct (or  
kinda close)?  If I'm guessing correctly it seems we should have had  
at least one entry with the XMAX_INVALID flag set... that entry being  
the most-current version of the record.  Am I even providing enough  
information here for someone else to guess?

At first we feared transaction id wraparound, but the database  
cluster is being vacuumed every night and the age(datfrozenxid) was  
nowhere near the "2 billion" mark.  We were roughly 73 million past 1  
billion.  But we've never seen wraparound before and don't really  
know what age(datfrozenxid) would return.  However, it's damn near  
impossible we would have run >1 billion transactions in the 9 hours  
since the last VACUUM.

Also the pg7.4 docs indicate that VACUUM will output a warning  
message when nearing the 1 billion transaction mark.  In all our  
VACUUM logs, we've never seen this.

Which leads to...

Issue 2
-------

Last night's VACUUM process output this for a number of indexes:
   WARNING:  index "foo_pkey" contains 1348636 row versions, but  
table contains 1348635 row versions

however none of the indexes were on the table with the missing  
record.  All the indexes for "foo_pkey"'s table output the same  
warning message with the same number of index and table row versions  
-- all off by one.  We couldn't decide if each index was corrupt or  
if only the table was corrupt or both.

And we were seeing all the really great things you'd expect to see  
when indexes don't jive with their tables.  SELECTS using index scans  
were returning the wrong records, while sequential scans would return  
the correct record (or no record at all, which was expected).

Finally, after running some consistency checks against previous  
backups all we could decide to do was upgrade to pg7.4.12, REINDEX  
every index, vacuum the thing again, restore the one missing record,  
and move on.

We did save a binary copy of the entire installation tree, including  
$PGDATA in case further investigation is necessary.

Does anyone here have any kind of explanation other than bad hardware?

eric