ext4 corruption

Micah Anderson <micah@xxxxxxxxxx> · Sun, 05 Jun 2011 23:59:34 -0400

I previously wrote about a recent conversion from ext3 to ext4 (on
Debian Squeeze), which went well. However, I seem to be having problems
with the ext4 filesystem.

Yesterday, a file in /var/spool/postfix/defer was giving an I/O error:

Jun  3 15:00:14 willet postfix/qmgr[29108]: fatal: qmgr_message_alloc:
677AE298316F: remove defer 677AE298316F: Input/output error

Running stat on it gave the same error. On the console, I noticed I was
getting a lot of these:

[6060479.296658] EXT4-fs error (device dm-4): ext4_lookup: deleted inode referenced: 169640807
[6060482.776087] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

The system was clearly acting strange, so I decided it was best to touch
/forcefsck and restart to clean up the filesystem.

I got a couple of "Multiply-claimed block(s)" messages and "(There are
10 inodes containing multiply-claimed blocks.)", and then I was required
to run fsck again. I did, and it seemed to be fine after the second run
(these fscks took hours).

After things seemed clean, I started the system back up and it began to
operate fine. I then began to see the following on the console:

[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429952(bit 3456 in group 1722)
[ 3201.714348] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429953(bit 3457 in group 1722)
[ 3201.725665] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429954(bit 3458 in group 1722)
[ 3201.737028] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429955(bit 3459 in group 1722)
[ 3201.748721] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429956(bit 3460 in group 1722)
[ 3201.760021] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429957(bit 3461 in group 1722)
[ 3201.771489] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429958(bit 3462 in group 1722)
[ 3201.782908] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429959(bit 3463 in group 1722)
[ 3201.794281] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429960(bit 3464 in group 1722)
[ 3201.805664] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429961(bit 3465 in group 1722)
[ 3201.818936] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.289345] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.328925] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
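
The block, bit and group numbers in the double-free messages above can
be cross-checked with a little arithmetic. The following is a minimal
Python sketch, not part of the original report: it assumes 4 KiB blocks
(so 32768 blocks per group) and a first data block of 0, neither of
which is stated above, and the helper names are made up for
illustration.

```python
import re

# Assumptions (not stated in the report): 4 KiB blocks, which gives
# 32768 blocks per group, and a first data block of 0. Check the real
# values with `dumpe2fs -h` on the device before relying on this.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

# Matches the mb_free_blocks double-free messages as they appear above.
DOUBLE_FREE = re.compile(
    r"mb_free_blocks: double-free of inode (\d+)'s block (\d+)"
    r"\(bit (\d+) in group (\d+)\)"
)


def parse_double_free(line):
    """Return (inode, block, bit, group) for a double-free line, else None."""
    m = DOUBLE_FREE.search(line)
    return tuple(int(x) for x in m.groups()) if m else None


def expected_block(group, bit):
    """Block number implied by a group index and a bit in its bitmap."""
    return FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP + bit


sample = (
    "[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: "
    "double-free of inode 0's block 56429952(bit 3456 in group 1722)"
)
inode, block, bit, group = parse_double_free(sample)
print(block, expected_block(group, bit), block == expected_block(group, bit))
```

Under those assumptions the numbers agree (1722 * 32768 + 3456 =
56429952), and the ten messages cover one contiguous run, blocks
56429952 through 56429961, all in group 1722.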

I'm concerned that this happened so quickly after a fsck resolved
issues.

The filesystem is on top of a software RAID mirror, so I failed one
member and ran S.M.A.R.T. short/long tests on the device, re-added it to
the array, waited the 8 hours for the resync, and then did the same
thing with the other member of the array. All S.M.A.R.T. tests completed
without error.
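
For reference, the per-disk procedure described above can be written
down as a short plan. This is only a sketch in Python that builds and
prints the commands without running them; the array and device names
(/dev/md0, /dev/sdb1, /dev/sdb) are hypothetical placeholders, not the
ones on this machine, and the function name is made up for illustration.

```python
def mirror_check_plan(array, member, disk):
    """Build (but do not run) the commands for testing one mirror member.

    array, member and disk are hypothetical device paths supplied by
    the caller.
    """
    return [
        # Mark the member as failed, then take it out of the array.
        ["mdadm", array, "--fail", member],
        ["mdadm", array, "--remove", member],
        # Start the self-tests; each one has to finish before the next
        # is started.
        ["smartctl", "-t", "short", disk],
        ["smartctl", "-t", "long", disk],
        # Read back the self-test log once the tests have completed.
        ["smartctl", "-l", "selftest", disk],
        # Put the member back and watch the resync progress.
        ["mdadm", array, "--add", member],
        ["cat", "/proc/mdstat"],
    ]


plan = mirror_check_plan("/dev/md0", "/dev/sdb1", "/dev/sdb")
for cmd in plan:
    print(" ".join(cmd))
```

The same plan would then be repeated for the other member, but only
after the resync has finished, so that the array always has one good
copy.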

I took the machine down to add another disk to the system so I would
have more flexibility to run badblocks tests, and when the system came
back up a fsck of the partition was required. It's been running for 3
hours now, and so far it has only said "Duplicate or bad block in use!"
so I presume it is scanning the entire device for duplicate blocks. This
is what it did during the previous fsck.

Last time it took 8 hours to complete the first pass, and then it had to
do another pass after a reboot, which took somewhere between 1.5 and 4
hours (I was sleeping when it finished). So we've been down for a number
of hours now, which is quite bad.

It's certainly possible that this is not a filesystem issue but a
hardware one; the badblocks tests should give us more conclusive
information. I would love any additional suggestions for what we can do
to conclusively identify what the issue is.

Thanks for reading, and for any thoughts you might have!

micah