I previously wrote about a recent conversion from ext3 to ext4 (on Debian Squeeze), which went well. However, I now seem to be having problems with the ext4 filesystem.

Yesterday, a file in /var/spool/postfix/defer was giving an I/O error:

Jun 3 15:00:14 willet postfix/qmgr[29108]: fatal: qmgr_message_alloc: 677AE298316F: remove defer 677AE298316F: Input/output error

If I tried to stat it, I got the same error. I also noticed a lot of these on the console:

[6060479.296658] EXT4-fs error (device dm-4): ext4_lookup: deleted inode referenced: 169640807
[6060482.776087] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

The system was clearly acting strangely, so I decided it was best to touch /forcefsck and reboot to clean up the filesystem. fsck reported a couple of "Multiply-claimed block(s)" messages, along with "(There are 10 inodes containing multiply-claimed blocks.)", and then I was required to run fsck again. I did, and the filesystem seemed fine after the second run (these fscks took hours). Once things seemed clean, I started the system back up and it began to operate normally.

I then began to see the following on the console:

[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429952(bit 3456 in group 1722)
[ 3201.714348] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429953(bit 3457 in group 1722)
[ 3201.725665] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429954(bit 3458 in group 1722)
[ 3201.737028] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429955(bit 3459 in group 1722)
[ 3201.748721] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429956(bit 3460 in group 1722)
[ 3201.760021] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429957(bit 3461 in group 1722)
[ 3201.771489] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429958(bit 3462 in group 1722)
[ 3201.782908] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429959(bit 3463 in group 1722)
[ 3201.794281] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429960(bit 3464 in group 1722)
[ 3201.805664] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429961(bit 3465 in group 1722)
[ 3201.818936] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.289345] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 3202.328925] JBD: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

I'm concerned that this happened so quickly after a fsck had resolved the earlier issues. The filesystem sits on top of a software RAID mirror, so I failed one half of the mirror, ran S.M.A.R.T. short and long tests on that device, re-added it to the array, waited the 8 hours for the resync, and then did the same with the other half. All SMART tests completed without error.

I took the machine down to add another disk, so that I would have more flexibility to run badblocks tests, and when the system came back up a fsck of the partition was required. It has been running for 3 hours now, and so far it has only said "Duplicate or bad block in use!", so I presume it is scanning the entire device for duplicate blocks. This is what it did during the previous fsck. Last time the first pass took 8 hours to complete, and then it had to do another pass after a reboot, which took somewhere between 1.5 and 4 hours (I was asleep when it finished). So we've been down for a number of hours now, which is quite bad.

It's certainly possible that this is a hardware issue rather than a filesystem one; the badblocks tests should give us more conclusive information. I would love any additional suggestions for what we can do to conclusively identify the problem.

Thanks for reading, and for any thoughts you might have!

micah
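
P.S. As a sanity check on the double-free messages above, here is a minimal sketch that parses them and confirms that each reported block number agrees with its group and bit. It assumes a geometry that is not stated anywhere in this mail: 4 KiB blocks, hence 32768 blocks per group, with the first data block at 0 (dumpe2fs -h on the device would show the real values). The function names are made up for illustration. Under that assumption the numbers are self-consistent and the reported blocks form one contiguous run inside group 1722, which by itself does not say whether the cause is hardware or software.

```python
import re

# Assumed geometry (NOT stated in the post): 4 KiB blocks, so 32768 blocks
# per group, and first data block 0. Confirm with `dumpe2fs -h <device>`.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

# Matches the kernel message format quoted in this mail.
PATTERN = re.compile(
    r"double-free of inode (\d+)'s block (\d+)\(bit (\d+) in group (\d+)\)"
)

# Three of the console lines quoted above, copied verbatim.
SAMPLE = """\
[ 3201.702997] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429952(bit 3456 in group 1722)
[ 3201.714348] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429953(bit 3457 in group 1722)
[ 3201.725665] EXT4-fs error (device dm-4): mb_free_blocks: double-free of inode 0's block 56429954(bit 3458 in group 1722)
"""


def parse_double_frees(log_text):
    """Return one (inode, block, bit, group) tuple per double-free message."""
    return [tuple(int(x) for x in m.groups()) for m in PATTERN.finditer(log_text)]


def is_consistent(block, bit, group):
    """Check block == first_data_block + group * blocks_per_group + bit."""
    return block == FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP + bit


def contiguous_runs(blocks):
    """Collapse block numbers into sorted (first, last) inclusive runs."""
    runs = []
    for b in sorted(set(blocks)):
        if runs and b == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], b)
        else:
            runs.append((b, b))
    return runs


if __name__ == "__main__":
    entries = parse_double_frees(SAMPLE)
    for inode, block, bit, group in entries:
        status = "ok" if is_consistent(block, bit, group) else "MISMATCH"
        print(f"inode {inode} block {block} group {group} bit {bit}: {status}")
    print("runs:", contiguous_runs([e[1] for e in entries]))
```

Feeding the full dmesg output into parse_double_frees instead of SAMPLE would show whether later errors stay within the same block range or spread to other groups.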