Once more: Recovering a damaged ext4 fs?

"J.D. Bakker" <jdb@xxxxxxxxxxxx> · Fri, 27 Mar 2009 21:41:21 +0100

Hi all,

My 4TB ext4 RAID-6 has been damaged again. Symptoms leading up to it 
were very similar to the last time (see 
http://article.gmane.org/gmane.comp.file-systems.ext4/11418 ): a 
process attempted to delete a large (~2GB) file, resulting in a soft 
lockup with the following call trace:

 [<ffffffff80526dd7>] ? _spin_lock+0x16/0x19
 [<ffffffff80317b49>] ? ext4_mb_init_cache+0x81c/0xa58
 [<ffffffff80281249>] ? __lru_cache_add+0x8e/0xb6
 [<ffffffff80279d37>] ? find_or_create_page+0x62/0x88
 [<ffffffff80317ec2>] ? ext4_mb_load_buddy+0x13d/0x326
 [<ffffffff80318385>] ? ext4_mb_free_blocks+0x2da/0x75e
 [<ffffffff802c02d7>] ? __find_get_block+0xc6/0x1bc
 [<ffffffff802feebb>] ? ext4_free_blocks+0x7f/0xb2
 [<ffffffff8031294b>] ? ext4_ext_truncate+0x3e3/0x854
 [<ffffffff80306e38>] ? ext4_truncate+0x67/0x5bd
 [<ffffffff8032594e>] ? jbd2_journal_dirty_metadata+0x124/0x146
 [<ffffffff80314d44>] ? __ext4_handle_dirty_metadata+0xac/0xb7
 [<ffffffff803024c1>] ? ext4_mark_iloc_dirty+0x432/0x4a9
 [<ffffffff80303177>] ? ext4_mark_inode_dirty+0x135/0x166
 [<ffffffff803074e0>] ? ext4_delete_inode+0x152/0x22e
 [<ffffffff8030738e>] ? ext4_delete_inode+0x0/0x22e
 [<ffffffff802b44ac>] ? generic_delete_inode+0x82/0x109
 [<ffffffff802acd44>] ? do_unlinkat+0xf7/0x150
 [<ffffffff802a380c>] ? vfs_read+0x11e/0x133
 [<ffffffff80527545>] ? page_fault+0x25/0x30
 [<ffffffff8020c0ea>] ? system_call_fastpath+0x16/0x1

Kernel is 2.6.29-rc6. The machine is still responsive to anything that 
doesn't touch the ext4 file system, but fails to halt. Upon power 
cycling, fsck fails with:

 newraidfs: Superblock has an invalid ext3 journal (inode 8).
 CLEARED.
 *** ext3 journal has been deleted - filesystem is now ext2 only ***

 newraidfs: Note: if several inode or block bitmap blocks or part
 of the inode table require relocation, you may wish to try
 running e2fsck with the '-b 32768' option first.  The problem
 may lie only with the primary block group descriptors, and
 the backup block group descriptors may be OK.

 newraidfs: Block bitmap for group 0 is not in group.  (block 3273617603)

 newraidfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
 	(i.e., without -a or -p options)

A manual e2fsck -nv /dev/md0 reported:

 e2fsck 1.41.4 (27-Jan-2009)
 ./e2fsck/e2fsck: Group descriptors look bad... trying backup blocks...
 Block bitmap for group 0 is not in group.  (block 3273617603)
 Relocate? no
 Inode bitmap for group 0 is not in group.  (block 3067860682)
 Relocate? no
 Inode table for group 0 is not in group.  (block 3051956899)
 WARNING: SEVERE DATA LOSS POSSIBLE.
 Relocate? no
 Group descriptor 0 checksum is invalid.  Fix? no
 Inode table for group 1 is not in group.  (block 1842273247)
 WARNING: SEVERE DATA LOSS POSSIBLE.
 Relocate? no
 Group descriptor 1 checksum is invalid.  Fix? no
 Inode bitmap for group 2 is not in group.  (block 3148026909)
 Relocate? no
 Inode table for group 2 is not in group.  (block 1321535690)
 WARNING: SEVERE DATA LOSS POSSIBLE.
 Relocate? no
 Group descriptor 2 checksum is invalid.  Fix? no
 [...]
 ./e2fsck/e2fsck: Invalid argument while reading bad blocks inode
 This doesn't bode well, but we'll try to go on...
 Pass 1: Checking inodes, blocks, and sizes
 Illegal block number passed to ext2fs_test_block_bitmap #3051956899 for in-use block map
 Illegal block number passed to ext2fs_mark_block_bitmap #3051956899 for in-use block map
 Illegal block number passed to ext2fs_test_block_bitmap #3051956900 for in-use block map
 Illegal block number passed to ext2fs_mark_block_bitmap #3051956900 for in-use block map
 [...]
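
As a rough sanity check on the block numbers reported above, the sketch below compares them against the largest block number a file system of this size could contain. The 4 KiB block size is an assumption (it is suggested by the '-b 32768' hint but not stated in the output), the size is taken as an upper bound of 4 TiB, and all names are illustrative only.

```python
# Assumptions, not confirmed by the e2fsck output:
#   - 4 KiB blocks
#   - the file system is at most 4 TiB
BLOCK_SIZE = 4096
FS_BYTES_UPPER_BOUND = 4 * 2**40
TOTAL_BLOCKS = FS_BYTES_UPPER_BOUND // BLOCK_SIZE

# Block numbers copied from the e2fsck -nv output above.
reported = {
    "group 0 block bitmap": 3273617603,
    "group 0 inode bitmap": 3067860682,
    "group 0 inode table": 3051956899,
    "group 1 inode table": 1842273247,
    "group 2 inode bitmap": 3148026909,
    "group 2 inode table": 1321535690,
}


def beyond_end(block, total_blocks=TOTAL_BLOCKS):
    """True if the block number cannot exist on a file system this size."""
    return block >= total_blocks


for name, block in reported.items():
    status = "beyond end of fs" if beyond_end(block) else "within fs"
    print(f"{name}: block {block} ({status})")
```

Under these assumptions every reported location lies past the end of the device, which would be consistent with the descriptors containing garbage rather than merely stale values.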

Full logs available at:
  http://lartmaker.nl/ext4/e2fsck-md0-20090327.txt
  http://lartmaker.nl/ext4/e2fsck-md0-32768-20090327.txt
  http://lartmaker.nl/ext4/e2fsck-md0-98304-20090327.txt

I've run dumpe2fs:
  http://lartmaker.nl/ext4/dumpe2fs-md0-20090327.txt
  http://lartmaker.nl/ext4/dumpe2fs-md0-32768-20090327.txt
  http://lartmaker.nl/ext4/dumpe2fs-md0-98304-20090327.txt
...but it worries me that all three start with "ext2fs_read_bb_inode: 
Invalid argument".
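
For reference, the 32768 and 98304 values used above are backup superblock locations. The sketch below lists where further backups would be expected, assuming 4 KiB blocks (so 32768 blocks per group, first data block 0) and the sparse_super feature, which keeps backups only in groups 0, 1 and powers of 3, 5 and 7. Neither assumption is confirmed by the logs, and the function names are hypothetical.

```python
def is_power_of(n, base):
    """True if n == base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(group_count, blocks_per_group=32768):
    """Block numbers of backup superblocks under sparse_super.

    Assumes 4 KiB blocks, so group g starts at block g * blocks_per_group.
    Group 0 holds the primary superblock and is skipped.
    """
    blocks = []
    for group in range(1, group_count):
        if group == 1 or any(is_power_of(group, b) for b in (3, 5, 7)):
            blocks.append(group * blocks_per_group)
    return blocks


# Candidates within the first 50 block groups, usable with e2fsck -b <block>.
for block in backup_superblock_blocks(50):
    print(block)
```

Any of the listed values could be tried with e2fsck -n -b in the same way as the two already attempted, provided the assumptions hold for this file system.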

This is linux-2.6.29-rc6 (x86_64) running on an Intel Core i7 920 
processor (quad core plus hyperthreading). Kernel config is 
http://lartmaker.nl/ext4/kernel-config-20090327.txt ; dmesg is at 
http://lartmaker.nl/ext4/dmesg-20090327.txt

So,
- is there a way to recover my file system? I do have backups of most 
data, but as my remote weeklies run on Saturdays I'd still lose a lot 
of work
- is ext4 on software RAID-6 on x86_64 considered production stable? 
I have been getting these hangs almost monthly, which is a lot worse 
than my old ext3 software RAID.

Thanks,

JDB.

--
LART. 250 MIPS under one Watt. Free hardware design files.
http://www.lartmaker.nl/