Problems with checking corrupted large ext3 file system

Andre Noll <maan@xxxxxxxxxxxxxxx> · Wed, 3 Dec 2008 11:11:00 +0100

Hi,

I'm having some trouble checking a corrupted 9T ext3 fs which resides
on a logical volume. The underlying physical volumes are three hardware
raid systems, one of which started to crash frequently. I was able
to pvmove away the data from the buggy system, so everything is fine
now on the hardware side.

However, the crashes left me with a seriously corrupted file system
from which I'm trying to recover as much as possible. First step was
to unmount the file system after users reported I/O errors when trying
to open files. The system log contained many messages like

	[102445.420125] EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 544108393

and some of the form

	[160301.277477] EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #153542738: rec_len % 4 != 0 - offset=0, inode=1381653864, rec_len=26709, name_len=79

So I compiled the master branch of the e2fsprogs git repo as of
Dec 1 (tip: 8680b4) and executed

	./e2fsck -y -C0 /dev/mapper/abel-abt6_projects

This ran for a while and then started to output a couple of these:

	Inode table for group 68217 is not in group.  (block 825373744)
	WARNING: SEVERE DATA LOSS POSSIBLE.

along with many lines of the form

	Illegal block #3036172 (4233778405) in inode 115335438.  CLEARED.
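
As a rough sanity check on these two messages, the reported block numbers can be compared against the expected on-disk layout. The sketch below assumes a 4 KiB block size, 32768 blocks per group (the usual mke2fs default for that block size), a first data block of 0, and a file system of exactly 9 TiB. None of these values were read from the actual superblock, and the helper names are made up for illustration.

```python
# All constants below are assumptions, not values taken from the superblock.
BLOCK_SIZE = 4096                      # assumed ext3 block size in bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE      # 32768, one bitmap block's worth of bits
FS_BYTES = 9 * 2**40                   # "9T" taken as exactly 9 TiB
TOTAL_BLOCKS = FS_BYTES // BLOCK_SIZE  # number of blocks in the file system


def group_of(block):
    """Return the block group a block number falls into."""
    return block // BLOCKS_PER_GROUP


def group_range(group):
    """Return (first_block, last_block) covered by a block group."""
    first = group * BLOCKS_PER_GROUP
    return first, first + BLOCKS_PER_GROUP - 1


def is_beyond_fs(block):
    """True if the block number lies past the end of the file system."""
    return block >= TOTAL_BLOCKS


if __name__ == "__main__":
    print("group 68217 covers blocks", group_range(68217))
    print("block 825373744 is in group", group_of(825373744))
    print("block 4233778405 beyond fs end:", is_beyond_fs(4233778405))
```

Under these assumptions, block 825373744 falls into group 25188, nowhere near group 68217, and block 4233778405 lies past the end of the file system. Both would be consistent with what e2fsck complains about.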

But then it continued just fine without printing further
messages. After about 4 hours it completed but decided to re-run from
the beginning and this is where the real trouble seems to start. The
next day I found thousands of lines like this on the console:

	/backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)

followed by

	Clone multiply-claimed blocks? yes

At this point the fsck seems to hang. No further messages, no progress
bar for at least 17 hours. The lights on the raid system aren't
flashing but there seems to be a bit of I/O going on as stracing the
e2fsck process yields

	lseek(3, 6206310776832, SEEK_SET)       = 6206310776832
	read(3, "002107740635\tD\t2\t169\t35\t0\thhhhhh"..., 4096) = 4096
	lseek(3, 1263113973760, SEEK_SET)       = 1263113973760
	write(3, "B9K@=?4C=L-F77F4:CGGK\n3\t14221118"..., 4096) = 4096
	lseek(3, 5861641846784, SEEK_SET)       = 5861641846784
	read(3, "hhhhhh\tIIIIIIIIIIIIIIIIIIIIIIIII"..., 4096) = 4096
	lseek(3, 1263113977856, SEEK_SET)       = 1263113977856
	write(3, "\t1.00\t0.46\t19\t4\t2\t0\t1\tA\t33\t31\t0\t"..., 4096) = 4096
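
To see which blocks are being touched, the byte offsets in the strace output can be converted to block numbers. This is only a sketch: it assumes a 4 KiB block size (matching the 4096-byte transfers), pairs each read or write with the lseek immediately before it, and uses a shortened copy of the trace with the data buffers elided. The function and variable names are hypothetical.

```python
import re

BLOCK_SIZE = 4096  # assumed; matches the 4096-byte reads and writes in the trace

# Shortened copy of the trace above; the buffer contents are elided.
STRACE = """\
lseek(3, 6206310776832, SEEK_SET)       = 6206310776832
read(3, "..."..., 4096) = 4096
lseek(3, 1263113973760, SEEK_SET)       = 1263113973760
write(3, "..."..., 4096) = 4096
lseek(3, 5861641846784, SEEK_SET)       = 5861641846784
read(3, "..."..., 4096) = 4096
lseek(3, 1263113977856, SEEK_SET)       = 1263113977856
write(3, "..."..., 4096) = 4096
"""

LSEEK_RE = re.compile(r"lseek\(\d+, (\d+), SEEK_SET\)")
OP_RE = re.compile(r"(read|write)\(\d+,")


def block_accesses(trace):
    """Return a list of (operation, block_number) pairs from an strace excerpt."""
    accesses = []
    offset = None
    for line in trace.splitlines():
        line = line.strip()
        seek = LSEEK_RE.match(line)
        if seek:
            offset = int(seek.group(1))   # remember where the next I/O lands
            continue
        op = OP_RE.match(line)
        if op and offset is not None:
            accesses.append((op.group(1), offset // BLOCK_SIZE))
            offset = None                 # each lseek pairs with one I/O call
    return accesses


if __name__ == "__main__":
    for op, block in block_accesses(STRACE):
        print(f"{op:5s} block {block}")
```

With these assumptions the two writes land on consecutive blocks while the reads come from widely separated places, which would fit e2fsck copying shared blocks one at a time into newly allocated space.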

There's only about one read per second, so the fsck might take rather
long if it continues to run at this speed ;)
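
For a feeling of what that rate means, here is a back-of-the-envelope sketch. It assumes one 4 KiB block is copied per second, as the strace output suggests. The number of blocks that still have to be cloned is unknown, so the figure used in the example is purely hypothetical.

```python
BLOCK_SIZE = 4096  # assumed bytes copied per read/write pair


def hours_to_clone(num_blocks, blocks_per_second=1.0):
    """Estimate the hours needed to clone num_blocks at the given rate."""
    if blocks_per_second <= 0:
        raise ValueError("blocks_per_second must be positive")
    return num_blocks / blocks_per_second / 3600.0


def bytes_copied(hours, blocks_per_second=1.0):
    """Estimate how many bytes are copied in the given number of hours."""
    return int(hours * 3600 * blocks_per_second) * BLOCK_SIZE


if __name__ == "__main__":
    # Hypothetical workload: one million multiply-claimed blocks.
    print(f"{hours_to_clone(1_000_000):.1f} hours for 1,000,000 blocks")
    print(f"{bytes_copied(17) / 2**20:.0f} MiB copied in 17 hours")
```

At this rate, 17 hours of work would have copied only around 239 MiB.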

It has been running for 34 hours now and I don't know what to do, so here are
a couple of questions for you ext3 gurus:

	Is there any hope this will ever complete?

	Should I abort the fsck and restart?

	Do things get even worse if I abort it and mount the file
	system r/o so that I can see whether important files are
	still there?

	Are there any magic e2fsck command line options I should try?

The box is a 2xQuad Core Intel machine with 32G Ram and is running
a vanilla 2.6.25.20 kernel. Any help is greatly appreciated.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe