Hi,

I'm having trouble checking a corrupted 9T ext3 file system which resides on a logical volume. The underlying physical volumes are three hardware raid systems, one of which started to crash frequently. I was able to pvmove the data away from the buggy system, so everything is fine now on the hardware side.

However, the crashes left me with a seriously corrupted file system from which I'm trying to recover as much as possible. The first step was to unmount the file system after users reported I/O errors when trying to open files. The system log contained many messages like

    [102445.420125] EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 544108393

and some of the form

    [160301.277477] EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #153542738: rec_len % 4 != 0 - offset=0, inode=1381653864, rec_len=26709, name_len=79

So I compiled the master branch of the e2fsprogs git repo as of Dec 1 (tip: 8680b4) and executed

    ./e2fsck -y -C0 /dev/mapper/abel-abt6_projects

This ran for a while and then started to output a couple of these:

    Inode table for group 68217 is not in group. (block 825373744)
    WARNING: SEVERE DATA LOSS POSSIBLE.

along with many lines of the form

    Illegal block #3036172 (4233778405) in inode 115335438. CLEARED.

But then it continued just fine without printing further messages. After about 4 hours it completed but decided to re-run from the beginning, and this is where the real trouble seems to start. The next day I found thousands of lines like this on the console:

    /backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)

followed by

    Clone multiply-claimed blocks? yes

At this point the fsck seems to hang. No further messages, no progress bar for at least 17 hours.
The lights on the raid system aren't flashing, but there seems to be a bit of I/O going on, as stracing the e2fsck process yields

    lseek(3, 6206310776832, SEEK_SET) = 6206310776832
    read(3, "002107740635\tD\t2\t169\t35\t0\thhhhhh"..., 4096) = 4096
    lseek(3, 1263113973760, SEEK_SET) = 1263113973760
    write(3, "B9K@=?4C=L-F77F4:CGGK\n3\t14221118"..., 4096) = 4096
    lseek(3, 5861641846784, SEEK_SET) = 5861641846784
    read(3, "hhhhhh\tIIIIIIIIIIIIIIIIIIIIIIIII"..., 4096) = 4096
    lseek(3, 1263113977856, SEEK_SET) = 1263113977856
    write(3, "\t1.00\t0.46\t19\t4\t2\t0\t1\tA\t33\t31\t0\t"..., 4096) = 4096

There is only about one read per second, so the fsck might take rather long if it continues to run at this speed ;)

It has been running for 34 hours now and I don't know what to do, so here are a couple of questions for you ext3 gurus:

Is there any hope this will ever complete?

Should I abort the fsck and restart?

Do things get even worse if I abort it and mount the file system r/o so that I can see whether important files are still there?

Are there any magic e2fsck command line options I should try?

The box is a 2xQuad Core Intel machine with 32G RAM and is running a vanilla 2.6.25.20 kernel.

Any help is greatly appreciated.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
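
P.S. To make the strace excerpt above easier to read, here is a rough sketch that turns the lseek byte offsets into block numbers. It assumes a block size of 4096 bytes, which matches the size of the reads and writes in the trace but is not stated anywhere in this mail; the names BLOCK_SIZE, TRACE and to_block are made up for illustration.

```python
# Assumption: the file system uses 4096-byte blocks. The strace output shows
# 4096-byte reads and writes, but the real block size is not given in the mail.
BLOCK_SIZE = 4096

# (syscall, byte offset) pairs copied from the strace excerpt above.
TRACE = [
    ("read", 6206310776832),
    ("write", 1263113973760),
    ("read", 5861641846784),
    ("write", 1263113977856),
]


def to_block(offset, block_size=BLOCK_SIZE):
    """Convert a byte offset on the device into a block number."""
    block, remainder = divmod(offset, block_size)
    if remainder:
        raise ValueError(f"offset {offset} is not aligned to {block_size}")
    return block


# Block number for every traced syscall, in the order they were issued.
blocks = [(op, to_block(off)) for op, off in TRACE]

for op, block in blocks:
    print(f"{op:5s} block {block}")
```

Under that assumption the two writes land on consecutive blocks while the two reads come from widely separated places. That would fit e2fsck copying blocks one at a time while cloning the multiply-claimed ones, though this is only a guess from four syscalls.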