Re: FSCK of corrupted ext3 filesystem

Matt Bernstein <mb/ext3@xxxxxxxxxxxxxx> · Fri, 19 Aug 2005 13:32:29 +0100 (BST)

On May 23 Darryl Bond wrote:

I have a 1.3TB ext3 filesystem that has been in service for about 3 months.
About 6 days ago the Emulex Fibre Channel controller logged a SCSI error and 
the filesystem went read-only.
It appears that the filesystem switches to read-only immediately, which stops 
the journal from working and leaves the filesystem inconsistent.
The filesystem was unmounted and a remount was attempted. The mount failed due 
to errors and an fsck came up with errors.

Top output looks like this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 4562 root      25   0  780m 214m  236 R 99.9 42.6   6211:44 fsck.ext3

I'm seeing something rather similar, and not for the first time :-\

The MD layer failed a drive (on a 3ware Escalade card), but somehow the fs 
got wind of this and aborted the journal.

My fsck is running on an Opteron. It is entirely CPU-bound, occupies about 
1.4G of my 2G of RAM, and is still in pass 2 six days in. strace on the 
process shows no system calls at all.
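
For anyone wanting to check the same thing without strace, here is a rough 
sketch in Python. It assumes a kernel that exposes per-process I/O counters 
in /proc/<pid>/io (newer kernels do; I have not checked older ones), and the 
function names are made up for illustration, not part of e2fsprogs. It takes 
two samples and reports which counters moved. If nothing moves while the 
process sits at full CPU, fsck is computing in memory rather than touching 
the disk.

```python
import time


def parse_proc_io(text):
    """Parse the text of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def io_progress(before, after):
    """Return only the counters that grew between two samples."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }


def watch(pid, interval=60):
    """Sample /proc/<pid>/io twice, `interval` seconds apart (needs root)."""
    path = "/proc/%d/io" % pid
    with open(path) as f:
        before = parse_proc_io(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_io(f.read())
    return io_progress(before, after)
```

Calling watch() with the PID of the running fsck.ext3 returns an empty dict 
when no counter changed during the interval. That only says the process did 
no I/O; it does not say whether fsck will ever finish.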

My question is basically the same as Darryl's. How long do I give it?

(I did SIGKILL an earlier invocation as I hadn't passed the "-y" option.)

As my volume is all backup data, I'm willing to poke at it with debugfs if 
people on this list think it's worth a try. Maybe I can mark it as not 
having errors, and try to mount it? Or maybe there's a way of making fsck 
less thorough?

I don't like the idea of not having backups for more than a week. What I 
did last time this happened was to run mke2fs and start again from 
scratch. Can I do better this time?

Matt

_______________________________________________

Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users