Re: Problems with checking corrupted large ext3 file system

Theodore Tso <tytso@xxxxxxx> · Thu, 4 Dec 2008 14:51:38 -0500

On Thu, Dec 04, 2008 at 05:37:59PM +0100, Andre Noll wrote:
> OK, so I guess I would like to run e2fsck again without cloning those
> blocks.

Actually, what you should probably do is to take a look at the inodes
which were listed in the pass1b, and if they don't make sense, to
clear them out.  An individual inode can be cleared by using the
debugfs clri command.  i.e., to zero out inode 12345, you do this:

debugfs -w /dev/mapper/thunk-closure
debugfs: clri <12345>
debugfs: quit

This doesn't work very easily if there is a large number of inodes
that contain garbage, though.  I don't have tools that deal well with
wholesale corruption of large portions of the inode table, mostly
because those tools, if misused, could actually cause a lot more harm
than good.  Designing the proper safety mechanisms so they are safe
to use in the hands of system administrators who are not filesystem
experts and tend to use commands like "fsck -y" is very difficult to
get right.  It's also a failure mode which happens rarely, so it's
never been a high priority to figure out how to create tools that can
safely handle this problem in the general case.
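
For a longer list of inodes, one way to avoid typing each clri by
hand is to generate a command file and pass it to debugfs with its -f
option.  The following is only a sketch, not an existing e2fsprogs
tool: the function name and the example inode numbers are hypothetical,
and it does nothing to decide which inodes are actually garbage.

```python
def build_clri_script(inode_numbers):
    """Return debugfs commands, one 'clri <N>' line per distinct inode.

    The caller is assumed to have reviewed the list already; this
    function only formats it and rejects obviously invalid numbers.
    """
    lines = []
    for ino in sorted(set(inode_numbers)):
        if ino < 1:
            raise ValueError(f"invalid inode number: {ino}")
        lines.append(f"clri <{ino}>\n")
    return "".join(lines)


# Example (hypothetical inode numbers taken by hand from pass 1B output):
#   with open("clri.cmds", "w") as f:
#       f.write(build_clri_script([12345, 12346, 12350]))
```

The resulting file could then be run with something like
"debugfs -w -f clri.cmds <device>".  Every cleared inode is gone for
good, so the list deserves a careful review first.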

If you're convinced that all of the inode tables greater than 4TB have
been corrupted, or blocks from a particular physical volume are *all*
toast, one solution is to zero out all of the damaged blocks, on the
theory that there's nothing to save anyway.  e2fsck tries hard to save
all possible data, so if you know there's nothing to save there,
clearing the parts of the inode table that are known to be bad will
make e2fsck run more cleanly.

In the long run, I can imagine enhancements to ext4 where we reserve 4
bytes in each inode which are used collectively to store
information to assure ourselves an inode table block really contains
valid data and not random garbage.  The first inode in an inode table
block would use the 4 byte field to store the first inode number in
the itable block.  The second inode in the inode table block would
store the block number for the current itable block.  Each subsequent
inode, for up to 32 inodes, would use the 4 byte field to store
successive 4 bytes of the filesystem UUID.  This would allow e2fsck to
validate whether a particular inode table block read in from disk
really was valid or not.  (I'm deliberately not including an actual
checksum since that would complicate matters, and if we are going to
store a checksum, we should have one set of fields which indicates
that this block belongs to filesystem XYZ's inode table starting at
position A, and another set of fields that indicates whether one or
more bits in the itable block have gotten flipped.  The two are
different concepts and how we react may differ depending on what we
know is incorrect.)
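
The layout described above can be sketched as follows.  This is purely
illustrative: no such field exists in the on-disk format, and the
function names, the little-endian encoding, the truncation of the block
number to 32 bits, and the choice to cycle through the four UUID words
when a block holds more than six inodes are all assumptions.

```python
import struct
import uuid


def expected_tags(first_ino, block_nr, fs_uuid, inodes_per_block):
    """Return the 4-byte tag each inode slot in one itable block should hold."""
    # The 16-byte UUID split into four 32-bit words (little-endian assumed).
    words = struct.unpack("<4I", fs_uuid.bytes)
    tags = []
    for slot in range(inodes_per_block):
        if slot == 0:
            tags.append(first_ino & 0xFFFFFFFF)   # first inode number in block
        elif slot == 1:
            tags.append(block_nr & 0xFFFFFFFF)    # this itable block's number
        else:
            tags.append(words[(slot - 2) % 4])    # successive UUID words
    return tags


def block_looks_valid(tags, first_ino, block_nr, fs_uuid):
    """True if the tags read from disk match what this block should contain."""
    return list(tags) == expected_tags(first_ino, block_nr, fs_uuid, len(tags))
```

A block of random garbage would have to match every tag by chance to
pass, while a stale block from another filesystem or another position
in the table would fail on the UUID words or on the first two fields.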

> > One option is to use the Lustre e2fsprogs which has a patch that tries
> > to detect such "garbage" inodes and wipe them clean, instead of trying
> > to continue using them.
> > 
> > 	http://downloads.lustre.org/public/tools/e2fsprogs/latest/
> > 
> > That said, it may be too late to help because the previous e2fsck run
> > will have done a lot of work to "clean up" the garbage inodes and they
> > may no longer be above the "bad inode threshold".
> 
> I would love to give it a try if it gets me an intact file system
> within hours rather than days or even weeks because it avoids the
> lengthy algorithm that clones the multiply-claimed blocks.

Well, it's still worth a shot.

> As the box is running a Ubuntu, I could not install the rpm directly.
> So I compiled the source from e2fsprogs-1.40.11.sun1.tar.gz which is
> contained in e2fsprogs-1.40.11.sun1-0redhat.src.rpm. gcc complained
> about unsafe format strings but produced the e2fsck executable.
> 
> Do I need to use any command line option to the patched e2fsck? And
> is there anything else I should consider before killing the currently
> running e2fsck?

Nope, try it and let us know whether it seems to work.  It might be
possible to augment the heuristics to detect the bad inodes (i.e.,
check to see if the modtimes/ctimes reflect times that are totally
outside of what might be considered "normal", as an indication of the
itable block's sanity).
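
A minimal sketch of such a timestamp check is shown below.  It is not
what the Lustre patch or e2fsck does; the function name, the time
window, and the 50% threshold are hypothetical values chosen only for
illustration.

```python
def itable_block_suspect(inode_times, earliest, latest, threshold=0.5):
    """Guess whether an itable block is garbage from its timestamps.

    inode_times: list of (mtime, ctime) pairs, in seconds since the
                 epoch, for the in-use inodes of one itable block.
    earliest, latest: bounds of what counts as a "normal" time here.
    threshold: fraction of inodes with out-of-range times at which the
               whole block is flagged (an arbitrary assumption).
    """
    if not inode_times:
        return False  # nothing to judge
    bad = sum(
        1
        for mtime, ctime in inode_times
        if not (earliest <= mtime <= latest and earliest <= ctime <= latest)
    )
    return bad / len(inode_times) >= threshold
```

For example, the window could run from the filesystem's creation time
to the current time, so that a block full of dates far in the past or
future stands out.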

But long-term (although it probably won't help you), we should
seriously think about adding some inode sanity-check fields whose
primary purpose is to tell us whether an itable block is likely to be
valid or garbage.

						- Ted