Re: [RFD] Incremental fsck

Theodore Tso <tytso@xxxxxxx> · Sat, 12 Jan 2008 09:51:40 -0500

On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> 
> Ok, but let's look at this a bit more opportunistic / optimistic.
> 
> Even after a black-out shutdown, the corruption is pretty minimal, using 
> ext3fs at least.
>

After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are off....

> So let's take advantage of this fact and do an optimistic fsck, to
> assure integrity per-dir, and assume no external corruption.  Then
> we release this checked dir to the wild (optionally ro), and check
> the next.  Once we find external inconsistencies we either fix it
> unconditionally, based on some preconfigured actions, or present the
> user with options.

So what can you check?  The *only* thing you can check is whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.

What is very hard to check is whether or not the link count on the
inode is correct.  Suppose the link count is 1, but there are actually
two directory entries pointing at it.  Now when someone unlinks the
file through one of the directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname.  Oops.  Data Loss.
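
To make the link count problem concrete, here is a toy sketch in
Python.  It is not how e2fsck is implemented; the data layout and the
names (directories, stored_link_count, check_one_directory,
check_link_counts) are all made up for illustration.  It only shows
that each directory can look sane on its own while the link count is
still wrong, and that you have to walk every directory before you can
say anything about it.

```python
from collections import Counter

# Toy model (hypothetical, not the ext3 on-disk format):
# each directory maps a name to an inode number, and each inode
# has the link count that is recorded on disk.
directories = {
    "/home/a": {"report.txt": 12},
    "/home/b": {"copy.txt": 12, "notes.txt": 13},
}
stored_link_count = {12: 1, 13: 1}  # inode 12 is wrong: it has two names


def check_one_directory(entries, stored):
    """Per-directory check: can only see whether entries point at known inodes."""
    return [ino for ino in entries.values() if ino not in stored]


def check_link_counts(dirs, stored):
    """Global check: count references across *all* directories, then compare."""
    seen = Counter(ino for entries in dirs.values() for ino in entries.values())
    # Map each mismatching inode to (stored count, observed count).
    return {ino: (stored[ino], seen[ino])
            for ino in stored if stored[ino] != seen[ino]}


if __name__ == "__main__":
    for path, entries in directories.items():
        print(path, "bad entries:", check_one_directory(entries, stored_link_count))
    print("link count mismatches:", check_link_counts(directories, stored_link_count))
```

Every directory passes the per-directory check, and the mismatch on
inode 12 only shows up once both directories have been scanned.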

This is why doing incremental, on-line fsck'ing is *hard*.  You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse.  And
it's not just the hard link count.  There is a similar issue with the
block allocation bitmap.  Detecting the case where two files are
simultaneously claiming the same block can't be done if you are doing
it incrementally, and if
the filesystem is changing out from under you, it's impossible, unless
you also have the filesystem telling you every single change while it
is happening, and you keep an insane amount of bookkeeping.
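
As a rough illustration of the block allocation case, here is a toy
sketch in Python.  The inode-to-blocks table and the function name
find_shared_blocks are hypothetical, and this is not taken from
e2fsck.  The point is only that finding a block claimed by two inodes
needs a map covering every inode on the filesystem, and that map is
stale the moment the filesystem changes underneath it.

```python
from collections import defaultdict

# Toy model (hypothetical): inode number -> block numbers it claims.
inode_blocks = {
    12: [100, 101, 102],
    13: [200, 201],
    14: [102, 300],  # block 102 is also claimed by inode 12
}


def find_shared_blocks(blocks_by_inode):
    """Return {block: [inodes]} for every block claimed by more than one inode."""
    owners = defaultdict(list)
    for ino, blocks in blocks_by_inode.items():
        for blk in blocks:
            owners[blk].append(ino)
    return {blk: inos for blk, inos in owners.items() if len(inos) > 1}


if __name__ == "__main__":
    # Looking at any single inode finds nothing wrong.
    print(find_shared_blocks({12: inode_blocks[12]}))
    # Only the scan over all inodes exposes the conflict.
    print(find_shared_blocks(inode_blocks))
```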

One thing that you *might* be able to do is to mount a filesystem readonly,
check it in the background while you allow users to access it
read-only.  There are a few caveats, however: (1) some filesystem
errors may cause the data to be corrupt, or in the worst case, could
cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should.)  (2) if there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write.  You can't just do a "mount -o remount" if the filesystem
was modified under the OS's nose.

> All this could be per-dir or using some form of on-the-fly file-block-zoning.
> 
> And there probably is a lot more to it, but it should conceptually be 
> possible, with more thoughts though...

Many things are possible, in the NASA sense of "with enough thrust,
anything will fly".  Whether or not it is *useful* and *worthwhile*
are of course different questions!  :-)

						- Ted