Re: topics for the file system mini-summit

Andreas Dilger <adilger@xxxxxxxxxxxxx> · Tue, 30 May 2006 00:14:29 -0600

On May 29, 2006  15:29 -0400, Ric Wheeler wrote:
> Andreas Dilger wrote:
> >Instead of a filesystem-wide "error" bit we could move this per-group to
> >only mark the block or inode bitmaps in error if they have a checksum
> >failure.  This would prevent allocations from that group to avoid further
> >potential corruption of the filesystem metadata.
> >
> >Once an error is detected then a filesystem service thread or a userspace
> >helper would walk the inode table (starting in the current group, which
> >is most likely to hold the relevant data) recreating the respective bitmap
> >table and keeping a "valid bit" bitmap as well.  Once all of the bits
> >in the bitmap are marked valid then we can start using this group again.
>
> That is a neat idea - would you lose complete access to the impacted 
> group, or have you thought about "best effort" read-only while under repair?

I think we would only need to prevent new allocation from the group if the
bitmap is corrupted.  The extent format already has a magic number to give
a very quick sanity check (unlike indirect blocks which can be filled with
random garbage on large filesystems and still appear valid).  We are looking
at adding checksums in the extent metadata and could also do extra internal
consistency checks to validate this metadata (e.g. sequential ordering of
logical offsets, non-overlapping logical offsets, proper parent->child
logical offset hierarchy, etc).
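
As a rough illustration of the kind of internal consistency checks meant
here (a Python sketch for brevity, not actual filesystem code; the Extent
and ExtentNode types, check_node(), and the EXTENT_MAGIC value are all
assumptions made up for this example):

```python
from dataclasses import dataclass
from typing import List, Optional

EXTENT_MAGIC = 0xF30A  # placeholder magic value for this sketch


@dataclass
class Extent:
    logical: int  # first logical block covered by this entry
    length: int   # number of blocks covered


@dataclass
class ExtentNode:
    magic: int
    entries: List[Extent]


def check_node(node: ExtentNode, lo: int = 0,
               hi: Optional[int] = None) -> List[str]:
    """Return a list of problems found in one extent node.

    lo/hi give the logical range the parent says this node covers
    (hi is exclusive, None means unbounded).
    """
    if node.magic != EXTENT_MAGIC:
        return ["bad magic"]  # quick sanity check, nothing else is trusted
    problems = []
    prev_end = None
    for i, e in enumerate(node.entries):
        end = e.logical + e.length
        if e.length <= 0:
            problems.append(f"entry {i}: zero-length extent")
        # Entries must be sorted by logical offset and must not overlap.
        if prev_end is not None and e.logical < prev_end:
            problems.append(f"entry {i}: out of order or overlapping")
        # A child must stay inside the range its parent assigned to it.
        if e.logical < lo or (hi is not None and end > hi):
            problems.append(f"entry {i}: outside parent range")
        prev_end = end if prev_end is None else max(prev_end, end)
    return problems
```

A node that passes returns an empty list; any non-empty result would be a
reason to distrust the node before freeing or mapping blocks through it.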

So, we are mostly safe from the "incorrect block free" side, and just need
to worry about the "bitmap says the block is free, but don't trust that and
reallocate it" problem.
Allowing unlinks in a group also allows the "valid" bitmap to be updated
when the bits are cleared, so this is beneficial to the end goal of getting
an all-valid block bitmap.  We could even get fancier and allow blocks
marked valid to be used for allocations, but that is more complex than I
would like.
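
To make the per-group state concrete, here is a toy model (a Python
sketch, all names hypothetical).  It assumes that once the scanner has
walked the whole inode table, every block that nobody referenced can be
marked valid-and-free, which is one way to reach the all-valid state:

```python
class GroupRepairState:
    """Toy model of one block group whose block bitmap failed its checksum."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.in_use = [False] * nblocks  # block bitmap being rebuilt
        self.valid = [False] * nblocks   # "valid bit" for each bitmap bit
        self.in_error = True             # per-group error flag

    def note_in_use(self, block):
        """Scanner found an inode that references this block."""
        self.in_use[block] = True
        self.valid[block] = True

    def free_block(self, block):
        """Unlink/truncate path: a freed block is known to be free."""
        self.in_use[block] = False
        self.valid[block] = True

    def finish_scan(self):
        """Full inode table walk is done: unreferenced blocks are free."""
        self.valid = [True] * self.nblocks
        self.in_error = False

    def allocate(self):
        """Refuse all allocations while the group is still in error."""
        if self.in_error:
            return None
        for block, used in enumerate(self.in_use):
            if not used:
                self.in_use[block] = True
                return block
        return None
```

Note that allocate() deliberately does not hand out blocks that are merely
marked valid while the group is in error; that is the simpler policy
described above.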

> One thing that has worked very well for us is that we keep a digital 
> signature of each user object (MD5, SHAX hash, etc) so we can validate 
> that what we wrote is what got read back.  This also provides a very 
> powerful sanity check after getting hit by failing media or severe file 
> system corruption since whatever we do manage to salvage (which might 
> not be all files) can be validated.

Yes, we've looked at this also for Lustre (we can already do checksums
from the client memory down to the server disk), but the problem of
consistency in the face of write/truncate/append and a crash is complex.
There's also the issue of whether to do partial-file checksums (in order
to allow more efficient updates) or full-file checksums.
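
The trade-off can be sketched like this (Python, purely illustrative; the
chunk size, the choice of SHA-256, and the function names are assumptions,
and the sketch ignores the crash-consistency problem entirely).  A
full-file checksum has to re-read the whole file after any write, while
per-chunk checksums only need the touched chunks recomputed:

```python
import hashlib

CHUNK = 4096  # assumed chunk size for partial-file checksums


def full_checksum(data: bytes) -> str:
    """One digest over the whole file contents."""
    return hashlib.sha256(data).hexdigest()


def chunk_checksums(data: bytes, chunk: int = CHUNK) -> list:
    """One digest per fixed-size chunk of the file."""
    return [hashlib.sha256(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]


def update_chunks(sums, data, offset, length, chunk=CHUNK):
    """Recompute only the chunk digests touched by a write.

    `data` is the file contents after the write.  Assumes the write does
    not leave a hole, i.e. offset is not past the old end of file.
    Returns (new_sums, number_of_chunks_recomputed).
    """
    if length <= 0:
        return list(sums), 0
    first = offset // chunk
    last = (offset + length - 1) // chunk
    if first > len(sums):
        raise ValueError("write would leave a hole; not handled here")
    new = list(sums)
    for idx in range(first, last + 1):
        digest = hashlib.sha256(data[idx * chunk:(idx + 1) * chunk]).hexdigest()
        if idx < len(new):
            new[idx] = digest      # overwrite of an existing chunk
        else:
            new.append(digest)     # file grew by appending
    return new, last - first + 1
```

A small overwrite in the middle of a large file touches one chunk digest
here, versus a full re-read for the single full-file digest.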

I believe at one point there was work on a checksum loop device, but this
also has potential consistency problems in the face of a crash.

> For general purpose read/write work loads, I wonder if it would make 
> sense to compute and store such a checksum or signature on close (say in 
> an extended attribute)?  It might be useful to use another of those 
> special attributes (like immutable attribute) to indicate that this file 
> is important enough to digitally sign on close.

Hmm, good idea.  If a file is immutable, it is fairly certain that it won't
be modified any time soon, so it is a good candidate for checksumming.
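
From userspace, the checksum-on-close idea might look roughly like this
(a Python sketch under stated assumptions: the attribute name
"user.checksum.sha256" is made up, SHA-256 is an arbitrary choice, and the
store/load callables are passed in so the example does not depend on the
filesystem supporting user extended attributes):

```python
import hashlib

ATTR_NAME = "user.checksum.sha256"  # hypothetical attribute name


def file_digest(path, bufsize=1 << 20):
    """Digest of the file contents, read in bounded-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def sign_file(path, store):
    """Compute the digest after close and hand it to store(path, name, value)."""
    digest = file_digest(path)
    store(path, ATTR_NAME, digest.encode("ascii"))
    return digest


def verify_file(path, load):
    """Compare the stored digest, from load(path, name), with the contents."""
    return load(path, ATTR_NAME).decode("ascii") == file_digest(path)
```

On Linux, Python's os.setxattr and os.getxattr accept these arguments in
this order, so they could be passed as store and load directly on a
filesystem that supports user extended attributes.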

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
