On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

What's your goal here? Is it to speed up fsck's after an unclean
shutdown, to the point where you don't need to use a journal or some
kind of soft updates scheme? Is it to speed up fsck's after the kernel
has detected some kind of internal consistency error? Is it to speed
up fsck's run because you no longer have confidence that random blocks
in the filesystem haven't been corrupted, due to a suspend/resume bug,
a hard drive failure, reported CRC/parity errors when writing to the
device, or reports of massive ECC failures in your memory that could
have caused random blocks to have been written with multiple bit
flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to rely on a
per-tile "clean bit" to decide that you can skip checking a particular
tile or chunk.

> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

So what happens for files that are bigger than 2MB? Presumably they
consist of blocks that must come from more than one tile, right? So is
an inode allowed to reference blocks outside of its tile? Or does an
inode that needs to span multiple tiles have a local sub-inode in each
tile, with backpointers to the parent inode?

Note that both design paths have some serious tradeoffs. If you allow
an inode to span multiple tiles, you can't check the block allocation
data structures without scanning all of the tiles involved. If you
have sub-inodes, you have to have bidirectional pointers, and you have
to validate those pointers after validating all of the individual
tiles. This is one of the toughest aspects of either the chunkfs or
tilefs design, and when we discussed chunkfs at past filesystem
workshops, we didn't come to any firm conclusions about the best way
to solve it, except to acknowledge that it is a hard problem.

My personal inclination is to use substantially bigger chunks than the
2MB that you've proposed, making each of the chunks more like 2 or 4
GB, to enforce a rule which says an inode in a chunk can only
reference blocks in that local chunk, and to try like mad to keep
directories referencing inodes in the same chunk, and inodes
referencing blocks within that chunk. When a file is bigger than a
chunk, you will be forced to use indirection pointers that basically
say, "for offsets 2GB-4GB, reference inode XXXX in chunk YYYY, and for
4GB-8GB, check out inode WWWW in chunk ZZZZ", etc. I won't say that
this is definitely the best way to do things, but I note that you
haven't really addressed this design point, and there is no obvious
best way of handling it.
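As a quick sketch of what I mean by those indirection pointers (this
is my own illustration, not anything from Matt's proposal, and all of
the names are made up), the on-disk inode could carry a small extent
table mapping offset ranges to continuation inodes in other chunks:

    #include <stdint.h>

    #define MAX_CONT_EXTENTS 8

    /* Maps a range of logical file offsets to a continuation inode
     * that lives in, and may only reference blocks within, some
     * other chunk. */
    struct cont_extent {
            uint64_t start;         /* first byte offset covered      */
            uint64_t end;           /* one past the last byte covered */
            uint32_t chunk;         /* chunk holding the continuation */
            uint32_t cont_ino;      /* inode number within that chunk */
    };

    struct chunked_inode {
            /* ... the usual inode fields ... */
            uint32_t nr_cont;
            struct cont_extent cont[MAX_CONT_EXTENTS];
    };

    /* Answer "which (chunk, inode) actually owns this offset?", e.g.
     * "for offsets 2GB-4GB, reference inode XXXX in chunk YYYY". */
    static const struct cont_extent *
    find_continuation(const struct chunked_inode *ino, uint64_t off)
    {
            uint32_t i;

            for (i = 0; i < ino->nr_cont; i++)
                    if (off >= ino->cont[i].start && off < ino->cont[i].end)
                            return &ino->cont[i];
            return NULL;    /* offset is owned by the local chunk */
    }

The nice property is that each chunk's checker can validate a
continuation inode's block pointers purely locally; only the small
extent tables need a cross-chunk validation pass.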
> [Note that CRCs are optional so we can cut the overhead in half. I
> choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRC's are just fine. If
you want better protection against accidental corruption, then the
answer is to use a bigger CRC. Using a cryptographic hash like SHA-n
is pure overkill unless you're trying to design protection against a
malicious attacker, in which case you've got a much bigger set of
problems that you have to address first --- you don't get a
cryptographically secure filesystem by replacing a CRC with a SHA-n
hash function....

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum. After
all, this smaller set of clean bitmap blocks is going to be constantly
updated as tiles get dirtied and then cleaned. What if they get
corrupted? How does the checker notice? And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

> Checking a tile:
>
>  Read the tile
>  If clean and current, we're done.
>  Check the tile header checksum
>  Check the checksum on each block in the tile
>  Check that metadata blocks are metadata
>  Check that inodes in tile agree with inode bitmap
>  Check that data blocks in tile agree with block bitmap

This looks like an inode can only reference blocks in the local tile;
is that your assumption? In that case the question of how you handle
inodes bigger than your tile size is really important.

>  Check for each inode covering tile according to header:
>   Marked in-use
>   Blocks are owned by the inode
>    Blocks incorrectly owned put on list for tree sweep or orphaned
>   Block count is right
>   Traverse backpointer linked list, check link count and checksum
>    Inodes with incorrect refcount on tree sweep list or orphaned
>   If directory:
>    Points to in-use inodes
>  Mark tile clean
>
> If all tiles are simultaneously clean, fs is clean.

There are a number of filesystem corruptions this algorithm won't
catch. The most obvious is one where the directory tree isn't really
a tree, but a cyclic graph. What if you have something like this:

              A <----+
             / \     |
            B   C    ^
           /         |
          D----->----+

That is, what if D's parent is B, and B's parent is A, and A's parent
is... D? Assume for the sake of argument that each inode, A, B, C, D,
is in a separate tile. In theory this shouldn't happen except if there
are bugs in the filesystem, so if your tile checking algorithm assumes
no bugs, and no malicious modifications of the filesystem using a
debugfs-like program, maybe you don't consider that a bug in your
checking algorithm. This goes back to my question of what your design
goals are centered around.

> Performance implications:
>
> Keeping the average read-mostly filesystem fully-checked should be easy

For a write-mostly filesystem, there will likely be a large number of
tiles that are left dirty. In addition, there will be some very nasty
performance implications, in that in order for this algorithm to work,
no block in a clean tile can be written until the tile's dirty bit is
set.
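A sketch of what that ordering constraint implies, as I read it (the
helper names here are hypothetical, not anything from Matt's
proposal):

    #include <stdint.h>

    struct tile_header {
            uint8_t clean;      /* plus timestamps, per-block CRCs... */
    };

    struct tile {
            struct tile_header hdr;
            /* ... */
    };

    /* Hypothetical I/O helpers standing in for real block-layer calls. */
    extern void write_tile_header(struct tile *t);
    extern void blkdev_flush(void);  /* e.g. a FLUSH/FUA-style barrier */
    extern int  write_block(uint64_t blocknr, const void *data);

    /* No write may land in a clean tile until its dirty bit is durable;
     * otherwise a crash could leave a corrupted tile still marked clean. */
    int write_block_in_tile(struct tile *t, uint64_t blocknr,
                            const void *data)
    {
            if (t->hdr.clean) {
                    t->hdr.clean = 0;
                    write_tile_header(t);   /* push the dirty bit out... */
                    blkdev_flush();         /* ...and wait for the media */
            }
            return write_block(blocknr, data);  /* only now touch data */
    }

In the worst case that's an extra synchronous header write plus a full
cache flush in front of the first write to every clean tile.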
So for a benchmark that heavily features the use of fsync(), such as
postmark, we can expect that this filesystem may not do so well. That
may be OK if it's outside of the design envelope of this filesystem
--- is this meant to be used only for read-mostly filesystems, or is
it intended to be a general-purpose filesystem that might be used on
a fileserver or a mail server?

						- Ted