On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also
> > be very small, while greatly improving filesystem integrity.
>
> What's your goal here? Is it to speed up fsck's after an unclean
> shutdown to the point that you don't need to use a journal or some
> kind of soft updates scheme?
>
> Is it to speed up fsck's after the kernel has detected some kind of
> internal consistency error?
>
> Is it to speed up fsck's after you no longer have confidence that
> random blocks in the filesystem haven't gotten corrupted, due to a
> suspend/resume bug, hard drive failure, reported CRC/parity errors
> when writing to the device, or reports of massive ECC failures in
> your memory that could have caused random blocks to have been
> written with multiple bit flips?

Mostly this last.

> The first is relatively easy, but as you move down the list, things
> get progressively harder, since it's no longer possible to use a
> per-tile "clean bit" to assume that you get to skip checking that
> particular tile or chunk.

Tile headers contain a timestamp that lets you say "let's revisit every clean block that hasn't been checked in the last 6 months". And the blocks containing the clean bits are themselves covered by CRC!

> > Divide disk into a bunch of tiles. For each tile, allocate a
> > one-block tile header that contains (inode, checksum) pairs for
> > each block in the tile. Unused blocks get marked inode -1,
> > filesystem metadata blocks -2. The first element contains a
> > last-clean timestamp, a clean flag and a checksum for the block
> > itself. For 4K blocks with 32-bit inode and CRC, that's 512 blocks
> > per tile (2MB), with ~.2% overhead.
>
> So what happens for files that are bigger than 2MB? Presumably they
> consist of blocks that must come from more than one tile, right? So
> is an inode allowed to reference blocks outside of its tile? Or does
> an inode that needs to span multiple tiles have a local sub-inode in
> each tile, with back pointers to the parent inode?

We don't need any form of sub-inode or continuation inode, which is why it's not mentioned. This is one of two main differences from chunkfs. The other is reverse maps (aka back pointers) for blocks -> inodes and inodes -> directories, which obviate the need for large amounts of memory to check for collisions.

When you check a tile, you check all the inodes that "cover" that tile (ie have overlapping data or metadata). So if a large file spans many tiles, you end up reading many tiles to check one. A completely naive checker would thus check the same file many times over the course of a full scan, and total work would be multiplied by the average tile-crossing rate for files.

However, most of this work can be trivially avoided: you can keep a small in-memory LRU cache of recently-checked inodes to avoid repeatedly checking files in neighboring blocks as your scan progresses. This LRU doesn't need to be very big at all to be quite effective.

This does mean that our time to make progress on a check is bounded at the top by the size of our largest file. If we have a degenerate filesystem filled with a single file, this will in fact take as long as a conventional fsck. If your filesystem has, say, 100 roughly equally-sized files, you're back in chunkfs territory.
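To make that concrete, here is a rough sketch of how a tile header and the per-tile check with an inode LRU might look. This is my illustration, not anything from the proposal: the struct layout, TILE_BLOCKS, INO_FREE/INO_METADATA, and the helpers block_crc_ok(), check_inode(), lru_contains(), lru_add() and mark_tile_clean() are all assumed names, and the real on-disk format would differ.

/*
 * Illustrative sketch only -- field names, sizes and helper functions
 * are assumptions, not the proposed on-disk format.
 */
#include <stdbool.h>
#include <stdint.h>

#define TILE_BLOCKS   512            /* 4K blocks per 2MB tile */
#define INO_FREE      ((uint32_t)-1) /* block is unallocated */
#define INO_METADATA  ((uint32_t)-2) /* block holds fs metadata */

struct tile_entry {
        uint32_t inode;              /* owning inode, INO_FREE or INO_METADATA */
        uint32_t crc;                /* CRC of the block's contents */
};

struct tile_header {                 /* fills one 4K block per tile */
        uint64_t last_clean;         /* when this tile was last verified */
        uint32_t clean;              /* clean flag */
        uint32_t header_crc;         /* CRC of this header block itself */
        struct tile_entry map[TILE_BLOCKS - 2]; /* per-block entries; exact packing illustrative */
};

struct inode_lru;                    /* small cache of recently verified inodes */

/* Hypothetical helpers provided by the rest of the checker. */
extern bool block_crc_ok(const struct tile_header *th, int i);
extern bool check_inode(uint32_t ino);   /* walk data, back pointers, CRCs */
extern bool lru_contains(struct inode_lru *lru, uint32_t ino);
extern void lru_add(struct inode_lru *lru, uint32_t ino);
extern void mark_tile_clean(struct tile_header *th);

/* Check one tile, skipping inodes verified moments ago. */
static bool check_tile(struct tile_header *th, struct inode_lru *lru)
{
        bool clean = true;

        for (int i = 0; i < TILE_BLOCKS - 2; i++) {
                uint32_t ino = th->map[i].inode;

                if (ino == INO_FREE)
                        continue;                /* nothing to verify */
                if (!block_crc_ok(th, i))
                        clean = false;           /* contents don't match stored CRC */
                if (ino == INO_METADATA || lru_contains(lru, ino))
                        continue;                /* metadata, or file already checked */
                if (check_inode(ino))            /* may span many other tiles */
                        lru_add(lru, ino);
                else
                        clean = false;
        }
        if (clean)
                mark_tile_clean(th);             /* set clean bit + timestamp */
        return clean;
}

The point of the LRU is simply that check_inode() for a file spanning many tiles gets run once per scan rather than once per tile it touches.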
Note that even though our time usage scales with the maximum file size, our memory usage is bounded by a small constant. We only need one tile header in memory plus whatever it takes to walk an inode's data. The block->inode reverse map means that we don't have to keep a filesystem- or chunk-sized map to spot block collisions. Each block knows where it belongs and spotting broken links is easy.

So we should have no trouble checking an exabyte-sized filesystem on a 4MB box. Even if it has one exabyte-sized file! We check the first tile, see that it points to our file, then iterate through that file, checking that the forward and reverse pointers for each block match and that all CRCs match, etc. We cache the file's inode as clean, finish checking anything else in the first tile, then mark it clean. When we get to the next tile (and the next billion after that!), we notice that each block points back to our cached inode and skip rechecking it.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> Hopefully the clean bitmap block is protected by a checksum. After
> all, the smaller set of clean bitmap blocks is going to be constantly
> updated as tiles get dirtied, and then cleaned. What if they get
> corrupted? How does the checker notice? And presumably if there is a
> CRC that doesn't verify, it would have to check all of the tiles,
> right?

All blocks are protected by (still optional!) checksums. In an ext2/3-like filesystem, this includes superblocks, inodes, inode and block bitmaps, the journal(!), and the tile headers themselves.

> There are a number of filesystem corruptions this algorithm won't
> catch. The most obvious is one where the directory tree isn't really
> a tree, but a cyclic graph. What if you have something like this:
>
>        A <----+
>       / \     |
>      B   C    ^
>     /         |
>    D----->----+
>
> That is, what if D's parent is B, and B's parent is A, and A's parent
> is... D? Assume for the sake of argument that each inode, A, B, C, D,
> is in a separate tile.

From the original message:

  Inodes have a backpointer to a directory that links them. Hardlinked
  files have two extra inode pointers in the directory structure, to
  the previous and next directories containing the link. Hardlinked
  inodes have a checksum of the members of that list.

When we check directory D, D's inode has a backpointer (which had better match ".." in the directory itself, if we keep that redundancy!). If we can follow this back to the root (using a standard two-pointer cycle detection algorithm), we have no cycle. As we also check that every inode pointed to by directory D points back to D, any deviation from a valid tree can be detected. And again, a small cache of inodes known to be properly rooted will save a lot of checks.
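Here's a minimal sketch of that two-pointer walk, in the same hypothetical style as above: parent_of() stands in for "follow the inode's directory backpointer" and ROOT_INO for the root directory, neither of which is a real interface from the proposal.

/* Illustrative only -- parent_of() and ROOT_INO are assumed helpers. */
#include <stdbool.h>
#include <stdint.h>

#define ROOT_INO 2                       /* root directory inode, ext2-style */

extern uint32_t parent_of(uint32_t ino); /* follow the directory backpointer */

/* True if walking backpointers from 'ino' reaches the root, i.e. no cycle. */
static bool properly_rooted(uint32_t ino)
{
        uint32_t slow = ino, fast = ino;

        while (fast != ROOT_INO) {
                fast = parent_of(fast);          /* hare: two steps per iteration */
                if (fast == ROOT_INO)
                        break;
                fast = parent_of(fast);
                slow = parent_of(slow);          /* tortoise: one step */
                if (fast == slow)
                        return false;            /* backpointers loop, never reach root */
        }
        return true;
}

In practice the walk would also stop early as soon as it hits an inode already in that known-properly-rooted cache, so most directories need only a step or two.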
> > Performance implications:
> >
> > Keeping the average read-mostly filesystem fully-checked should be
> > easy.
>
> For a write-mostly filesystem --- there will likely be a large number
> of tiles that will be left dirty. In addition, there will be some
> very nasty performance implications in that in order for this
> algorithm to work, no blocks involving a clean tile can be written
> until the dirty bit is set. So for a benchmark that heavily features
> the use of fsync(), such as postmark, we can expect that this
> filesystem may not do so well. That may be OK if that's outside of
> the design envelope of this filesystem --- is this meant to be only
> used for read-mostly filesystems, or is it intended to be a
> general-purpose filesystem that might be used on a fileserver or a
> mail server?

I don't think that's so problematic. In the postmark case, we usually won't need to flush out the dirty bit, as it'll already be set.

Can we get into a situation where we can never finish checking the filesystem online because it's constantly being dirtied? Definitely. I don't think we can avoid having to pass through a quiescent period of some duration. But with small tiles, the length of that quiescent period is roughly proportional to max(largest file touched, total data touched).

--
Mathematics is the supreme nostalgia of our time.