[RFC] TileFS - a proposal for scalable integrity checking

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This is a relatively simple scheme for making a filesystem with
incremental online consistency checks of both data and metadata.
Overhead can be well under 1% disk space and CPU overhead may also be
very small, while greatly improving filesystem integrity.

Some things we need to check during fsck:

 all directories point to in-use inodes
 all in-use inodes are referred to by directories
 all inodes in use are marked in use
 all free inodes are marked free
 all inodes point to in-use blocks
 all inode refcounts are correct
 all inode block counts are correct
 free inode count is correct

 no blocks are used twice
 all used blocks are marked as used
 all free blocks are marked as free
 optional: all block contents are correct

Layout:

 Start with a conventional ext2/3-like filesystem.

 Divide disk into a bunch of tiles. For each tile, allocate a one
 block tile header that contains (inode, checksum) pairs for each
 block in the tile. Unused blocks get marked inode -1, filesystem
 metadata blocks -2. The first element contains a last-clean
 timestamp, a clean flag and a checksum for the block itself. For 4K
 blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
 with ~.2% overhead.

 [Note that CRCs are optional so we can cut the overhead in half. I
 choose CRCs here because they're capable of catching the vast
 majority of accidental corruptions at a small cost and mostly serve
 to protect against errors not caught by on-disk ECC (eg cable noise,
 kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
 SHA-n is perfectly doable.]

 Every time we write to a tile, we must mark the tile dirty. To cut
 down time to find dirty tiles, the clean bits can be collected into a
 smaller set of blocks, one clean bitmap block per 64GB of data.

 Inodes have a backpointer to a directory that links them. Hardlinked
 files have two extra inode pointers in the directory structure, to
 the previous and next directories containing the link. Hardlinked
 inodes have a checksum of the members of that list.

Checking a tile:

 Read the tile
 If clean and current, we're done.
 Check the tile header checksum
 Check the checksum on each block in the tile
 Check that metadata blocks are metadata
 Check that inodes in tile agree with inode bitmap
 Check that data blocks in tile agree with block bitmap
 Check for each inode covering tile according to header:
  Marked in-use
  Blocks are owned by the inode
    Blocks incorrectly owned put on list for tree sweep or orphaned
  Block count is right
  Traverse backpointer linked list, check link count and checksum
    Inodes with incorrect refcount on tree sweep list or orphaned
  If directory:
    Points to in-use inodes
 Mark tile clean

 If all tiles are simultaneously clean, fs is clean.

Tree sweeps and orphans:

 For damaged backpointers, we may want to walk the directory tree to
 repair the links. Memory usage here is bounded by the size of the
 orphan list and we may be able to trim the walk when examining files
 or directories inside of clean tiles.

 Alternately, we can move orphaned blocks and inodes to lost+found/
 and forgo the more expensive tree walk.

Performance implications:

 Amount of RAM needed to check an FS is tiny
 Tile headers are very near their blocks, so usually don't require a seek
 Caching overhead is relatively small
 Most actual tile header writes can be folded in the journal
 Hardlinks are a bit of a pain, but somewhat rare
 We can cache checks of inodes covering multiple tiles
 Huge file deletes have to hit even more blocks than ext2/3
 It's antithetical to using extents, as is any system with block CRCs
 Keeping the average read-mostly filesystem fully-checked should be easy

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux