This is a relatively simple scheme for making a filesystem with incremental online consistency checks of both data and metadata. Overhead can be well under 1% disk space and CPU overhead may also be very small, while greatly improving filesystem integrity. Some things we need to check during fsck: all directories point to in-use inodes all in-use inodes are referred to by directories all inodes in use are marked in use all free inodes are marked free all inodes point to in-use blocks all inode refcounts are correct all inode block counts are correct free inode count is correct no blocks are used twice all used blocks are marked as used all free blocks are marked as free optional: all block contents are correct Layout: Start with a conventional ext2/3-like filesystem. Divide disk into a bunch of tiles. For each tile, allocate a one block tile header that contains (inode, checksum) pairs for each block in the tile. Unused blocks get marked inode -1, filesystem metadata blocks -2. The first element contains a last-clean timestamp, a clean flag and a checksum for the block itself. For 4K blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB), with ~.2% overhead. [Note that CRCs are optional so we can cut the overhead in half. I choose CRCs here because they're capable of catching the vast majority of accidental corruptions at a small cost and mostly serve to protect against errors not caught by on-disk ECC (eg cable noise, kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like SHA-n is perfectly doable.] Every time we write to a tile, we must mark the tile dirty. To cut down time to find dirty tiles, the clean bits can be collected into a smaller set of blocks, one clean bitmap block per 64GB of data. Inodes have a backpointer to a directory that links them. Hardlinked files have two extra inode pointers in the directory structure, to the previous and next directories containing the link. Hardlinked inodes have a checksum of the members of that list. Checking a tile: Read the tile If clean and current, we're done. Check the tile header checksum Check the checksum on each block in the tile Check that metadata blocks are metadata Check that inodes in tile agree with inode bitmap Check that data blocks in tile agree with block bitmap Check for each inode covering tile according to header: Marked in-use Blocks are owned by the inode Blocks incorrectly owned put on list for tree sweep or orphaned Block count is right Traverse backpointer linked list, check link count and checksum Inodes with incorrect refcount on tree sweep list or orphaned If directory: Points to in-use inodes Mark tile clean If all tiles are simultaneously clean, fs is clean. Tree sweeps and orphans: For damaged backpointers, we may want to walk the directory tree to repair the links. Memory usage here is bounded by the size of the orphan list and we may be able to trim the walk when examining files or directories inside of clean tiles. Alternately, we can move orphaned blocks and inodes to lost+found/ and forgo the more expensive tree walk. Performance implications: Amount of RAM needed to check an FS is tiny Tile headers are very near their blocks, so usually don't require a seek Caching overhead is relatively small Most actual tile header writes can be folded in the journal Hardlinks are a bit of a pain, but somewhat rare We can cache checks of inodes covering multiple tiles Huge file deletes have to hit even more blocks than ext2/3 It's antithetical to using extents, as is any system with block CRCs Keeping the average read-mostly filesystem fully-checked should be easy -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html