On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

What's your goal here? Is it to speed up fsck's after an unclean
shutdown, to the point where you don't need to use a journal or some
kind of soft updates scheme? Is it to speed up fsck's after the kernel
has detected some kind of internal consistency error? Is it to speed
up fsck's run because you no longer have confidence that random blocks
in the filesystem haven't been corrupted, due to a suspend/resume bug,
a hard drive failure, reported CRC/parity errors when writing to the
device, or reports of massive ECC failures in your memory that could
have caused random blocks to have been written with multiple bit
flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to rely on a
per-tile "clean bit" to decide that you can skip checking a particular
tile or chunk.

> Divide disk into a bunch of tiles. For each tile, allocate a one
> block tile header that contains (inode, checksum) pairs for each
> block in the tile. Unused blocks get marked inode -1, filesystem
> metadata blocks -2. The first element contains a last-clean
> timestamp, a clean flag and a checksum for the block itself. For 4K
> blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
> with ~.2% overhead.

So what happens for files that are bigger than 2MB? Presumably they
consist of blocks that must come from more than one tile, right? So is
an inode allowed to reference blocks outside of its tile? Or does an
inode that needs to span multiple tiles have a local sub-inode in each
tile, with backpointers to the parent inode?

Note that both design paths have some serious tradeoffs. If you allow
an inode to span multiple tiles, you can't check the block allocation
data structures without scanning all of the tiles involved. If you
have sub-inodes, you have to have bidirectional pointers, and you have
to validate those pointers after validating all of the individual
tiles. This is one of the toughest aspects of either the chunkfs or
tilefs design, and when we discussed chunkfs at past filesystem
workshops, we didn't come to any firm conclusions about the best way
to solve it, except to acknowledge that it is a hard problem.

My personal inclination is to use substantially bigger chunks than the
2MB that you've proposed, making each of the chunks more like 2 or 4
GB, to enforce a rule which says an inode in a chunk can only
reference blocks in that local chunk, and to try like mad to keep
directories referencing inodes in the same chunk, and inodes
referencing blocks within that chunk. When a file is bigger than a
chunk, you will be forced to use indirection pointers that basically
say, "for offsets 2GB-4GB, reference inode XXXX in chunk YYYY, and for
4GB-8GB, check out inode WWWW in chunk ZZZZ", etc. I won't say that
this is definitely the best way to do things, but I note that you
haven't really addressed this design point, and there is no obvious
best way of handling it.
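As a quick sketch of what I mean by those indirection pointers (this
is my own illustration, not anything from Matt's proposal, and all of
the names are made up), the on-disk inode could carry a small extent
table mapping offset ranges to continuation inodes in other chunks:

    #include <stdint.h>

    #define MAX_CONT_EXTENTS 8

    /* Maps a range of logical file offsets to a continuation inode
     * that lives in, and may only reference blocks within, some
     * other chunk. */
    struct cont_extent {
            uint64_t start;         /* first byte offset covered      */
            uint64_t end;           /* one past the last byte covered */
            uint32_t chunk;         /* chunk holding the continuation */
            uint32_t cont_ino;      /* inode number within that chunk */
    };

    struct chunked_inode {
            /* ... the usual inode fields ... */
            uint32_t nr_cont;
            struct cont_extent cont[MAX_CONT_EXTENTS];
    };

    /* Answer "which (chunk, inode) actually owns this offset?", e.g.
     * "for offsets 2GB-4GB, reference inode XXXX in chunk YYYY". */
    static const struct cont_extent *
    find_continuation(const struct chunked_inode *ino, uint64_t off)
    {
            uint32_t i;

            for (i = 0; i < ino->nr_cont; i++)
                    if (off >= ino->cont[i].start && off < ino->cont[i].end)
                            return &ino->cont[i];
            return NULL;    /* offset is owned by the local chunk */
    }

The nice property is that each chunk's checker can validate a
continuation inode's block pointers purely locally; only the small
extent tables need a cross-chunk validation pass.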
> [Note that CRCs are optional so we can cut the overhead in half. I
> choose CRCs here because they're capable of catching the vast
> majority of accidental corruptions at a small cost and mostly serve
> to protect against errors not caught by on-disk ECC (eg cable noise,
> kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
> SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRC's are just fine. If
you want better protection against accidental corruption, then the
answer is to use a bigger CRC. Using a cryptographic hash like SHA-n
is pure overkill unless you're trying to design protection against a
malicious attacker, in which case you've got a much bigger set of
problems that you have to address first --- you don't get a
cryptographically secure filesystem by replacing a CRC with a SHA-n
hash function....

> Every time we write to a tile, we must mark the tile dirty. To cut
> down time to find dirty tiles, the clean bits can be collected into a
> smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum. After
all, this smaller set of clean bitmap blocks is going to be constantly
updated as tiles get dirtied and then cleaned. What if they get
corrupted? How does the checker notice? And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

> Checking a tile:
>
>  Read the tile
>  If clean and current, we're done.
>  Check the tile header checksum
>  Check the checksum on each block in the tile
>  Check that metadata blocks are metadata
>  Check that inodes in tile agree with inode bitmap
>  Check that data blocks in tile agree with block bitmap

This looks like an inode can only reference blocks in the local tile;
is that your assumption? In that case the question of how you handle
inodes bigger than your tile size is really important.

>  Check for each inode covering tile according to header:
>   Marked in-use
>   Blocks are owned by the inode
>    Blocks incorrectly owned put on list for tree sweep or orphaned
>   Block count is right
>   Traverse backpointer linked list, check link count and checksum
>    Inodes with incorrect refcount on tree sweep list or orphaned
>   If directory:
>    Points to in-use inodes
>  Mark tile clean
>
> If all tiles are simultaneously clean, fs is clean.

There are a number of filesystem corruptions this algorithm won't
catch. The most obvious is one where the directory tree isn't really
a tree, but a cyclic graph. What if you have something like this:

              A <----+
             / \     |
            B   C    ^
           /         |
          D----->----+

That is, what if D's parent is B, and B's parent is A, and A's parent
is... D? Assume for the sake of argument that each inode, A, B, C, D,
is in a separate tile. In theory this shouldn't happen except if there
are bugs in the filesystem, so if your tile checking algorithm assumes
no bugs, and no malicious modifications of the filesystem using a
debugfs-like program, maybe you don't consider that a bug in your
checking algorithm. This goes back to my question of what your design
goals are centered around.

> Performance implications:
>
> Keeping the average read-mostly filesystem fully-checked should be easy

For a write-mostly filesystem, there will likely be a large number of
tiles that are left dirty. In addition, there will be some very nasty
performance implications, in that in order for this algorithm to work,
no block in a clean tile can be written until the tile's dirty bit is
set.
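A sketch of what that ordering constraint implies, as I read it (the
helper names here are hypothetical, not anything from Matt's
proposal):

    #include <stdint.h>

    struct tile_header {
            uint8_t clean;      /* plus timestamps, per-block CRCs... */
    };

    struct tile {
            struct tile_header hdr;
            /* ... */
    };

    /* Hypothetical I/O helpers standing in for real block-layer calls. */
    extern void write_tile_header(struct tile *t);
    extern void blkdev_flush(void);  /* e.g. a FLUSH/FUA-style barrier */
    extern int  write_block(uint64_t blocknr, const void *data);

    /* No write may land in a clean tile until its dirty bit is durable;
     * otherwise a crash could leave a corrupted tile still marked clean. */
    int write_block_in_tile(struct tile *t, uint64_t blocknr,
                            const void *data)
    {
            if (t->hdr.clean) {
                    t->hdr.clean = 0;
                    write_tile_header(t);   /* push the dirty bit out... */
                    blkdev_flush();         /* ...and wait for the media */
            }
            return write_block(blocknr, data);  /* only now touch data */
    }

In the worst case that's an extra synchronous header write plus a full
cache flush in front of the first write to every clean tile.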
So for a benchmark that heavily features the use of fsync(), such as
postmark, we can expect that this filesystem may not do so well. That
may be OK if it's outside of the design envelope of this filesystem
--- is this meant to be used only for read-mostly filesystems, or is
it intended to be a general-purpose filesystem that might be used on
a fileserver or a mail server?

						- Ted