On Sun, Apr 29, 2007 at 07:23:49PM -0400, Theodore Tso wrote:
> On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
> > This is a relatively simple scheme for making a filesystem with
> > incremental online consistency checks of both data and metadata.
> > Overhead can be well under 1% disk space and CPU overhead may also
> > be very small, while greatly improving filesystem integrity.
>
> What's your goal here? Is it to speed up fsck's after an unclean
> shutdown to the point that you don't need to use a journal or some
> kind of soft updates scheme?
>
> Is it to speed up fsck's after the kernel has detected some kind of
> internal consistency error?
>
> Is it to speed up fsck's after you no longer have confidence that
> random blocks in the filesystem haven't gotten corrupted, due to a
> suspend/resume bug, hard drive failure, reported CRC/parity errors
> when writing to the device, or reports of massive ECC failures in
> your memory that could have caused random blocks to have been
> written with multiple bit flips?

Mostly this last.

> The first is relatively easy, but as you move down the list, things
> get progressively harder, since it's no longer possible to use a
> per-tile "clean bit" to assume that you get to skip checking that
> particular tile or chunk.

Tile headers contain a timestamp that lets you say "let's revisit every clean block that hasn't been checked in the last 6 months". And the blocks containing the clean bits are themselves covered by CRC!

> > Divide disk into a bunch of tiles. For each tile, allocate a
> > one-block tile header that contains (inode, checksum) pairs for
> > each block in the tile. Unused blocks get marked inode -1,
> > filesystem metadata blocks -2. The first element contains a
> > last-clean timestamp, a clean flag and a checksum for the block
> > itself. For 4K blocks with 32-bit inode and CRC, that's 512 blocks
> > per tile (2MB), with ~.2% overhead.
>
> So what happens for files that are bigger than 2MB? Presumably they
> consist of blocks that must come from more than one tile, right? So
> is an inode allowed to reference blocks outside of its tile? Or does
> an inode that needs to span multiple tiles have a local sub-inode in
> each tile, with back pointers to the parent inode?

We don't need any form of sub-inode or continuation inode, which is why it's not mentioned. This is one of two main differences from chunkfs. The other is reverse maps (aka back pointers) for blocks -> inodes and inodes -> directories, which obviate the need for large amounts of memory to check for collisions.

When you check a tile, you check all the inodes that "cover" that tile (ie have overlapping data or metadata). So if a large file spans many tiles, you end up reading many tiles to check one. A completely naive checker would thus check the same file many times over the course of a full scan, and total work would be multiplied by the average tile-crossing rate for files.

However, most of this work can be trivially avoided: you can keep a small in-memory LRU cache of recently-checked inodes to avoid repeatedly checking files in neighboring blocks as your scan progresses. This LRU doesn't need to be very big at all to be quite effective.

This does mean that our time to make progress on a check is bounded at the top by the size of our largest file. If we have a degenerate filesystem filled with a single file, this will in fact take as long as a conventional fsck. If your filesystem has, say, 100 roughly equally-sized files, you're back in chunkfs territory.
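To make that concrete, here is a rough sketch of how a tile header and the per-tile check with an inode LRU might look. This is my illustration, not anything from the proposal: the struct layout, TILE_BLOCKS, INO_FREE/INO_METADATA, and the helpers block_crc_ok(), check_inode(), lru_contains(), lru_add() and mark_tile_clean() are all assumed names, and the real on-disk format would differ.

/*
 * Illustrative sketch only -- field names, sizes and helper functions
 * are assumptions, not the proposed on-disk format.
 */
#include <stdbool.h>
#include <stdint.h>

#define TILE_BLOCKS   512            /* 4K blocks per 2MB tile */
#define INO_FREE      ((uint32_t)-1) /* block is unallocated */
#define INO_METADATA  ((uint32_t)-2) /* block holds fs metadata */

struct tile_entry {
        uint32_t inode;              /* owning inode, INO_FREE or INO_METADATA */
        uint32_t crc;                /* CRC of the block's contents */
};

struct tile_header {                 /* fills one 4K block per tile */
        uint64_t last_clean;         /* when this tile was last verified */
        uint32_t clean;              /* clean flag */
        uint32_t header_crc;         /* CRC of this header block itself */
        struct tile_entry map[TILE_BLOCKS - 2]; /* per-block entries; exact packing illustrative */
};

struct inode_lru;                    /* small cache of recently verified inodes */

/* Hypothetical helpers provided by the rest of the checker. */
extern bool block_crc_ok(const struct tile_header *th, int i);
extern bool check_inode(uint32_t ino);   /* walk data, back pointers, CRCs */
extern bool lru_contains(struct inode_lru *lru, uint32_t ino);
extern void lru_add(struct inode_lru *lru, uint32_t ino);
extern void mark_tile_clean(struct tile_header *th);

/* Check one tile, skipping inodes verified moments ago. */
static bool check_tile(struct tile_header *th, struct inode_lru *lru)
{
        bool clean = true;

        for (int i = 0; i < TILE_BLOCKS - 2; i++) {
                uint32_t ino = th->map[i].inode;

                if (ino == INO_FREE)
                        continue;                /* nothing to verify */
                if (!block_crc_ok(th, i))
                        clean = false;           /* contents don't match stored CRC */
                if (ino == INO_METADATA || lru_contains(lru, ino))
                        continue;                /* metadata, or file already checked */
                if (check_inode(ino))            /* may span many other tiles */
                        lru_add(lru, ino);
                else
                        clean = false;
        }
        if (clean)
                mark_tile_clean(th);             /* set clean bit + timestamp */
        return clean;
}

The point of the LRU is simply that check_inode() for a file spanning many tiles gets run once per scan rather than once per tile it touches.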
Note that even though our time usage scales with the maximum file size, our memory usage is bounded by a small constant. We only need one tile header in memory plus whatever it takes to walk an inode's data. The block->inode reverse map means that we don't have to keep a filesystem- or chunk-sized map to spot block collisions. Each block knows where it belongs and spotting broken links is easy.

So we should have no trouble checking an exabyte-sized filesystem on a 4MB box. Even if it has one exabyte-sized file! We check the first tile, see that it points to our file, then iterate through that file, checking that the forward and reverse pointers for each block match and that all CRCs match, etc. We cache the file's inode as clean, finish checking anything else in the first tile, then mark it clean. When we get to the next tile (and the next billion after that!), we notice that each block points back to our cached inode and skip rechecking it.

> > Every time we write to a tile, we must mark the tile dirty. To cut
> > down time to find dirty tiles, the clean bits can be collected into
> > a smaller set of blocks, one clean bitmap block per 64GB of data.
>
> Hopefully the clean bitmap block is protected by a checksum. After
> all, the smaller set of clean bitmap blocks is going to be constantly
> updated as tiles get dirtied, and then cleaned. What if they get
> corrupted? How does the checker notice? And presumably if there is a
> CRC that doesn't verify, it would have to check all of the tiles,
> right?

All blocks are protected by (still optional!) checksums. In an ext2/3-like filesystem, this includes superblocks, inodes, inode and block bitmaps, the journal(!), and the tile headers themselves.

> There are a number of filesystem corruptions this algorithm won't
> catch. The most obvious is one where the directory tree isn't really
> a tree, but a cyclic graph. What if you have something like this:
>
>        A <----+
>       / \     |
>      B   C    ^
>     /         |
>    D----->----+
>
> That is, what if D's parent is B, and B's parent is A, and A's parent
> is... D? Assume for the sake of argument that each inode, A, B, C, D,
> is in a separate tile.

From the original message:

  Inodes have a backpointer to a directory that links them. Hardlinked
  files have two extra inode pointers in the directory structure, to
  the previous and next directories containing the link. Hardlinked
  inodes have a checksum of the members of that list.

When we check directory D, D's inode has a backpointer (which had better match ".." in the directory itself, if we keep that redundancy!). If we can follow this back to the root (using a standard two-pointer cycle detection algorithm), we have no cycle. As we also check that every inode pointed to by directory D points back to D, any deviation from a valid tree can be detected. And again, a small cache of inodes known to be properly rooted will save a lot of checks.
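Here's a minimal sketch of that two-pointer walk, in the same hypothetical style as above: parent_of() stands in for "follow the inode's directory backpointer" and ROOT_INO for the root directory, neither of which is a real interface from the proposal.

/* Illustrative only -- parent_of() and ROOT_INO are assumed helpers. */
#include <stdbool.h>
#include <stdint.h>

#define ROOT_INO 2                       /* root directory inode, ext2-style */

extern uint32_t parent_of(uint32_t ino); /* follow the directory backpointer */

/* True if walking backpointers from 'ino' reaches the root, i.e. no cycle. */
static bool properly_rooted(uint32_t ino)
{
        uint32_t slow = ino, fast = ino;

        while (fast != ROOT_INO) {
                fast = parent_of(fast);          /* hare: two steps per iteration */
                if (fast == ROOT_INO)
                        break;
                fast = parent_of(fast);
                slow = parent_of(slow);          /* tortoise: one step */
                if (fast == slow)
                        return false;            /* backpointers loop, never reach root */
        }
        return true;
}

In practice the walk would also stop early as soon as it hits an inode already in that known-properly-rooted cache, so most directories need only a step or two.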
> > Performance implications:
> >
> > Keeping the average read-mostly filesystem fully-checked should be
> > easy.
>
> For a write-mostly filesystem --- there will likely be a large number
> of tiles that will be left dirty. In addition, there will be some
> very nasty performance implications in that in order for this
> algorithm to work, no blocks involving a clean tile can be written
> until the dirty bit is set. So for a benchmark that heavily features
> the use of fsync(), such as postmark, we can expect that this
> filesystem may not do so well. That may be OK if that's outside of
> the design envelope of this filesystem --- is this meant to be only
> used for read-mostly filesystems, or is it intended to be a
> general-purpose filesystem that might be used on a fileserver or a
> mail server?

I don't think that's so problematic. In the postmark case, we usually won't need to flush out the dirty bit, as it'll already be set.

Can we get into a situation where we can never finish checking the filesystem online because it's constantly being dirtied? Definitely. I don't think we can avoid having to pass through a quiescent period of some duration. But with small tiles, the length of that quiescent period is roughly proportional to max(largest file touched, total data touched).

--
Mathematics is the supreme nostalgia of our time.