On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> There's a very real reason many places don't use filesystems that do
> fsck any more.

So fsck has been on my mind a lot lately - seems like fsck related
things are all I'm working on these days - and incidentally,
"filesystems that don't need fsck" is a myth.

The reason is that filesystems have mutable global state, and that
state has to be consistent for the filesystem to work correctly: at a
minimum, global usage counters that have to be correct for -ENOSPC to
work, and allocation/free space maps that have to be correct to not
double allocate.

The only way out of this is to do a pure logging filesystem, i.e.
nilfs, and there's a reason those never took off - compacting overhead
is too high, they don't scale with real world usage (and you still
have to give up on posix -ENOSPC, not that that's any real loss).

Even in distributed land, filesystems may get away without a
traditional precise fsck, but they get away with it by leaving the
heavy lifting (precise allocation information) to something like a
traditional local filesystem that does have it. And they still need,
at a minimum, a global GC operation - but GC is just a toy version of
fsck to the filesystem developer; it'll have the same algorithmic
complexity as traditional fsck, just without having to be precise.

(Incidentally, the main check allocations fsck pass in bcachefs is
directly descended from the runtime GC code in bcache, even if it's
barely recognizable now.)

And a filesystem needs to be able to cope with extreme damage to be
considered fit for purpose - we need to degrade gracefully if there's
corruption, not tell the user "oops, your filesystem is inaccessible"
if something got scribbled over.

I consider it flatly unacceptable to not be able to recover a
filesystem if there's data on it. If you blew away the superblock and
all the backup superblocks by running mkfs, /that's/ pretty much
unrecoverable because there's too much in the superblock we really
need, but literally anything else we should be able to recover from -
and automatically is the goal.

So there's a lot of interesting challenges in fsck:

- Scaling: fsck is pretty much the limiting factor on filesystem
  scalability. If it wasn't for fsck, bcachefs would probably scale up
  to an exabyte fairly trivially. Making fsck scale to exabyte range
  filesystems is going to take a _lot_ of clever sharding and clever
  algorithms.

- Continuing to run gracefully in the presence of damage wherever
  possible, instead of forcing fsck to be run right away. If
  allocation info is corrupt in the wrong ways such that we might
  double allocate, that's a problem, and interior btree nodes being
  toast requires expensive repair, but we should be able to keep
  running with most other types of corruption.

  That hasn't been the priority while in development - in development
  we want to fail fast and noisily, so that bugs get reported and the
  filesystem is left in a state where we can see what happened - but
  this is an area I'm starting to work on now.

- Making sure that fsck never makes things worse.

  You really don't want fsck to ever delete anything; that could be
  absolutely tragic in the event of any sort of transient error (or
  bug).
  We've still got a bit of work to do here with pointers to indirect
  extents, and there are probably other cases that need to be looked
  at - I think XFS is ahead of bcachefs here. I know Darrick has a
  notion of "tainted" metadata, the idea being that if a pointer to an
  indirect extent points to a missing extent, we don't delete it, we
  just flag it as tainted: don't log more fsck errors, just return
  -EIO when reading from it; then, if we're able to recover the
  indirect extent later, we can just clear the tainted flag. (There's
  a rough sketch of the idea at the end of this mail.)

We've got some fun tricks for getting back online as quickly as
possible even in the event of catastrophic damage. If alloc info is
suspect, we can do a very quick pass that walks all pointers and just
marks a "bucket is currently allocated, don't use" bitmap, and defer
repairing or rebuilding the actual alloc info until we're online, in
the background. And if interior btree nodes are toast and we need to
scan (which shouldn't ever happen, but users are users and hardware is
hardware, and I haven't done btrfs dup style replication because you
can't trust SSDs to lay writes out on different erase units), there's
a bitmap in the superblock of ranges that have btree nodes, so the
scan pass on a modern filesystem shouldn't take too long.
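
To make the "tainted, not deleted" idea above concrete, here's a
minimal userspace sketch of the pattern - to be clear, this is not
actual bcachefs or XFS code, and every name in it (struct extent_ptr,
PTR_TAINTED, check_extent_ptr, ...) is made up for illustration: fsck
flags the pointer instead of deleting anything, the read path quietly
returns -EIO, and the flag gets cleared if the missing indirect extent
ever comes back:

/*
 * Toy sketch of "flag it as tainted, don't delete it" - not real
 * bcachefs or XFS code, all names invented for illustration.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define PTR_TAINTED (1U << 0)   /* points to a missing indirect extent */

struct extent_ptr {
        unsigned long long idx;         /* index of the indirect extent */
        unsigned           flags;
};

/* Stand-in for "does the indirect extent exist" - in a real filesystem
 * this would be a btree lookup. */
static bool indirect_extent_exists(unsigned long long idx)
{
        return idx != 42;       /* pretend extent 42 got lost */
}

/* fsck: instead of deleting the pointer, mark it and move on. */
static void check_extent_ptr(struct extent_ptr *ptr)
{
        if (!indirect_extent_exists(ptr->idx)) {
                if (!(ptr->flags & PTR_TAINTED))
                        fprintf(stderr, "missing indirect extent %llu, marking pointer tainted\n",
                                ptr->idx);
                ptr->flags |= PTR_TAINTED;
        } else {
                /* extent came back (recovered later): clear the taint */
                ptr->flags &= ~PTR_TAINTED;
        }
}

/* read path: tainted pointers return -EIO quietly, no more fsck noise */
static int read_extent(const struct extent_ptr *ptr)
{
        if (ptr->flags & PTR_TAINTED)
                return -EIO;
        /* ... actually read the data ... */
        return 0;
}

int main(void)
{
        struct extent_ptr p = { .idx = 42 };

        check_extent_ptr(&p);
        printf("read: %d\n", read_extent(&p));  /* -EIO: flagged, not deleted */
        return 0;
}

The point of the flag is that nothing is thrown away: the moment the
indirect extent is recovered, the taint can be cleared and the data is
right back where it was.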
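
The quick "bucket is currently allocated, don't use" pass is simple
enough to sketch too. Again, this is a toy with made-up names, and a
flat array of bucket numbers standing in for walking the real btrees:
the only invariant we need in order to get writable again is "don't
double allocate", so one cheap walk that sets a bit per referenced
bucket is enough, and rebuilding the precise alloc info can happen
later, in the background:

/*
 * Toy sketch of the "mark buckets in use, repair alloc info later"
 * trick - made-up names, not real bcachefs code.
 */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_BUCKETS      1024
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)

static unsigned long bucket_in_use[NR_BUCKETS / BITS_PER_WORD + 1];

static void mark_bucket(size_t b)
{
        bucket_in_use[b / BITS_PER_WORD] |= 1UL << (b % BITS_PER_WORD);
}

static bool bucket_is_marked(size_t b)
{
        return bucket_in_use[b / BITS_PER_WORD] & (1UL << (b % BITS_PER_WORD));
}

/* Quick pass: every pointer marks the bucket it lives in. */
static void mark_buckets_from_ptrs(const size_t *ptr_buckets, size_t nr)
{
        for (size_t i = 0; i < nr; i++)
                mark_bucket(ptr_buckets[i]);
}

/* The allocator only needs "don't double allocate" to be safe, so
 * skipping every marked bucket is enough to run while the precise
 * alloc info is rebuilt in the background. */
static long alloc_bucket(void)
{
        for (size_t b = 0; b < NR_BUCKETS; b++)
                if (!bucket_is_marked(b)) {
                        mark_bucket(b);
                        return (long) b;
                }
        return -1;      /* -ENOSPC in real life */
}

int main(void)
{
        size_t in_use[] = { 0, 1, 2, 7 };

        mark_buckets_from_ptrs(in_use, sizeof(in_use) / sizeof(in_use[0]));
        printf("first free bucket: %ld\n", alloc_bucket());    /* 3 */
        return 0;
}

The real pass obviously has to walk btrees, deal with multiple devices
and so on - but algorithmically it really is just "set a bit for every
bucket a pointer lives in, and only allocate from unset buckets until
proper repair has run".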