On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> There's a very real reason many places don't use filesystems that do
> fsck any more.

So fsck has been on my mind a lot lately - seems like fsck related
things are all I'm working on these days - and incidentally,
"filesystems that don't need fsck" is a myth.

The reason is that filesystems have mutable global state, and that
state has to be consistent for the filesystem to work correctly: at a
minimum, global usage counters that have to be correct for -ENOSPC to
work, and allocation/free space maps that have to be correct to not
double allocate.

The only way out of this is to do a pure logging filesystem, i.e.
nilfs, and there's a reason those never took off - compacting overhead
is too high, they don't scale with real world usage (and you still
have to give up on posix -ENOSPC, not that that's any real loss).

Even in distributed land, filesystems may get away without a
traditional precise fsck, but they get away with it by leaving the
heavy lifting (precise allocation information) to something like a
traditional local filesystem that does have it. And they still need,
at a minimum, a global GC operation - but GC is just a toy version of
fsck to the filesystem developer; it'll have the same algorithmic
complexity as traditional fsck, just without having to be precise.

(Incidentally, the main check allocations fsck pass in bcachefs is
directly descended from the runtime GC code in bcache, even if it's
barely recognizable now.)

And a filesystem needs to be able to cope with extreme damage to be
considered fit for purpose - we need to degrade gracefully if there's
corruption, not tell the user "oops, your filesystem is inaccessible"
if something got scribbled over.

I consider it flatly unacceptable to not be able to recover a
filesystem if there's data on it. If you blew away the superblock and
all the backup superblocks by running mkfs, /that's/ pretty much
unrecoverable because there's too much in the superblock we really
need, but literally anything else we should be able to recover from -
and automatically is the goal.

So there's a lot of interesting challenges in fsck:

- Scaling: fsck is pretty much the limiting factor on filesystem
  scalability. If it wasn't for fsck, bcachefs would probably scale up
  to an exabyte fairly trivially. Making fsck scale to exabyte range
  filesystems is going to take a _lot_ of clever sharding and clever
  algorithms.

- Continuing to run gracefully in the presence of damage wherever
  possible, instead of forcing fsck to be run right away. If
  allocation info is corrupt in the wrong ways such that we might
  double allocate, that's a problem, and interior btree nodes being
  toast requires expensive repair, but we should be able to keep
  running with most other types of corruption.

  That hasn't been the priority while in development - in development
  we want to fail fast and noisily, so that bugs get reported and the
  filesystem is left in a state where we can see what happened - but
  this is an area I'm starting to work on now.

- Making sure that fsck never makes things worse.

  You really don't want fsck to ever delete anything; that could be
  absolutely tragic in the event of any sort of transient error (or
  bug).
  We've still got a bit of work to do here with pointers to indirect
  extents, and there are probably other cases that need to be looked
  at - I think XFS is ahead of bcachefs here. I know Darrick has a
  notion of "tainted" metadata, the idea being that if a pointer to an
  indirect extent points to a missing extent, we don't delete it, we
  just flag it as tainted: don't log more fsck errors, just return
  -EIO when reading from it; then, if we're able to recover the
  indirect extent later, we can just clear the tainted flag. (There's
  a rough sketch of the idea at the end of this mail.)

We've got some fun tricks for getting back online as quickly as
possible even in the event of catastrophic damage. If alloc info is
suspect, we can do a very quick pass that walks all pointers and just
marks a "bucket is currently allocated, don't use" bitmap, and defer
repairing or rebuilding the actual alloc info until we're online, in
the background. And if interior btree nodes are toast and we need to
scan (which shouldn't ever happen, but users are users and hardware is
hardware, and I haven't done btrfs dup style replication because you
can't trust SSDs to lay writes out on different erase units), there's
a bitmap in the superblock of ranges that have btree nodes, so the
scan pass on a modern filesystem shouldn't take too long.
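
To make the "tainted, not deleted" idea above concrete, here's a
minimal userspace sketch of the pattern - to be clear, this is not
actual bcachefs or XFS code, and every name in it (struct extent_ptr,
PTR_TAINTED, check_extent_ptr, ...) is made up for illustration: fsck
flags the pointer instead of deleting anything, the read path quietly
returns -EIO, and the flag gets cleared if the missing indirect extent
ever comes back:

/*
 * Toy sketch of "flag it as tainted, don't delete it" - not real
 * bcachefs or XFS code, all names invented for illustration.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define PTR_TAINTED (1U << 0)   /* points to a missing indirect extent */

struct extent_ptr {
        unsigned long long idx;         /* index of the indirect extent */
        unsigned           flags;
};

/* Stand-in for "does the indirect extent exist" - in a real filesystem
 * this would be a btree lookup. */
static bool indirect_extent_exists(unsigned long long idx)
{
        return idx != 42;       /* pretend extent 42 got lost */
}

/* fsck: instead of deleting the pointer, mark it and move on. */
static void check_extent_ptr(struct extent_ptr *ptr)
{
        if (!indirect_extent_exists(ptr->idx)) {
                if (!(ptr->flags & PTR_TAINTED))
                        fprintf(stderr, "missing indirect extent %llu, marking pointer tainted\n",
                                ptr->idx);
                ptr->flags |= PTR_TAINTED;
        } else {
                /* extent came back (recovered later): clear the taint */
                ptr->flags &= ~PTR_TAINTED;
        }
}

/* read path: tainted pointers return -EIO quietly, no more fsck noise */
static int read_extent(const struct extent_ptr *ptr)
{
        if (ptr->flags & PTR_TAINTED)
                return -EIO;
        /* ... actually read the data ... */
        return 0;
}

int main(void)
{
        struct extent_ptr p = { .idx = 42 };

        check_extent_ptr(&p);
        printf("read: %d\n", read_extent(&p));  /* -EIO: flagged, not deleted */
        return 0;
}

The point of the flag is that nothing is thrown away: the moment the
indirect extent is recovered, the taint can be cleared and the data is
right back where it was.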
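
The quick "bucket is currently allocated, don't use" pass is simple
enough to sketch too. Again, this is a toy with made-up names, and a
flat array of bucket numbers standing in for walking the real btrees:
the only invariant we need in order to get writable again is "don't
double allocate", so one cheap walk that sets a bit per referenced
bucket is enough, and rebuilding the precise alloc info can happen
later, in the background:

/*
 * Toy sketch of the "mark buckets in use, repair alloc info later"
 * trick - made-up names, not real bcachefs code.
 */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_BUCKETS      1024
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)

static unsigned long bucket_in_use[NR_BUCKETS / BITS_PER_WORD + 1];

static void mark_bucket(size_t b)
{
        bucket_in_use[b / BITS_PER_WORD] |= 1UL << (b % BITS_PER_WORD);
}

static bool bucket_is_marked(size_t b)
{
        return bucket_in_use[b / BITS_PER_WORD] & (1UL << (b % BITS_PER_WORD));
}

/* Quick pass: every pointer marks the bucket it lives in. */
static void mark_buckets_from_ptrs(const size_t *ptr_buckets, size_t nr)
{
        for (size_t i = 0; i < nr; i++)
                mark_bucket(ptr_buckets[i]);
}

/* The allocator only needs "don't double allocate" to be safe, so
 * skipping every marked bucket is enough to run while the precise
 * alloc info is rebuilt in the background. */
static long alloc_bucket(void)
{
        for (size_t b = 0; b < NR_BUCKETS; b++)
                if (!bucket_is_marked(b)) {
                        mark_bucket(b);
                        return (long) b;
                }
        return -1;      /* -ENOSPC in real life */
}

int main(void)
{
        size_t in_use[] = { 0, 1, 2, 7 };

        mark_buckets_from_ptrs(in_use, sizeof(in_use) / sizeof(in_use[0]));
        printf("first free bucket: %ld\n", alloc_bucket());    /* 3 */
        return 0;
}

The real pass obviously has to walk btrees, deal with multiple devices
and so on - but algorithmically it really is just "set a bit for every
bucket a pointer lives in, and only allocate from unset buckets until
proper repair has run".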