[bcachefs] self healing design doc

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sat, 21 Sep 2024 19:02:53 -0400

So, I'm sketching out self healing for bcachefs - that is, repairing
errors automatically, instead of requiring the user to run fsck
manually.

This can be divided up into two different categories, or strategies:

 - Repairing errors/damage as they are noticed in normal operation: e.g.
   follow a backpointer to an extent, notice that there is no extent,
   and then simply delete the backpointer and continue

 - Flagging in the superblock that either there's an unfixed error, or
   that a fsck pass is required, and then running it automatically
   either on next mount or scheduling it for some later time
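
A rough sketch of how the two strategies differ, using a toy model
rather than real bcachefs code. Every name here (toy_sb,
toy_follow_backpointer, the can_repair_inline switch) is hypothetical
and only illustrates the control flow: a dangling backpointer is either
deleted on the spot, or the problem is recorded so that a later fsck
pass picks it up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types exist in bcachefs. */
struct toy_extent      { bool present; };
struct toy_backpointer { bool present; size_t extent_idx; };

struct toy_sb {
	bool fsck_required;	/* strategy 2: deferred repair flag */
};

enum toy_repair { TOY_OK, TOY_REPAIRED_INLINE, TOY_FSCK_SCHEDULED };

/*
 * Follow a backpointer during normal operation. If the extent it points
 * to is missing, either delete the backpointer right away (strategy 1)
 * or flag the superblock so fsck runs later (strategy 2).
 */
static enum toy_repair toy_follow_backpointer(struct toy_sb *sb,
				struct toy_backpointer *bp,
				const struct toy_extent *extents,
				size_t nr_extents,
				bool can_repair_inline)
{
	assert(sb && bp);

	if (!bp->present)
		return TOY_OK;

	if (bp->extent_idx < nr_extents && extents[bp->extent_idx].present)
		return TOY_OK;

	if (can_repair_inline) {
		bp->present = false;	/* drop the dangling backpointer */
		return TOY_REPAIRED_INLINE;
	}

	sb->fsck_required = true;	/* repair on next mount or later */
	return TOY_FSCK_SCHEDULED;
}
```

In the toy model the caller chooses via can_repair_inline; in a real
implementation that decision would presumably depend on the error type.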

The first is going to be a big focus in the future, as on larger
filesystems we _really_ want to avoid running full fsck passes unless
absolutely required.

For now though, getting the second mode implemented is higher priority;
we need that so that users aren't having to jump through hoops in order
to get their filesystem working if their root filesystem encounters
corruption - i.e. this is needed before we can take the EXPERIMENTAL
label off.

(I recently had to dig out my nixos recovery usb stick to recover from
the bug where online fsck was deleting inodes that were unlinked but
still in use - whoops, don't want normal users to have to do that).

Background, things we already have:

- Recovery passes are enumerated, with stable identifiers. This is used
  for upgrades and downgrades: upgrades and downgrades may specify
  recovery passes to run and errors to silently fix, and those are
  listed in the superblock until complete - in case of an interrupted
  upgrade/downgrade.

- fsck errors are also enumerated. This is currently used by the
  superblock 'errors' section, which lists the count and date of last
  occurrence of every error type the filesystem has ever seen. This
  section is purely informational (it's highly useful in bug reports) -
  it doesn't (yet?) have fields for whether a given error type has
  unfixed errors.

Todo items:

- Convert 'bch2_fs_inconsistent()' calls to fsck_err() calls.

  bch2_fs_inconsistent() just goes emergency read-only (or panics, or
  does nothing, according to options). fsck_err() logs the error (by
  type) in the superblock, and returns true, false, or an error
  depending on whether we should fix the error, just continue, or shut
  down.

  One of the goals here is that any time there's a serious error that
  causes us to go emergency read-only/offline or needs repair, it
  should be logged in the superblock.

  I'm also hoping to get an opt-in telemetry tool written to upload
  superblocks once a week (a bit like debian popcon); since many users
  don't report bugs if they can work around them, this will give us some
  valuable info on how buggy or not buggy bcachefs is in the wild, and
  where to hunt for bugs.

- Add a field to BCH_SB_ERRS() in sb-errors_format.h for which recovery
  pass(es?) are required to fix each error.
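
One way the extra field could look, sketched as a standalone x-macro
table. This assumes an x(name, id, passes) layout; the TOY_ names, the
pass list and the error names are all made up for illustration and are
not the contents of sb-errors_format.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical recovery passes; the real list and numbering differ. */
enum toy_pass {
	TOY_PASS_check_extents		= 0,
	TOY_PASS_check_backpointers	= 1,
	TOY_PASS_check_inodes		= 2,
	TOY_PASS_NR
};

static_assert(TOY_PASS_NR <= 64, "pass mask must fit in 64 bits");

#define TOY_PASS_BIT(p)	(UINT64_C(1) << TOY_PASS_##p)

/* x(error name, stable id, mask of passes that repair it) */
#define TOY_SB_ERRS()							\
	x(dangling_backpointer,	0, TOY_PASS_BIT(check_backpointers))	\
	x(unlinked_inode_in_use,	1, TOY_PASS_BIT(check_inodes))	\
	x(extent_without_inode,	2, TOY_PASS_BIT(check_extents) |	\
				   TOY_PASS_BIT(check_inodes))

enum toy_sb_err {
#define x(name, id, passes)	TOY_ERR_##name = id,
	TOY_SB_ERRS()
#undef x
	TOY_ERR_NR
};

/* Static error -> passes table generated from the same list. */
static const uint64_t toy_err_passes[TOY_ERR_NR] = {
#define x(name, id, passes)	[TOY_ERR_##name] = passes,
	TOY_SB_ERRS()
#undef x
};

/* Mask of passes needed to repair err, or 0 for an unknown id. */
static uint64_t toy_passes_for_err(enum toy_sb_err err)
{
	return err < TOY_ERR_NR ? toy_err_passes[err] : 0;
}
```

Keeping the pass mask in the same x-macro as the error id means the
mapping cannot drift out of sync with the enum, which is what a static
error -> pass mapping needs.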

New superblock fields for self healing:

The existing sb_ext.recovery_passes_required field that is used for
upgrades/downgrades probably isn't what we want here - some errors need
to be fixed right away, for others we just want to schedule fsck for
some point in the future.

Q: What determines which errors need to be fixed right away, and should
this get a bit in the superblock? Or is it static per-error-type?

Not sure on this one yet.

Q: In the superblock, should we be listing
 A: unfixed errors, or
 B: recovery passes we need to run (immediately or when scheduled), or
 C: perhaps both

I think we'll be going with option A, which means we can just add a bit
or two to the sb_errors superblock section - this works provided the
sb_err -> recovery pass mapping is static, and I believe the sb_err
enum is fine grained enough that it is.

Once I've added the 'recovery passes to repair' field to BCH_SB_ERRS()
I'll have a better feeling on this.
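
To make option A concrete, here is a minimal sketch of deriving the
passes to run from per-error "unfixed" bits plus a static error -> pass
table. The struct layout, the fix_now bit and all toy_ names are
assumptions for illustration, not the sb_errors on-disk format; whether
"fix right away" is a superblock bit or static per error type is still
the open question above, and this sketch arbitrarily picks a bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one sb_errors entry. */
struct toy_sb_err_entry {
	uint16_t id;		/* stable error id */
	uint64_t nr;		/* how many times this error was seen */
	uint64_t last_time;	/* when it was last seen */
	bool	 unfixed;	/* proposed bit: seen but not yet repaired */
	bool	 fix_now;	/* possible second bit: repair on next mount */
};

struct toy_passes {
	uint64_t immediate;	/* passes to run on next mount */
	uint64_t scheduled;	/* passes that can wait for a scheduled fsck */
};

/*
 * Walk the error entries and OR together the passes needed for every
 * error still marked unfixed. err_to_passes[] is the static mapping,
 * indexed by error id.
 */
static struct toy_passes toy_passes_required(const struct toy_sb_err_entry *errs,
					     size_t nr_errs,
					     const uint64_t *err_to_passes,
					     size_t nr_err_types)
{
	struct toy_passes ret = { 0, 0 };

	assert(errs || !nr_errs);

	for (size_t i = 0; i < nr_errs; i++) {
		/* Skip repaired errors and ids this table doesn't know. */
		if (!errs[i].unfixed || errs[i].id >= nr_err_types)
			continue;

		if (errs[i].fix_now)
			ret.immediate |= err_to_passes[errs[i].id];
		else
			ret.scheduled |= err_to_passes[errs[i].id];
	}

	return ret;
}
```

With this shape the superblock only records errors (option A), and the
list of passes (option B) is recomputed from it whenever needed.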

Thoughts? Corner cases?