bcachefs update: New allocator has been merged

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 13 Mar 2022 20:45:13 -0400

Just finished a big update: the allocator rewrite is done and merged.

It's a mandatory disk format upgrade: the first time you mount an existing
filesystem with the new version, you'll see it initialize the freespace btree.

What's changed: we've got some new persistent data structures that replace code
that used to periodically walk all the buckets in the filesystem, which were
kept in an in-memory array - and now that we don't need to do that anymore, the
in-memory bucket array is gone, too. Specifically, we've got:

 - A new hash table for buckets awaiting journal commit before they can be
   reused, using cuckoo hashing (this one was rolled out a while ago)

 - An extents-style freespace btree, to replace the code in the old allocator
   threads that periodically walked the arrays of buckets to build up freelists

 - A btree of buckets that need discarding before being moved to the freespace
   btree

 - A new LRU btree, for buckets containing cached data - replacing code in the
   allocator threads that would scan buckets and build up a heap of buckets to
   be reused.
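
For anyone who hasn't run into cuckoo hashing before, here's a rough sketch of
the idea in Python. This is a toy, not the bcachefs code: the class and method
names are made up, and I'm assuming the table maps a bucket number to the
journal sequence number it's waiting on. Each key has two candidate slots; if
the slot is taken, the occupant gets kicked out to its other slot.

```python
class CuckooTable:
    """Toy cuckoo hash table: every key lives in one of two candidate slots."""

    MAX_KICKS = 32  # after this many displacements, grow the table and rehash

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # each slot is None or a (key, value) pair

    def _positions(self, key):
        # Two different hash functions, derived here by salting Python's hash().
        return hash(("a", key)) % self.size, hash(("b", key)) % self.size

    def lookup(self, key):
        # At most two probes, no matter how full the table is.
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                self.slots[pos] = None
                return True
        return False

    def insert(self, key, value):
        # Update in place if the key is already present.
        for pos in self._positions(key):
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                self.slots[pos] = (key, value)
                return
        entry = (key, value)
        pos = self._positions(key)[0]
        for _ in range(self.MAX_KICKS):
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return
            # Slot is taken: put our entry there and evict the occupant,
            # which then moves to its alternate slot.
            entry, self.slots[pos] = self.slots[pos], entry
            p1, p2 = self._positions(entry[0])
            pos = p2 if pos == p1 else p1
        self._grow(entry)

    def _grow(self, pending):
        # Double the table and reinsert everything, including the entry
        # that was still homeless when we gave up kicking.
        old = [e for e in self.slots if e is not None] + [pending]
        self.size *= 2
        self.slots = [None] * self.size
        for k, v in old:
            self.insert(k, v)


# Hypothetical usage: bucket number -> journal sequence number to wait for.
waiting = CuckooTable()
waiting.insert(42, 1001)
print(waiting.lookup(42))
```

The attraction for this kind of table is that lookups are bounded at two
probes, which matters when the check sits in the allocation path.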

The old allocator threads are completely gone - and the code that replaces them
is all transactional btree code, much of it trigger based, that's _way_ easier
to debug and reason about. This fixes weird performance corner cases and
scalability issues - in particular, the allocator threads were prone to using
excessive CPU when the filesystem was nearly full. Also, we've got a new and
much improved discard implementation! Previously, we'd only issue discards
shortly prior to reusing/writing to a bucket again - now, we'll issue discards
right after buckets become empty.
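
To give a feel for what "trigger based" buys you, here's a small Python model
of the bucket lifecycle described above. It's a conceptual sketch only: the
state names, the Allocator class and the use of plain sets in place of the
need-discard and freespace btrees are all my simplifications, not the real
on-disk format or code. The point is that every state change goes through one
hook that updates the indexes, so nothing ever has to scan all the buckets.

```python
from enum import Enum


class State(Enum):
    DIRTY = "dirty"                # bucket holds live data
    NEED_DISCARD = "need_discard"  # bucket is empty, discard not issued yet
    FREE = "free"                  # discarded and available for allocation


class Allocator:
    def __init__(self, nr_buckets):
        self.state = {b: State.FREE for b in range(nr_buckets)}
        # Stand-ins for the need-discard and freespace btrees.
        self.need_discard = set()
        self.freespace = set(range(nr_buckets))

    def _trigger(self, bucket, old, new):
        # Runs on every state change, so the indexes can't drift out of
        # sync with the per-bucket state.
        if old is State.NEED_DISCARD:
            self.need_discard.discard(bucket)
        if old is State.FREE:
            self.freespace.discard(bucket)
        if new is State.NEED_DISCARD:
            self.need_discard.add(bucket)
        if new is State.FREE:
            self.freespace.add(bucket)

    def _set_state(self, bucket, new):
        old = self.state[bucket]
        self.state[bucket] = new
        self._trigger(bucket, old, new)

    def allocate(self):
        # Pick straight from the index; raises ValueError if nothing is free.
        bucket = min(self.freespace)
        self._set_state(bucket, State.DIRTY)
        return bucket

    def bucket_emptied(self, bucket):
        # An empty bucket is queued for discard right away.
        self._set_state(bucket, State.NEED_DISCARD)

    def do_discards(self, issue_discard):
        # Discard everything that's queued, then hand it back to freespace.
        for bucket in sorted(self.need_discard):
            issue_discard(bucket)
            self._set_state(bucket, State.FREE)


alloc = Allocator(4)
b = alloc.allocate()
alloc.bucket_emptied(b)
alloc.do_discards(lambda bucket: print("discard bucket", bucket))
```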

Exciting stuff - this was the biggest and most invasive change in quite a while,
and I'm pretty happy with how it turned out.

Next big change is going to be the addition of backpointers to fix copygc
scanning, and a rebalance-work btree to fix rebalance thread scanning, and then
we'll be pretty much set for major scalability work.

Other recent changes/improvements: a lot of assorted debuggability improvements.

 - list_journal improvements: now, when going emergency read-only, we finish
   writing everything we have pending to the journal - we just mark them as
   noflush writes, so they'll never be used by recovery, but list_journal can
   still see them. This means when we detect an inconsistency, we can see all
   the updates leading up to it in the journal (along with what transactions
   were doing them), making it much easier to work backwards to what went wrong.

   We've been doing a lot of debugging lately with just list_journal and grep -
   yay for grep debugging!

 - A bunch of printbuf and to_text() method improvements, which make it easy to
   write good log messages when something goes wrong

 - Started moving some internal state used for debugging from sysfs to debugfs,
   where we can be much more verbose (yay for grep debugging!)

 - Fixed some snapshot bugs - figured out a major cause of the transaction path
   overflow bugs we've been seeing.
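
As a rough illustration of the list_journal change in the first item, here's a
simplified Python model. Everything in it is an assumption for the sake of the
example: the JournalEntry fields, the function names and the sample entries are
invented, and the real recovery rules are more involved than a single flag
check. It only shows the split between what recovery replays and what a
listing tool can still show you.

```python
from dataclasses import dataclass


@dataclass
class JournalEntry:
    seq: int
    transaction: str       # which transaction made the update (made-up names)
    update: str            # human-readable description of the update
    noflush: bool = False  # written out, but never to be used by recovery


def entries_for_recovery(journal):
    # Recovery skips anything marked noflush.
    return [e for e in journal if not e.noflush]


def list_journal(journal, pattern=""):
    # A listing tool sees everything; the pattern plays the role of grep.
    return [e for e in journal
            if pattern in e.update or pattern in e.transaction]


journal = [
    JournalEntry(1, "example_write", "extent update inode 10"),
    JournalEntry(2, "example_alloc", "bucket 5 dirty"),
    # Written while going emergency read-only:
    JournalEntry(3, "example_alloc", "bucket 5 free", noflush=True),
]

for e in list_journal(journal, "bucket 5"):
    print(e.seq, e.transaction, e.update, "(noflush)" if e.noflush else "")
```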

And, big thanks to all the people who put up with and test my crappy code and
help with finding all the bugs and beating it into shape :)