Time to try and summarize everything that's been going on in bcachefs land since my last lkml posting, and my thoughts on what's next.

Core btree improvements:

- Updates to interior btree nodes are now journalled.

- We're now updating parent btree node pointers on every btree write. This was a pretty major improvement - it means we can now always detect lost btree writes, which was a hole in encrypted mode and also turned out to be a robustness issue in RAID mode. It also means we can start to drop the journal sequence number blacklist mechanism, and it closed some rare corner case issues. And thanks to the previous item, it didn't cost us any performance.

- We no longer have to mark every journal write as flush/fua - stole this idea from XFS, it was a pretty nice performance improvement.

- Lots of btree locking improvements: notably, we now have assertions that we never hold btree locks while doing IO. This is really good for tail latency.

- The transaction model is steadily improving and gaining more and more assertions; this makes it easier to write upper level FS code without worrying about locking considerations. We've started requiring every btree transaction to start with bch2_trans_begin(), and in particular there are asserts that this is the next thing called after a transaction restart. Catching random little bugs with new assertions is a good feeling.

- The btree iterator code has now been split up into btree_iter and btree_path; btree_path implements the "path to a particular position in the btree" code, and btree_iter sits on top of that and implements iteration over keys, iteration over slots, iteration over extents, iteration for snapshots (that's a whole thing), and more. This refactoring came about during the work for snapshots, and it turned out really nicely.

Recovery:

- All alloc info is now updated fully transactionally.
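Fully transactional updates like these all go through the retry discipline the new assertions enforce: call bch2_trans_begin() first, redo the work on transaction restart. Here's a toy model of that pattern - the struct and function names are simplified stand-ins for illustration, not the real bcachefs API, and a plain counter stands in for contention from other tasks:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy transaction: just enough state to model the restart rule. */
struct btree_trans {
	bool began;	/* set by trans_begin(), cleared on restart */
	int  restarts;	/* how many times we had to retry */
};

enum trans_ret { TRANS_OK, TRANS_RESTART };

static void trans_begin(struct btree_trans *trans)
{
	trans->began = true;
}

/* Pretend update: restarts until simulated contention drains away. */
static enum trans_ret trans_update(struct btree_trans *trans, int *contention)
{
	/*
	 * The kind of assert described above: any code running after a
	 * restart without calling trans_begin() again trips this.
	 */
	assert(trans->began);

	if (*contention > 0) {
		(*contention)--;
		trans->began = false;	/* a restart invalidates the transaction */
		trans->restarts++;
		return TRANS_RESTART;
	}
	return TRANS_OK;
}

/* The canonical retry loop upper level FS code would use. */
static int do_update(int contention)
{
	struct btree_trans trans = { 0 };
	enum trans_ret ret;

	do {
		trans_begin(&trans);	/* required first call, every iteration */
		ret = trans_update(&trans, &contention);
	} while (ret == TRANS_RESTART);

	return trans.restarts;
}
```

The point of the loop shape is that upper level code never has to reason about *why* a restart happened - it just begins again, and the assertion catches anyone who forgets.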
Originally we'd have to regenerate alloc info on every mount, then only after every unclean shutdown - then for a long time we only had to regenerate alloc info for metadata after unclean shutdown. With updates to interior btree nodes now fully journalled, updates to alloc info are fully transactional and our mount times fast. Currently we still have to read all alloc info into memory on mount, but that too will be changing.

Features:

- Reflink: I believe all the bugs have finally been shaken out. The last bug to be found was a refcount leak when an existing indirect extent got fragmented (by copygc/rebalance) and a reflink pointer only pointed to part of it.

- Erasure coding: we're still popping some silly assertions; it's on my todo list.

- Encryption: people keep wanting AES support, so at some point I'll try and find the time to add AES/GCM.

- SNAPSHOTS ARE DONE (mostly), and they're badass. I've successfully gotten up to a million snapshots (only changing a single file in each snapshot) in a VM. They scale. Fsck scales. Take as many snapshots as you want. Go wild.

Still todo:

- need to export a different st_dev for each subvolume, like btrfs, so that find -xdev does what you want and skips snapshots

- we would like better atomicity w.r.t. the pagecache on snapshot creation, and it'd be nice if we didn't have to do a big sync when creating a snapshot - we could do this by getting the subvolume's current snapshot ID at buffered write time, but there are other things that make this hard

- we need per-snapshot-ID disk space accounting. This is going to have to wait for a giant disk space accounting rework, though, which will move disk space accounting out of the journal and into a dedicated btree.

- the userspace interface is very minimal - e.g. we still need to implement recursive snapshotting.

- quota support is currently disabled because of interactions with snapshots; re-enabling it is high on my todo list.
- the btree key cache is currently disabled for inodes, also because of interactions with snapshots; this is a performance regression until we get it solved.

About a year of my life went into snapshots and I'm _really_ proud of how they turned out - in terms of algorithmic complexity, snapshots were the biggest single feature I've tackled, and when I started there were a lot of big unknowns that I honestly wasn't sure I was going to find solutions for. I'm still waiting on more people to start really testing them and banging on them (and we do still need more tests written), but so far shaking things out has gone really smoothly (more smoothly than erasure coding, that's for sure!)

FUTURE WORK:

I'm going to start really getting on people for review and working on upstreaming this beast. I intend for it to be marked EXPERIMENTAL for a while, naturally - there are still on disk format changes coming that will be forced upgrades. But getting snapshots done was the big goal I'd set for myself, so it's time.

Besides that, my next big focus is going to be on scalability. bcachefs was hitting 50 TB volumes even before it was called bcachefs - I fully intend for it to scale to 50 PB. To get there, we need to:

- Get rid of the in-memory bucket array. We're mostly there - all allocation information lives in the btree - but we need to make more improvements to the btree representation before we can ditch the pure in-memory representation.

- Add new persistent data structures for the allocator, so that the allocator doesn't have to scan buckets. First up will be implementing a persistent LRU, then probably a free space btree.

- Add a backpointers btree, so that copygc doesn't have to scan the extents/reflink btrees.

- Online fsck. This will come in stages: first, there's the filesystem level fsck code in fs/bcachefs/fsck.c.
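To make the persistent LRU idea above concrete: the allocator wants "which buckets were used least recently" without walking every bucket, so the index is keyed by last-used time and read oldest-first. A sketch of that shape, where a small sorted array stands in for the real btree and all names are illustrative rather than the actual on-disk format:

```c
#include <assert.h>
#include <string.h>

/* One LRU entry: the timestamp is the sort key, the bucket the payload. */
struct lru_entry {
	unsigned long	time;	/* last-used timestamp */
	unsigned	bucket;	/* bucket this entry points at */
};

#define LRU_MAX 64

struct lru {
	struct lru_entry e[LRU_MAX];
	unsigned nr;
};

/* Insert keeping entries sorted by time, oldest first. */
static void lru_set(struct lru *lru, unsigned bucket, unsigned long time)
{
	unsigned i = 0;

	assert(lru->nr < LRU_MAX);
	while (i < lru->nr && lru->e[i].time <= time)
		i++;
	memmove(&lru->e[i + 1], &lru->e[i],
		(lru->nr - i) * sizeof(lru->e[0]));
	lru->e[i] = (struct lru_entry) { .time = time, .bucket = bucket };
	lru->nr++;
}

/*
 * The allocator's view: pop the least recently used bucket. O(nr) in
 * this toy, but a lookup at the low end of the keyspace in a btree -
 * the whole point is never scanning the full set of buckets.
 */
static int lru_pop(struct lru *lru, unsigned *bucket)
{
	if (!lru->nr)
		return -1;
	*bucket = lru->e[0].bucket;
	lru->nr--;
	memmove(&lru->e[0], &lru->e[1], lru->nr * sizeof(lru->e[0]));
	return 0;
}
```

Keying by time rather than by bucket is the design choice that matters here: reclaim candidates cluster at one end of the key space, so finding them is a cheap ordered lookup instead of a scan.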
The recent work improving the btree transaction layer and adding assertions there has been forcing the fsck code to change to be more rigorously correct (in the context of running concurrently with other filesystem operations); a lot of that code is most of the way there now. We'll need additional locking vs. other filesystem code for the directory structure and inode nlinks passes, but shouldn't for the rest of the passes.

After fsck.c is running concurrently, it'll be time to bring back concurrent btree gc, which regenerates alloc info. Woohoo.

-------------

End brain dump, thank you kindly for reading :)