Time to try and summarize everything that's been going on in bcachefs land since my last lkml posting, and my thoughts on what's next.

Core btree improvements:

- Updates to interior btree nodes are now journalled.

- We're now updating parent btree node pointers on every btree write. This was a pretty major improvement - it means we can now always detect lost btree writes, which was a hole in encrypted mode and also turned out to be a robustness issue in RAID mode. It also means we can start to drop the journal sequence number blacklist mechanism, and it closed some rare corner case issues. And thanks to the previous item, it didn't cost us any performance.

- We no longer have to mark every journal write as flush/fua - stole this idea from XFS, it was a pretty nice performance improvement.

- Lots of btree locking improvements: notably, we now have assertions that we never hold btree locks while doing IO. This is really good for tail latency.

- The transaction model is steadily improving and gaining more and more assertions; this makes it easier to write upper level FS code without worrying about locking considerations. We've started requiring every btree transaction to start with bch2_trans_begin(), and in particular there are asserts that this is the next thing called after a transaction restart. Catching random little bugs with new assertions is a good feeling.

- The btree iterator code has now been split up into btree_iter and btree_path; btree_path implements the "path to a particular position in the btree" code, and btree_iter sits on top of that and implements iteration over keys, iteration over slots, iteration over extents, iteration for snapshots (that's a whole thing), and more. This refactoring came about during the work for snapshots, and it turned out really nicely.

Recovery:

- All alloc info is now updated fully transactionally.
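Fully transactional updates like these all go through the retry discipline the new assertions enforce: call bch2_trans_begin() first, redo the work on transaction restart. Here's a toy model of that pattern - the struct and function names are simplified stand-ins for illustration, not the real bcachefs API, and a plain counter stands in for contention from other tasks:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy transaction: just enough state to model the restart rule. */
struct btree_trans {
	bool began;	/* set by trans_begin(), cleared on restart */
	int  restarts;	/* how many times we had to retry */
};

enum trans_ret { TRANS_OK, TRANS_RESTART };

static void trans_begin(struct btree_trans *trans)
{
	trans->began = true;
}

/* Pretend update: restarts until simulated contention drains away. */
static enum trans_ret trans_update(struct btree_trans *trans, int *contention)
{
	/*
	 * The kind of assert described above: any code running after a
	 * restart without calling trans_begin() again trips this.
	 */
	assert(trans->began);

	if (*contention > 0) {
		(*contention)--;
		trans->began = false;	/* a restart invalidates the transaction */
		trans->restarts++;
		return TRANS_RESTART;
	}
	return TRANS_OK;
}

/* The canonical retry loop upper level FS code would use. */
static int do_update(int contention)
{
	struct btree_trans trans = { 0 };
	enum trans_ret ret;

	do {
		trans_begin(&trans);	/* required first call, every iteration */
		ret = trans_update(&trans, &contention);
	} while (ret == TRANS_RESTART);

	return trans.restarts;
}
```

The point of the loop shape is that upper level code never has to reason about *why* a restart happened - it just begins again, and the assertion catches anyone who forgets.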
Originally we'd have to regenerate alloc info on every mount, then only after every unclean shutdown - then for a long time we only had to regenerate alloc info for metadata after unclean shutdown. With updates to interior btree nodes now fully journalled, updates to alloc info are fully transactional and our mount times fast. Currently we still have to read all alloc info into memory on mount, but that too will be changing.

Features:

- Reflink: I believe all the bugs have finally been shaken out. The last bug to be found was a refcount leak when an existing indirect extent got fragmented (by copygc/rebalance) and a reflink pointer only pointed to part of it.

- Erasure coding: we're still popping some silly assertions; it's on my todo list.

- Encryption: people keep wanting AES support, so at some point I'll try and find the time to add AES/GCM.

- SNAPSHOTS ARE DONE (mostly), and they're badass. I've successfully gotten up to a million snapshots (only changing a single file in each snapshot) in a VM. They scale. Fsck scales. Take as many snapshots as you want. Go wild.

Still todo:

- need to export a different st_dev for each subvolume, like btrfs, so that find -xdev does what you want and skips snapshots

- we would like better atomicity w.r.t. the pagecache on snapshot creation, and it'd be nice if we didn't have to do a big sync when creating a snapshot - we could do this by getting the subvolume's current snapshot ID at buffered write time, but there are other things that make this hard

- we need per-snapshot-ID disk space accounting. This is going to have to wait for a giant disk space accounting rework, though, which will move disk space accounting out of the journal and into a dedicated btree.

- the userspace interface is very minimal - e.g. we still need to implement recursive snapshotting.

- quota support is currently disabled because of interactions with snapshots; re-enabling it is high on my todo list.
- the btree key cache is currently disabled for inodes, also because of interactions with snapshots; this is a performance regression until we get it solved.

About a year of my life went into snapshots and I'm _really_ proud of how they turned out - in terms of algorithmic complexity, snapshots were the biggest single feature I've tackled, and when I started there were a lot of big unknowns that I honestly wasn't sure I was going to find solutions for. I'm still waiting on more people to start really testing them and banging on them (and we do still need more tests written), but so far shaking things out has gone really smoothly (more smoothly than erasure coding, that's for sure!)

FUTURE WORK:

I'm going to start really getting on people for review and working on upstreaming this beast. I intend for it to be marked EXPERIMENTAL for a while, naturally - there are still on disk format changes coming that will be forced upgrades. But getting snapshots done was the big goal I'd set for myself, so it's time.

Besides that, my next big focus is going to be on scalability. bcachefs was hitting 50 TB volumes even before it was called bcachefs - I fully intend for it to scale to 50 PB. To get there, we need to:

- Get rid of the in-memory bucket array. We're mostly there - all allocation information lives in the btree - but we need to make more improvements to the btree representation before we can ditch the pure in-memory representation.

- Add new persistent data structures for the allocator, so that the allocator doesn't have to scan buckets. First up will be implementing a persistent LRU, then probably a free space btree.

- Add a backpointers btree, so that copygc doesn't have to scan the extents/reflink btrees.

- Online fsck. This will come in stages: first, there's the filesystem level fsck code in fs/bcachefs/fsck.c.
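To make the persistent LRU idea above concrete: the allocator wants "which buckets were used least recently" without walking every bucket, so the index is keyed by last-used time and read oldest-first. A sketch of that shape, where a small sorted array stands in for the real btree and all names are illustrative rather than the actual on-disk format:

```c
#include <assert.h>
#include <string.h>

/* One LRU entry: the timestamp is the sort key, the bucket the payload. */
struct lru_entry {
	unsigned long	time;	/* last-used timestamp */
	unsigned	bucket;	/* bucket this entry points at */
};

#define LRU_MAX 64

struct lru {
	struct lru_entry e[LRU_MAX];
	unsigned nr;
};

/* Insert keeping entries sorted by time, oldest first. */
static void lru_set(struct lru *lru, unsigned bucket, unsigned long time)
{
	unsigned i = 0;

	assert(lru->nr < LRU_MAX);
	while (i < lru->nr && lru->e[i].time <= time)
		i++;
	memmove(&lru->e[i + 1], &lru->e[i],
		(lru->nr - i) * sizeof(lru->e[0]));
	lru->e[i] = (struct lru_entry) { .time = time, .bucket = bucket };
	lru->nr++;
}

/*
 * The allocator's view: pop the least recently used bucket. O(nr) in
 * this toy, but a lookup at the low end of the keyspace in a btree -
 * the whole point is never scanning the full set of buckets.
 */
static int lru_pop(struct lru *lru, unsigned *bucket)
{
	if (!lru->nr)
		return -1;
	*bucket = lru->e[0].bucket;
	lru->nr--;
	memmove(&lru->e[0], &lru->e[1], lru->nr * sizeof(lru->e[0]));
	return 0;
}
```

Keying by time rather than by bucket is the design choice that matters here: reclaim candidates cluster at one end of the key space, so finding them is a cheap ordered lookup instead of a scan.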
The recent work improving the btree transaction layer and adding assertions there has been forcing the fsck code to change to be more rigorously correct (in the context of running concurrently with other filesystem operations); a lot of that code is most of the way there now. We'll need additional locking vs. other filesystem code for the directory structure and inode nlinks passes, but shouldn't for the rest of the passes.

After fsck.c is running concurrently, it'll be time to bring back concurrent btree gc, which regenerates alloc info. Woohoo.

-------------

End brain dump, thank you kindly for reading :)