New allocator
-------------

The old allocator, fundamentally unchanged from bcache, had been
thoroughly outgrown and was in need of a rewrite. It kept allocation
information pinned in memory and it did periodic full scans of that
alloc info to find free buckets, which had become a real scalability
issue. It was also problematic to debug, being based around purely
in-memory state with tricky locking and tricky state transitions.

That's all been completely rewritten and replaced with new algorithms
based on new persistent data structures. The new code is simpler,
vastly more scalable (and we had users with 50 TB filesystems before),
and since all state transitions show up in the journal it's been much
easier to debug.

Backpointers
------------

Backpointers enable doing a lookup from physical device:lba to the
extent that owns that lba. There's a number of reasons filesystems
might want them, but in our context we specifically needed them to
improve copygc scalability. Before, copygc had to periodically walk the
entire extents + reflink btrees; now it just picks the next-most-empty
bucket and moves all the extents it contains.

With backpointers done we're now largely done with scalability work -
excepting online fsck. We also still need to add a rebalance_work btree
to fix rebalance scanning. That's not as serious, since we never depend
on rebalance for allocations to make forward progress, and it'll be a
relatively small chunk of work.

Snapshots largely stabilized
----------------------------

Quotas don't yet work with snapshots, but aside from this all tests are
passing. There's still a few minor bugs we're trying to get reproduced,
but nothing that should affect system availability (exception: snapshot
delete path still sucks, the code to tear down the pagecache should be
improved).

Erasure coding (RAID 5/6) getting close to usable
-------------------------------------------------

The codepath for deciding when to create new stripes or update an
existing, partially-empty stripe still needs to be improved.

Background: bcachefs avoids the write hole problem found in other RAID
implementations (by being COW), and it doesn't fragment writes like ZFS
does; stripes are big (which we want for avoiding fragmentation), and
they're created on demand out of physical buckets on different devices.

Foreground writes are initially replicated, then when we accumulate a
full stripe we write out p+q blocks, update extents in the stripe with
the stripe information, and drop the now unneeded replicas. The buckets
containing the extra replicas will get reused right away, so if we don't
have to send a cache flush before they're overwritten with something
else they only cost bus bandwidth.

As data gets overwritten, or moved by copygc, we'll end up with some of
the buckets in a stripe becoming empty. The stripe creation path has the
ability to create a new stripe using the buckets that still contain live
data in an existing stripe. But if the data in an existing stripe is
going to die at some point in the future anyway, it's more efficient in
terms of IO to just create new stripes - OTOH that requires more disk
space. Once this logic is figured out, erasure coding should be pretty
close to ready.

Lock cycle detector
-------------------

We'd outgrown the old deadlock avoidance strategy: before blocking on a
lock we'd check for a lock ordering violation, and if necessary issue a
transaction restart which would re-traverse iterators in the correct
order. The lock ordering rules had become too complicated and this was
getting us too many transaction restarts, so I stole the standard
technique from databases, which is now working beautifully - transaction
restarts due to deadlock are a tiny fraction of what they were before.

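For readers unfamiliar with the database technique: it's usually framed
as cycle detection on a "waits-for" graph, and a transaction only gets
restarted when blocking would actually close a cycle. Below is a minimal
Python sketch of that idea. It is an illustration only, not the bcachefs
code: the function name, the dict representation, and the assumption
that each waiter is blocked on exactly one holder are all hypothetical
simplifications.

```python
from typing import Dict, Hashable, List, Optional


def find_lock_cycle(waits_for: Dict[Hashable, Hashable],
                    start: Hashable) -> Optional[List[Hashable]]:
    """Return the cycle of waiters that `start` belongs to, or None.

    `waits_for` maps each blocked transaction to the transaction that
    holds the lock it wants (hypothetical simplification: one blocker
    per waiter, so the graph is a set of chains).
    """
    path = [start]
    seen = {start}
    cur = start
    while cur in waits_for:
        cur = waits_for[cur]
        if cur == start:
            return path        # walked back to ourselves: deadlock
        if cur in seen:
            return None        # a cycle exists, but start isn't in it
        seen.add(cur)
        path.append(cur)
    return None                # chain ends at a running transaction


# A waits on B, B waits on C; now C is about to block on A.
waits_for = {"A": "B", "B": "C"}
waits_for["C"] = "A"
print(find_lock_cycle(waits_for, "C"))   # ['C', 'A', 'B']
```

In this sketch only one member of the returned cycle would need to be
restarted; a lock ordering violation that doesn't close a cycle costs
nothing.
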
Performance work
----------------

We now have _lots_ of persistent counters, including for transaction
restarts (of every type) and every slowpath. The automated tests now
check these counters at the end of every test, and fail the test if any
of them were too high.

And, thanks to a lot of miscellaneous performance work, our 4k O_DIRECT
random write performance is now > 50% better than it was a few months
ago.

Automated test infrastructure!
------------------------------

I finally built the CI I always wanted: you point it at a git branch and
it runs tests and gives you the results in a git log view. See it in
action here:

https://evilpiepirate.org/~testdashboard/ci?log=bcachefs/master

Since then I've been putting a ton of time into grinding through bugs,
and the new test dashboard has been amazing for tracking down
regressions.

There's not much more work planned prior to upstreaming: on disk format
changes have slowed down considerably. I just revved the on disk format
version to introduce a new inode format (which doesn't varint encode
i_sectors or i_size, which makes the data write path faster), and I'm
going to _attempt_ to expand the u64s field of struct bkey from one byte
to two, but other than that - nothing big expected for a while.

I also just gave a talk to RH staff - lots of good stuff in there:

https://bcachefs.org/bcachefs_talk_2022_10.mpv

Cheers, and thanks for reading
Kent