On Sun, Oct 20, 2024 at 01:54:17PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > Latency for journal replay?
>
> No, latency for the journaling itself.
>
> You're the one who claimed that a 2G cap on just the *index* to the
> journal would be an "artificial cap on performance" when I suggested
> just limiting the amount of memory you use on the journaling.

That's the same as limiting the amount of dirty metadata. No filesystem
wants an artificial cap on dirty metadata, because as long as it's dirty
you can buffer up more updates - but for bcachefs it's a bit different.

Btree nodes are log structured: this was a dumb idea that turned out to
be utterly brilliant, because it enabled eytzinger search trees, and
because pure b-trees and pure compacting data structures (the leveldb
lineage) both have weaknesses.

Pure b-trees (with a more typical 4k node size) give you no opportunity
to compact random updates, so writing them out to disk is inefficient.
The compacting data structures solve that at the cost of terrible
multithreaded update performance, and in practice the resort overhead
also becomes quite costly.

So, log structured btree nodes mean we can _reasonably_ efficiently
write out just the dirty 4k/8k/whatever out of a 256k btree node - i.e.
we can spray random updates across an enormous btree and serialize them
with decent efficiency - and because the compacting only happens within
a btree node it doesn't destroy multithreaded update performance, the
resorts are much cheaper (they fit in cache), and we get to use
eytzinger search trees for most lookups (rough sketch below, for anyone
unfamiliar with the layout).

But: it's still much better if we can do btree node writes that are as
big as possible, for all the obvious reasons - and journal size is a
major factor in real world performance. In all the performance testing
I do where we're filling a filesystem with mostly metadata, the first
thing the tests do is increase the journal size...

> Other filesystems happily limit the amount of dirty data because of
> latency concerns. And yes, it shows in benchmarks, where the
> difference between having huge amounts of data pending in memory and
> actually writing it back in a timely manner can be a noticeable
> performance penalty.

Well, ext3 historically had major architectural issues with journal
latency - I never really studied that, but from what I gather there
were lots of issues around dependent write ordering.

bcachefs doesn't have any of that - broadly speaking, dirty stuff can
just be written out, and e.g. memory reclaim might flush dirty btree
nodes before the journal does. (The one exception being interior nodes
that have pointers to btree nodes that have just been created but not
yet written.)

The only real latency concern we have with the journal is if it fills
up entirely. That will cause latency spikes, because journal reclaim
has to free up entire buckets at a time. They shouldn't be massive
spikes, because of the lack of dependent write issues (journal reclaim
itself can run full tilt without getting blocked, and it starts running
full tilt once the journal is more than half full), and with the
default journal size I doubt it'll happen much in real world usage.
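Since eytzinger search trees came up a couple of times above, here's
the sketch I mentioned, for anyone who hasn't run into the layout
before. This is *not* the bcachefs implementation (that lives in
fs/bcachefs/eytzinger.h), just the textbook version: the sorted keys
are stored in BFS order of a balanced binary tree, so the top of every
search path sits at the front of the array and stays hot in cache, and
the next index is computed with shifts instead of chasing pointers.

/* Generic eytzinger lower_bound sketch - not bcachefs code. */
#include <stddef.h>

/*
 * tree[1..nr] holds sorted keys laid out in BFS order: the children of
 * node i are 2*i and 2*i + 1; tree[0] is unused.  Returns the eytzinger
 * index of the first element >= key, or 0 if there is none.
 */
static size_t eytzinger_lower_bound(const unsigned *tree, size_t nr,
				    unsigned key)
{
	size_t i = 1, best = 0;

	while (i <= nr) {
		if (tree[i] >= key) {
			best = i;	/* candidate; keep looking left */
			i = 2 * i;
		} else {
			i = 2 * i + 1;	/* everything left of here is too small */
		}
	}

	return best;
}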
Journal-full latency is something people looking at performance should
watch for, though, and we've got time stats in sysfs for it:

  /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_low_on_space

Also worth watching is

  /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_max_in_flight

This one tracks when we can't open a new journal entry because we
already have the max (4) in flight, closing or writing - it'll happen
if something is doing constant fsyncs, forcing us to write much smaller
journal entries than we'd like. That came up in conversation the other
day, and I might do another XFS-like thing to address it if it starts
becoming an issue in real world usage. (There's a quick sketch for
dumping these stats at the end of this mail.)

If latency from the journal filling up starts being an issue for
people, the first thing to do will be to just resize the journal (we
can do that online, and I might make it happen automatically at some
point) - and if for some reason that's not an option, we'll need to add
more intelligent throttling, so that the latencies happen bit by bit
instead of all at once.

But the next thing I need to do in that area has nothing to do with
throughput: a common complaint has been that we trickle out writes when
the system is idle, and that behaviour dates from the days when bcache
was designed for servers optimized for throughput and smoothing out
bursty workloads. Nowadays, a lot of people are running it on their
desktops or laptops, and we need to be able to do a "rush to idle" type
thing. I have a design doc for that which I wrote fully a year ago and
haven't gotten to yet...

https://github.com/koverstreet/bcachefs/commit/98a70eef3d46397a085069531fc503dae20d63fb
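And here's the sketch for anyone who wants to keep an eye on those
journal time stats: it just dumps the two files above for a given
filesystem UUID. Nothing bcachefs-specific beyond the sysfs paths
quoted earlier - the rest is plain boilerplate, and the program itself
is made up for illustration.

#include <stdio.h>

/* Dump one time_stats file for the given filesystem UUID. */
static void dump_stat(const char *uuid, const char *stat)
{
	char path[256], line[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/bcachefs/%s/time_stats/%s", uuid, stat);

	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return;
	}

	printf("== %s ==\n", stat);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <filesystem-uuid>\n", argv[0]);
		return 1;
	}

	dump_stat(argv[1], "blocked_journal_low_on_space");
	dump_stat(argv[1], "blocked_journal_max_in_flight");
	return 0;
}

Build it with any C compiler and pass the UUID that shows up as the
directory name under /sys/fs/bcachefs/.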