On Sun, Oct 20, 2024 at 01:54:17PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > Latency for journal replay?
>
> No, latency for the journaling itself.
>
> You're the one who claimed that a 2G cap on just the *index* to the
> journal would be an "artificial cap on performance" when I suggested
> just limiting the amount of memory you use on the journaling.

That's the same as limiting the amount of dirty metadata. No filesystem
wants an artificial cap on dirty metadata, because as long as it's dirty
you can buffer up more updates - but for bcachefs it's a bit different.

Btree nodes are log structured: this was a dumb idea that turned out to
be utterly brilliant, because it enabled eytzinger search trees, and
because pure b-trees and pure compacting data structures (the leveldb
lineage) both have weaknesses.

Pure b-trees (with a more typical 4k node size) give you no opportunity
to compact random updates, so writing them out to disk is inefficient.
The compacting data structures solve that at the cost of terrible
multithreaded update performance, and in practice the resort overhead
also becomes quite costly.

So, log structured btree nodes mean we can _reasonably_ efficiently
write out just the dirty 4k/8k/whatever out of a 256k btree node - i.e.
we can spray random updates across an enormous btree and serialize them
with decent efficiency - and because the compacting only happens within
a btree node it doesn't destroy multithreaded update performance, the
resorts are much cheaper (they fit in cache), and we get to use
eytzinger search trees for most lookups (rough sketch below, for anyone
unfamiliar with the layout).

But: it's still much better if we can do btree node writes that are as
big as possible, for all the obvious reasons - and journal size is a
major factor in real world performance. In all the performance testing
I do where we're filling a filesystem with mostly metadata, the first
thing the tests do is increase the journal size...

> Other filesystems happily limit the amount of dirty data because of
> latency concerns. And yes, it shows in benchmarks, where the
> difference between having huge amounts of data pending in memory and
> actually writing it back in a timely manner can be a noticeable
> performance penalty.

Well, ext3 historically had major architectural issues with journal
latency - I never really studied that, but from what I gather there
were lots of issues around dependent write ordering.

bcachefs doesn't have any of that - broadly speaking, dirty stuff can
just be written out, and e.g. memory reclaim might flush dirty btree
nodes before the journal does. (The one exception being interior nodes
that have pointers to btree nodes that have just been created but not
yet written.)

The only real latency concern we have with the journal is if it fills
up entirely. That will cause latency spikes, because journal reclaim
has to free up entire buckets at a time. They shouldn't be massive
spikes, because of the lack of dependent write issues (journal reclaim
itself can run full tilt without getting blocked, and it starts running
full tilt once the journal is more than half full), and with the
default journal size I doubt it'll happen much in real world usage.
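Since eytzinger search trees came up a couple of times above, here's
the sketch I mentioned, for anyone who hasn't run into the layout
before. This is *not* the bcachefs implementation (that lives in
fs/bcachefs/eytzinger.h), just the textbook version: the sorted keys
are stored in BFS order of a balanced binary tree, so the top of every
search path sits at the front of the array and stays hot in cache, and
the next index is computed with shifts instead of chasing pointers.

/* Generic eytzinger lower_bound sketch - not bcachefs code. */
#include <stddef.h>

/*
 * tree[1..nr] holds sorted keys laid out in BFS order: the children of
 * node i are 2*i and 2*i + 1; tree[0] is unused.  Returns the eytzinger
 * index of the first element >= key, or 0 if there is none.
 */
static size_t eytzinger_lower_bound(const unsigned *tree, size_t nr,
				    unsigned key)
{
	size_t i = 1, best = 0;

	while (i <= nr) {
		if (tree[i] >= key) {
			best = i;	/* candidate; keep looking left */
			i = 2 * i;
		} else {
			i = 2 * i + 1;	/* everything left of here is too small */
		}
	}

	return best;
}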
Journal-full latency is something people looking at performance should
watch for, though, and we've got time stats in sysfs for it:

  /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_low_on_space

Also worth watching is

  /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_max_in_flight

This one tracks when we can't open a new journal entry because we
already have the max (4) in flight, closing or writing - it'll happen
if something is doing constant fsyncs, forcing us to write much smaller
journal entries than we'd like. That came up in conversation the other
day, and I might do another XFS-like thing to address it if it starts
becoming an issue in real world usage. (There's a quick sketch for
dumping these stats at the end of this mail.)

If latency from the journal filling up starts being an issue for
people, the first thing to do will be to just resize the journal (we
can do that online, and I might make it happen automatically at some
point) - and if for some reason that's not an option, we'll need to add
more intelligent throttling, so that the latencies happen bit by bit
instead of all at once.

But the next thing I need to do in that area has nothing to do with
throughput: a common complaint has been that we trickle out writes when
the system is idle, and that behaviour dates from the days when bcache
was designed for servers optimized for throughput and smoothing out
bursty workloads. Nowadays, a lot of people are running it on their
desktops or laptops, and we need to be able to do a "rush to idle" type
thing. I have a design doc for that which I wrote fully a year ago and
haven't gotten to yet...

https://github.com/koverstreet/bcachefs/commit/98a70eef3d46397a085069531fc503dae20d63fb
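And here's the sketch for anyone who wants to keep an eye on those
journal time stats: it just dumps the two files above for a given
filesystem UUID. Nothing bcachefs-specific beyond the sysfs paths
quoted earlier - the rest is plain boilerplate, and the program itself
is made up for illustration.

#include <stdio.h>

/* Dump one time_stats file for the given filesystem UUID. */
static void dump_stat(const char *uuid, const char *stat)
{
	char path[256], line[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/bcachefs/%s/time_stats/%s", uuid, stat);

	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return;
	}

	printf("== %s ==\n", stat);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <filesystem-uuid>\n", argv[0]);
		return 1;
	}

	dump_stat(argv[1], "blocked_journal_low_on_space");
	dump_stat(argv[1], "blocked_journal_max_in_flight");
	return 0;
}

Build it with any C compiler and pass the UUID that shows up as the
directory name under /sys/fs/bcachefs/.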