On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > There's also the small random 64 byte read case that we haven't optimised
> > for yet. That also bottlenecks on the page refcount atomic op.
> >
> > The proposed solution to that was double-copy; look up the page without
> > bumping its refcount, copy to a buffer, look up the page again to be
> > sure it's still there, copy from the buffer to userspace.
>
> Please stop the cray-cray.
>
> Yes, cache dirtying is expensive. But you don't actually have
> cacheline ping-pong, because you don't have lots of different CPU's
> hammering the same page cache page in any normal circumstances. So the
> really expensive stuff just doesn't exist.

Not ping-pong - you're just blowing the cachelines you want out of L1
with the big usercopy, hardware caches not being fully associative.

> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.

Yes, which leads to a fun game of whack-a-mole: you eliminate one atomic
op and everything just ends up piling up behind a different one - but
for the buffered read path, the folio get/put are the only atomic ops.

> For example, the fact that Kent complains about the page cache and
> talks about large folios is completely ludicrous. I've seen the
> benchmarks of real loads. Kent - you're not close to any limits, you
> are often a factor of two to five off other filesystems. We're not
> talking "a few percent", and we're not talking "the atomics are
> hurting".

Yes, there's a bunch of places where bcachefs is still slow; it'll get
there :)

If you've got those benchmarks handy and they're ones I haven't seen,
I'd love to take a look; the one that always jumps out at people is
small O_DIRECT reads, and that hasn't been a priority because O_DIRECT
doesn't matter to most people nearly as much as they think it does.

There's a bunch of stuff still to work through. Another that comes to
mind is that we need a free-inodes btree to eliminate scanning in inode
create, and that was half a day of work - except it also needs sharding
(i.e. leaf nodes can't span certain boundaries), and for that I need
variable sized btree nodes so we aren't burning stupid amounts of
memory - and that's something we need anyways, with the number of
btrees growing like it is.

Another fun one that I just discovered while I was hanging out at
Darrick's - the journal was stalling on high-iodepth workloads: the
device write buffer fills up, write latency goes up, and suddenly the
journal can't write quickly enough when it's only submitting one write
at a time. So there's a fix queued up for 6.9 that lets the journal
keep multiple writes in flight (rough sketch of the idea at the end of
this mail).

That one was worth mentioning because another fix would've been to add
a way to signal backpressure to /above/ the filesystem, so that we
don't hit such big queuing delays within the filesystem; right now user
writes don't hit backpressure until submit_bio() blocks because the
request queue is full. I've been seeing other performance corner cases
where it looks like such a mechanism would be helpful.

I expect I've got a solid year or two ahead of me of mostly just
working through performance bugs - standing up a lot of automated perf
testing and whatnot.
But, one thing at a time...
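
Since it came up at the top: my reading of the double-copy idea Willy
describes is that it's essentially an optimistic "copy first, validate
after" read - copy without taking a reference, then check the page
wasn't replaced underneath you, and retry or fall back to the
refcounted path if it was. Here's a rough userspace sketch of that
pattern; it is not the actual page cache code - the re-check is
modeled with a sequence counter rather than a second xarray lookup,
the names are made up, and real code would need READ_ONCE()-style
annotations on the data copy to be strictly race-free:

#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 64	/* the small random 64 byte read case */

struct cached_page {
	_Atomic unsigned long	seq;	/* bumped twice when the page is replaced */
	char			data[PAGE_SIZE];
};

/*
 * Returns 0 on success, -1 if we kept losing the race and should fall
 * back to the refcounted slow path.
 */
static int optimistic_read(struct cached_page *pg, char *out, size_t len)
{
	char buf[PAGE_SIZE];
	int retries = 3;

	while (retries--) {
		unsigned long seq = atomic_load_explicit(&pg->seq,
							 memory_order_acquire);
		if (seq & 1)		/* replacement in progress */
			continue;

		memcpy(buf, pg->data, len);	/* copy 1: page -> bounce buffer */

		/* order the data reads before re-reading the sequence number */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&pg->seq, memory_order_relaxed) == seq) {
			memcpy(out, buf, len);	/* copy 2: buffer -> "userspace" */
			return 0;
		}
		/* page was replaced while we were copying; retry */
	}
	return -1;
}

int main(void)
{
	static struct cached_page pg = { .data = "hello" };
	char out[PAGE_SIZE];

	if (!optimistic_read(&pg, out, sizeof(out)))
		printf("read: %s\n", out);
	return 0;
}

The extra memcpy is the cost; the win is that the common case touches
no shared cacheline for the refcount at all.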
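
And to make the journal fix above concrete: conceptually it's just
"allow more than one journal write outstanding at a time" instead of
submit-and-wait. A toy userspace illustration follows - this is not
the bcachefs code; POSIX AIO stands in for async bio submission, the
filename and constants are made up, and a real implementation would
use completion callbacks instead of polling (link with -lrt on older
glibc):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define IN_FLIGHT	4	/* max concurrent "journal" writes */
#define BUF_SIZE	4096
#define NR_WRITES	32

int main(void)
{
	static char bufs[IN_FLIGHT][BUF_SIZE];
	struct aiocb cbs[IN_FLIGHT];
	int busy[IN_FLIGHT] = { 0 };
	int submitted = 0, completed = 0;
	int fd = open("journal.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	while (completed < NR_WRITES) {
		for (int i = 0; i < IN_FLIGHT; i++) {
			if (busy[i]) {
				/* reap this slot if its write finished */
				if (aio_error(&cbs[i]) == EINPROGRESS)
					continue;
				aio_return(&cbs[i]);
				busy[i] = 0;
				completed++;
			}
			if (!busy[i] && submitted < NR_WRITES) {
				/*
				 * Refill the slot immediately - the whole
				 * point: don't wait for the previous write
				 * to complete before starting the next one.
				 */
				memset(bufs[i], 'j', BUF_SIZE);
				memset(&cbs[i], 0, sizeof(cbs[i]));
				cbs[i].aio_fildes = fd;
				cbs[i].aio_buf = bufs[i];
				cbs[i].aio_nbytes = BUF_SIZE;
				cbs[i].aio_offset = (off_t)submitted * BUF_SIZE;
				if (aio_write(&cbs[i]) == 0) {
					busy[i] = 1;
					submitted++;
				}
			}
		}
	}

	close(fd);
	return 0;
}

With one write in flight, throughput is bounded by per-write latency;
keeping a few outstanding hides the latency spike when the device's
write buffer fills up.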