On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> There's also the small random 64 byte read case that we haven't optimised
> for yet. That also bottlenecks on the page refcount atomic op.
>
> The proposed solution to that was double-copy; look up the page without
> bumping its refcount, copy to a buffer, look up the page again to be
> sure it's still there, copy from the buffer to userspace.

Please stop the cray-cray.

Yes, cache dirtying is expensive. But you don't actually have cacheline
ping-pong, because you don't have lots of different CPUs hammering the
same page cache page in any normal circumstances.

So the really expensive stuff just doesn't exist.

I think you've been staring at profiles too much. In instruction-level
profiles, the atomic ops stand out a lot. But that's at least partly
artificial - they are a serialization point on x86, so things get
accounted to them. So they tend to be the collection point for
everything around them in an OoO CPU.

Yes, atomics are bad. But double buffering is worse, and only looks
good if you have some artificial benchmark that does some single-byte
hot-cache read in a loop.

In fact, I get the strong feeling that the complaints come from people
who have looked at bad microbenchmarks a bit too much.

People who have artificially removed the *real* costs by putting their
data on a ramdisk, and then run a microbenchmark on this artificial
setup.

So you have a make-believe benchmark on a make-believe platform, and
you may have started out with the best of intentions ("what are the
limits"), but at some point you took a wrong turn, and turned that
"what are the limits of performance" question into an instruction-level
profile and tried to mis-optimize the limits, instead of realizing that
that is NOT THE POINT of a "what are the limits" question.

The point of doing limit analysis is not to optimize the limit. It's to
see how close you are to that limit in real loads.

And I pretty much guarantee that you aren't close to those limits on
any real loads.

Before filesystem people start doing crazy things like double buffering
to do RCU reading of the page cache, you need to look at yourselves in
the mirror.

For example, the fact that Kent complains about the page cache and
talks about large folios is completely ludicrous. I've seen the
benchmarks of real loads. Kent - you're not close to any limits, you
are often a factor of two to five off other filesystems. We're not
talking "a few percent", and we're not talking "the atomics are
hurting".

So people: wake up and smell the coffee. Don't optimize based off
profiles of micro-benchmarks on made-up platforms. That's for seeing
where the limits are.

And YOU ARE NOT EVEN CLOSE.

Linus
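
PS: for anybody who hasn't been following the thread, the "double copy"
idea quoted at the top boils down to roughly the sketch below. The
helper names (lookup_folio_no_refcount(), folio_data(),
folio_still_cached()) are made up purely for illustration - they are
not existing page cache APIs - but the shape of it is what was
described: lockless lookup, copy into a private bounce buffer under
RCU, revalidate, and only then copy to userspace outside the RCU
section.

/*
 * Hypothetical sketch only: the lookup/revalidate/data helpers below
 * do not exist; they just stand in for the steps described in the
 * quoted text.
 */
static ssize_t double_copy_read(struct address_space *mapping, pgoff_t index,
				size_t offset, size_t len, char __user *buf)
{
	char bounce[64];		/* the small random 64-byte read case */
	struct folio *folio;

	if (len > sizeof(bounce) || offset + len > PAGE_SIZE)
		return -EAGAIN;		/* fall back to the refcounted path */

	rcu_read_lock();
	folio = lookup_folio_no_refcount(mapping, index);	/* hypothetical */
	if (!folio) {
		rcu_read_unlock();
		return -EAGAIN;
	}

	/* First copy: page cache -> private bounce buffer, no folio_get(). */
	memcpy(bounce, folio_data(folio) + offset, len);	/* hypothetical accessor */

	/* Revalidate: is that folio still at this index in the mapping? */
	if (!folio_still_cached(mapping, index, folio)) {	/* hypothetical */
		rcu_read_unlock();
		return -EAGAIN;
	}
	rcu_read_unlock();

	/* Second copy: bounce buffer -> userspace, outside the RCU section. */
	if (copy_to_user(buf, bounce, len))
		return -EFAULT;

	return len;
}

That extra memcpy() into the bounce buffer is the double buffering I'm
talking about above: it's what gets traded against the refcount atomic.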