On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> There's also the small random 64 byte read case that we haven't optimised
> for yet. That also bottlenecks on the page refcount atomic op.
>
> The proposed solution to that was double-copy; look up the page without
> bumping its refcount, copy to a buffer, look up the page again to be
> sure it's still there, copy from the buffer to userspace.

Please stop the cray-cray.

Yes, cache dirtying is expensive. But you don't actually have cacheline
ping-pong, because you don't have lots of different CPUs hammering the
same page cache page in any normal circumstances.

So the really expensive stuff just doesn't exist.

I think you've been staring at profiles too much. In instruction-level
profiles, the atomic ops stand out a lot. But that's at least partly
artificial - they are a serialization point on x86, so things get
accounted to them. So they tend to be the collection point for
everything around them in an OoO CPU.

Yes, atomics are bad. But double buffering is worse, and only looks
good if you have some artificial benchmark that does some single-byte
hot-cache read in a loop.

In fact, I get the strong feeling that the complaints come from people
who have looked at bad microbenchmarks a bit too much.

People who have artificially removed the *real* costs by putting their
data on a ramdisk, and then run a microbenchmark on this artificial
setup.

So you have a make-believe benchmark on a make-believe platform, and
you may have started out with the best of intentions ("what are the
limits"), but at some point you took a wrong turn, and turned that
"what are the limits of performance" question into an instruction-level
profile and tried to mis-optimize the limits, instead of realizing that
that is NOT THE POINT of a "what are the limits" question.

The point of doing limit analysis is not to optimize the limit. It's to
see how close you are to that limit in real loads.

And I pretty much guarantee that you aren't close to those limits on
any real loads.

Before filesystem people start doing crazy things like double buffering
to do RCU reading of the page cache, you need to look at yourselves in
the mirror.

For example, the fact that Kent complains about the page cache and
talks about large folios is completely ludicrous. I've seen the
benchmarks of real loads. Kent - you're not close to any limits, you
are often a factor of two to five off other filesystems. We're not
talking "a few percent", and we're not talking "the atomics are
hurting".

So people: wake up and smell the coffee. Don't optimize based off
profiles of micro-benchmarks on made-up platforms. That's for seeing
where the limits are.

And YOU ARE NOT EVEN CLOSE.

Linus
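
PS: for anybody who hasn't been following the thread, the "double copy"
idea quoted at the top boils down to roughly the sketch below. The
helper names (lookup_folio_no_refcount(), folio_data(),
folio_still_cached()) are made up purely for illustration - they are
not existing page cache APIs - but the shape of it is what was
described: lockless lookup, copy into a private bounce buffer under
RCU, revalidate, and only then copy to userspace outside the RCU
section.

/*
 * Hypothetical sketch only: the lookup/revalidate/data helpers below
 * do not exist; they just stand in for the steps described in the
 * quoted text.
 */
static ssize_t double_copy_read(struct address_space *mapping, pgoff_t index,
				size_t offset, size_t len, char __user *buf)
{
	char bounce[64];		/* the small random 64-byte read case */
	struct folio *folio;

	if (len > sizeof(bounce) || offset + len > PAGE_SIZE)
		return -EAGAIN;		/* fall back to the refcounted path */

	rcu_read_lock();
	folio = lookup_folio_no_refcount(mapping, index);	/* hypothetical */
	if (!folio) {
		rcu_read_unlock();
		return -EAGAIN;
	}

	/* First copy: page cache -> private bounce buffer, no folio_get(). */
	memcpy(bounce, folio_data(folio) + offset, len);	/* hypothetical accessor */

	/* Revalidate: is that folio still at this index in the mapping? */
	if (!folio_still_cached(mapping, index, folio)) {	/* hypothetical */
		rcu_read_unlock();
		return -EAGAIN;
	}
	rcu_read_unlock();

	/* Second copy: bounce buffer -> userspace, outside the RCU section. */
	if (copy_to_user(buf, bounce, len))
		return -EFAULT;

	return len;
}

That extra memcpy() into the bounce buffer is the double buffering I'm
talking about above: it's what gets traded against the refcount atomic.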