On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > Willy - tangential side note: I looked closer at the issue that you
> > reported (indirectly) with the small reads during heavy write
> > activity.
> >
> > Our _reading_ side is very optimized and has none of the write-side
> > oddities that I can see, and we just have
> >
> >   filemap_read ->
> >     filemap_get_pages ->
> >       filemap_get_read_batch ->
> >         folio_try_get_rcu()
> >
> > and there is no page locking or other locking involved (assuming the
> > page is cached and marked uptodate etc, of course).
> >
> > So afaik, it really is just that *one* atomic access (and the
> > matching page ref decrement afterwards).
>
> Yep, that was what the customer reported on their ancient kernel, and
> we at least didn't make that worse ...
>
> > We could easily do all of this without getting any ref to the page
> > at all if we did the page cache release with RCU (and the user copy
> > with "copy_to_user_atomic()"). Honestly, anything else looks like a
> > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > really *only* for tiny reads where we could have that buffer on the
> > stack.
> >
> > Are tiny reads (handwaving: 100 bytes or less) really worth
> > optimizing for to that degree?
> >
> > In contrast, the RCU-delaying of the page cache might be a good idea
> > in general. We've had other situations where that would have been
> > nice. The main worry would be low-memory situations, I suspect.
> >
> > The "tiny read" optimization smells like a benchmark thing to me.
> > Even with the cacheline possibly bouncing, the system call overhead
> > for tiny reads (particularly with all the mitigations) should be
> > orders of magnitude higher than two atomic accesses.
>
> Ah, good point about the $%^&^*^ mitigations. This was pre
> mitigations. I suspect that this customer would simply disable them;
> afaik the machine is an appliance and one interacts with it purely by
> sending transactions to it (it's not even an SQL system, much less a
> "run arbitrary javascript" kind of system). But that makes it even
> more special case, inapplicable to the majority of workloads and
> closer to smelling like a benchmark.
>
> I've thought about and rejected RCU delaying of the page cache in the
> past. With the majority of memory in anon memory & file memory, it
> just feels too risky to have so much memory waiting to be reused. We
> could also improve gup-fast if we could rely on RCU freeing of anon
> memory. Not sure what workloads might benefit from that, though.

The amount of memory being allocated and freed through RCU can already
be fairly significant depending on the workload, and I'd expect that to
grow - we really just need a way for reclaim to kick RCU when needed
(and probably to add a percpu counter for "amount of memory stranded
until the next RCU grace period").
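
Roughly what I have in mind - just a sketch, with a made-up reclaim
hook and an arbitrary threshold, none of this is real code yet:

        #include <linux/percpu_counter.h>
        #include <linux/rcupdate.h>

        /* Initialized elsewhere with percpu_counter_init(). */
        static struct percpu_counter rcu_stranded_bytes;

        /*
         * Called wherever we queue memory for RCU-deferred freeing;
         * the matching percpu_counter_sub() goes in whatever callback
         * actually frees the memory after the grace period.
         */
        static inline void note_rcu_stranded(long bytes)
        {
                percpu_counter_add(&rcu_stranded_bytes, bytes);
        }

        /*
         * Hypothetical hook for reclaim: if too much memory is stuck
         * waiting on a grace period, pay the cost of expediting one
         * instead of shrinking memory that's actually in use.
         */
        static void reclaim_kick_rcu(void)
        {
                if (percpu_counter_read_positive(&rcu_stranded_bytes) >
                    (64UL << 20))
                        synchronize_rcu_expedited();
        }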
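
(For reference, the read side Linus is describing boils down to
roughly this shape - heavily simplified from mm/filemap.c, with the
retry and shadow-entry handling trimmed out:)

        #include <linux/pagemap.h>
        #include <linux/xarray.h>

        static struct folio *lockless_lookup(struct address_space *mapping,
                                             pgoff_t index)
        {
                XA_STATE(xas, &mapping->i_pages, index);
                struct folio *folio;

                rcu_read_lock();
                folio = xas_load(&xas);
                if (folio && !xa_is_value(folio) &&
                    folio_try_get_rcu(folio)) {
                        /*
                         * The slot may have been reused while we took
                         * the ref; recheck and back off if so.
                         */
                        if (folio != xas_reload(&xas)) {
                                folio_put(folio);
                                folio = NULL;
                        }
                }
                rcu_read_unlock();
                /*
                 * Caller copies the data out, then drops the ref with
                 * folio_put() - those are the two atomics per read.
                 */
                return folio;
        }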