On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > Willy - tangential side note: I looked closer at the issue that you
> > reported (indirectly) with the small reads during heavy write
> > activity.
> >
> > Our _reading_ side is very optimized and has none of the write-side
> > oddities that I can see, and we just have
> >
> >   filemap_read ->
> >     filemap_get_pages ->
> >       filemap_get_read_batch ->
> >         folio_try_get_rcu()
> >
> > and there is no page locking or other locking involved (assuming the
> > page is cached and marked uptodate etc, of course).
> >
> > So afaik, it really is just that *one* atomic access (and the
> > matching page ref decrement afterwards).
>
> Yep, that was what the customer reported on their ancient kernel, and
> we at least didn't make that worse ...
>
> > We could easily do all of this without getting any ref to the page
> > at all if we did the page cache release with RCU (and the user copy
> > with "copy_to_user_atomic()"). Honestly, anything else looks like a
> > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > really *only* for tiny reads where we could have that buffer on the
> > stack.
> >
> > Are tiny reads (handwaving: 100 bytes or less) really worth
> > optimizing for to that degree?
> >
> > In contrast, the RCU-delaying of the page cache might be a good idea
> > in general. We've had other situations where that would have been
> > nice. The main worry would be low-memory situations, I suspect.
> >
> > The "tiny read" optimization smells like a benchmark thing to me.
> > Even with the cacheline possibly bouncing, the system call overhead
> > for tiny reads (particularly with all the mitigations) should be
> > orders of magnitude higher than two atomic accesses.
>
> Ah, good point about the $%^&^*^ mitigations. This was pre
> mitigations. I suspect that this customer would simply disable them;
> afaik the machine is an appliance and one interacts with it purely by
> sending transactions to it (it's not even an SQL system, much less a
> "run arbitrary javascript" kind of system). But that makes it even
> more special case, inapplicable to the majority of workloads and
> closer to smelling like a benchmark.
>
> I've thought about and rejected RCU delaying of the page cache in the
> past. With the majority of memory in anon memory & file memory, it
> just feels too risky to have so much memory waiting to be reused. We
> could also improve gup-fast if we could rely on RCU freeing of anon
> memory. Not sure what workloads might benefit from that, though.

The amount of memory being allocated and freed through RCU can already
be fairly significant depending on the workload, and I'd expect that to
grow - we really just need a way for reclaim to kick RCU when needed
(and probably to add a percpu counter for "amount of memory stranded
until the next RCU grace period").
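
Roughly what I have in mind - just a sketch, with a made-up reclaim
hook and an arbitrary threshold, none of this is real code yet:

        #include <linux/percpu_counter.h>
        #include <linux/rcupdate.h>

        /* Initialized elsewhere with percpu_counter_init(). */
        static struct percpu_counter rcu_stranded_bytes;

        /*
         * Called wherever we queue memory for RCU-deferred freeing;
         * the matching percpu_counter_sub() goes in whatever callback
         * actually frees the memory after the grace period.
         */
        static inline void note_rcu_stranded(long bytes)
        {
                percpu_counter_add(&rcu_stranded_bytes, bytes);
        }

        /*
         * Hypothetical hook for reclaim: if too much memory is stuck
         * waiting on a grace period, pay the cost of expediting one
         * instead of shrinking memory that's actually in use.
         */
        static void reclaim_kick_rcu(void)
        {
                if (percpu_counter_read_positive(&rcu_stranded_bytes) >
                    (64UL << 20))
                        synchronize_rcu_expedited();
        }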
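
(For reference, the read side Linus is describing boils down to
roughly this shape - heavily simplified from mm/filemap.c, with the
retry and shadow-entry handling trimmed out:)

        #include <linux/pagemap.h>
        #include <linux/xarray.h>

        static struct folio *lockless_lookup(struct address_space *mapping,
                                             pgoff_t index)
        {
                XA_STATE(xas, &mapping->i_pages, index);
                struct folio *folio;

                rcu_read_lock();
                folio = xas_load(&xas);
                if (folio && !xa_is_value(folio) &&
                    folio_try_get_rcu(folio)) {
                        /*
                         * The slot may have been reused while we took
                         * the ref; recheck and back off if so.
                         */
                        if (folio != xas_reload(&xas)) {
                                folio_put(folio);
                                folio = NULL;
                        }
                }
                rcu_read_unlock();
                /*
                 * Caller copies the data out, then drops the ref with
                 * folio_put() - those are the two atomics per read.
                 */
                return folio;
        }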