Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Tue, 27 Feb 2024 10:52:51 -0500

On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> I could simply use the same general approach that I use within RCU
> itself, which currently has absolutely no idea how much memory (if any)
> that each callback will free.  Especially given that some callbacks
> free groups of memory blocks, while others free nothing.  ;-)
> 
> Alternatively, we could gather statistics on the amount of memory freed
> by each callback and use that as an estimate.
> 
> But we should instead step back and ask exactly what we are trying to
> accomplish here, which just might be what Dave Chinner was getting at.
> 
> At a ridiculously high level, reclaim is looking for memory to free.
> Some read-only memory can often be dropped immediately on the grounds
> that its data can be read back in if needed.  Other memory can only be
> dropped after being written out, which involves a delay.  There are of
> course many other complications, but this will do for a start.
> 
> So, where does RCU fit in?
> 
> RCU fits in between the two.  With memory awaiting RCU, there is no need
> to write anything out, but there is a delay.  As such, memory waiting
> for an RCU grace period is similar to memory that is to be reclaimed
> after its I/O completes.
> 
> One complication, and a complication that we are considering exploiting,
> is that, unlike reclaimable memory waiting for I/O, we could often
> (but not always) have some control over how quickly RCU's grace periods
> complete.  And we already do this programmatically by using the choice
> between synchronize_rcu() and synchronize_rcu_expedited().  The question
> is whether we should expedite normal RCU grace periods during reclaim,
> and if so, under what conditions.
> 
> You identified one potential condition, namely the amount of memory
> waiting to be reclaimed.  One complication with this approach is that RCU
> has no idea how much memory each callback represents, and for call_rcu(),
> there is no way for it to find out.  For kfree_rcu(), there are ways,
> but as you know, I am questioning whether those ways are reasonable from
> a performance perspective.  But even if they are, we would be accepting
> more error from the memory waiting via call_rcu() than we would be
> accepting if we just counted blocks instead of bytes for kfree_rcu().

You're _way_ overcomplicating this.

What matters is the cost of __ksize() relative to kfree_rcu(). __ksize()
is already pretty cheap, and with the SLAB allocator gone and space
available in struct slab, we can get it down to a single load.
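
To make the byte-accounting idea concrete, here is a minimal userspace
sketch, not kernel code. It assumes the object size is available from
one field read, standing in for what __ksize() would return. Every name
in it (model_obj, model_alloc, model_kfree_rcu, model_grace_period_end,
pending_bytes, pending_blocks) is hypothetical and exists only for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only: none of these names exist in the kernel. */
struct model_obj {
	size_t size;             /* stands in for the size __ksize() would return */
	struct model_obj *next;  /* link in the pending list */
};

static struct model_obj *pending_head;
static size_t pending_bytes;   /* bytes queued, waiting for a grace period */
static size_t pending_blocks;  /* blocks queued: the count-only estimate */

/* Allocate a payload of 'size' bytes behind a model header. */
static void *model_alloc(size_t size)
{
	struct model_obj *obj = malloc(sizeof(*obj) + size);

	if (!obj)
		return NULL;
	obj->size = size;
	obj->next = NULL;
	return obj + 1;
}

/* Models kfree_rcu(): one size lookup, then defer the free. */
static void model_kfree_rcu(void *ptr)
{
	struct model_obj *obj = (struct model_obj *)ptr - 1;

	pending_bytes += obj->size;  /* the single load */
	pending_blocks++;
	obj->next = pending_head;
	pending_head = obj;
}

/* Models the end of a grace period: free everything queued so far. */
static void model_grace_period_end(void)
{
	while (pending_head) {
		struct model_obj *obj = pending_head;

		pending_head = obj->next;
		assert(pending_bytes >= obj->size);
		pending_bytes -= obj->size;
		pending_blocks--;
		free(obj);
	}
}
```

Queueing a 64-byte and a 4096-byte object leaves pending_blocks at 2 but
pending_bytes at 4160, which is the gap between counting blocks and
counting bytes.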

> Let me reiterate that:  The estimation error that you are objecting to
> for kfree_rcu() is completely and utterly unavoidable for call_rcu().

Hardly. Callsites that free memory themselves after an RCU grace period
can do the accounting manually, if they're hot enough to matter. Most
aren't.
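
A rough userspace sketch of what manual accounting at a call_rcu()
callsite could look like. This is a model, not kernel code: the
callback machinery is simulated, and model_rcu_head, model_call_rcu,
model_run_callbacks, rcu_pending_bytes and big_table are all
hypothetical names. The point is only that the callsite knows how much
its callback will free even though the callback queue does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only: these names are illustrative, not kernel APIs. */
struct model_rcu_head {
	struct model_rcu_head *next;
	void (*func)(struct model_rcu_head *head);
};

static struct model_rcu_head *cb_list;
static size_t rcu_pending_bytes;  /* hypothetical counter kept by callsites */

/* Models call_rcu(): it sees only a callback pointer, never a size. */
static void model_call_rcu(struct model_rcu_head *head,
			   void (*func)(struct model_rcu_head *head))
{
	head->func = func;
	head->next = cb_list;
	cb_list = head;
}

/* Models a grace period ending: invoke every queued callback. */
static void model_run_callbacks(void)
{
	while (cb_list) {
		struct model_rcu_head *head = cb_list;

		cb_list = head->next;
		head->func(head);
	}
}

/* A hot callsite whose callback frees a large buffer. */
struct big_table {
	struct model_rcu_head rcu;  /* first member, so the cast below is valid */
	size_t nbytes;              /* size of 'buckets', known to the callsite */
	void *buckets;
};

static void big_table_free_cb(struct model_rcu_head *head)
{
	struct big_table *t = (struct big_table *)head;

	assert(rcu_pending_bytes >= t->nbytes);
	rcu_pending_bytes -= t->nbytes;  /* memory is about to be freed */
	free(t->buckets);
	free(t);
}

/* The callsite does the accounting itself before deferring the free. */
static void big_table_free_rcu(struct big_table *t)
{
	rcu_pending_bytes += t->nbytes;
	model_call_rcu(&t->rcu, big_table_free_cb);
}
```

Only callsites hot enough to matter would need a wrapper like
big_table_free_rcu(); everything else can keep calling call_rcu()
unaccounted.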

and with memory allocation profiling coming, which also tracks # of
allocations, we'll also have an easy way to spot those.