Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 27 Feb 2024 09:58:48 -0800

On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > > At a ridiculously high level, reclaim is looking for memory to free.
> > > > Some read-only memory can often be dropped immediately on the grounds
> > > > that its data can be read back in if needed.  Other memory can only be
> > > > dropped after being written out, which involves a delay.  There are of
> > > > course many other complications, but this will do for a start.
> > > 
> > > Hi Paul,
> > > 
> > > I appreciate the necessity of describing what's going on at a very high
> > > level, but there's a wrinkle that I'm not sure you're aware of which
> > > may substantially change your argument.
> > > 
> > > For anonymous memory, we do indeed wait until reclaim to start writing it
> > > to swap.  That may or may not be the right approach given how anonymous
> > > memory is used (and could be the topic of an interesting discussion
> > > at LSFMM).
> > > 
> > > For file-backed memory, we do not write back memory in reclaim.  If it
> > > has got to the point of calling ->writepage in vmscan, things have gone
> > > horribly wrong to the point where calling ->writepage will make things
> > > worse.  This is why we're currently removing ->writepage from every
> > > filesystem (only ->writepages will remain).  Instead, the page cache
> > > is written back much earlier, once we get to balance_dirty_pages().
> > > That lets us write pages in filesystem-friendly ways instead of in MM
> > > LRU order.
> > 
> > Thank you for the additional details.
> > 
> > But please allow me to further summarize the point of my prior email
> > that seems to be getting lost:
> > 
> > 1.	RCU already does significant work prodding grace periods.
> > 
> > 2.	There is no reasonable way to provide estimates of the
> > 	memory sent to RCU via call_rcu(), and in many cases
> > 	the bulk of the waiting memory will be call_rcu() memory.
> > 
> > Therefore, if we cannot come up with a heuristic that does not need to
> > know the bytes of memory waiting, we are stuck anyway.
> 
> That is a completely asinine argument.

Huh.  Anything else you need to get off your chest?

On the off-chance it is unclear, I do disagree with your assessment.

> > So perhaps the proper heuristic for RCU speeding things up is simply
> > "Hey RCU, we are in reclaim!".
> 
> Because that's the wrong heuristic. There are important workloads for
> which we're _always_ in reclaim, but as long as RCU grace periods are
> happening at some steady rate, the amount of memory stranded will be
> bounded and there's no reason to expedite grace periods.

RCU is in fact designed to handle heavy load, and contains a number of
mechanisms to operate more efficiently at higher load than at lower load.
It also contains mechanisms to expedite grace periods under heavy load,
some of which I already described in earlier emails on this thread.

> If we start RCU freeing all pagecache folios we're going to be cycling
> memory through RCU freeing at the rate of gigabytes per second, tens of
> gigabytes per second on high end systems.

The load on RCU would be measured in terms of requests (kfree_rcu() and
friends) per unit time per CPU, not in terms of gigabytes per unit time.
Of course, the amount of memory per unit time might be an issue for
whatever allocators you are using, and as Matthew has often pointed out,
the deferred reclamation incurs additional cache misses.
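
To make the units concrete, here is a back-of-the-envelope sketch of the
conversion from bytes per unit time to requests per unit time per CPU.
Every number in it is a made-up assumption for illustration (10 GiB/s
freed, 64 CPUs, one deferred-free request per folio, even spread across
CPUs), and the helper name is hypothetical.  It is not a measurement.

```python
# Convert a memory-freeing rate into deferred-free requests per second
# per CPU.  All inputs below are hypothetical, chosen only to show how
# strongly the answer depends on folio size rather than on bytes/s.

def requests_per_sec_per_cpu(bytes_per_sec, folio_bytes, nr_cpus):
    """Requests/s/CPU, assuming one request per folio and an even
    spread of requests across CPUs (both are simplifications)."""
    if folio_bytes <= 0 or nr_cpus <= 0:
        raise ValueError("folio_bytes and nr_cpus must be positive")
    return bytes_per_sec / folio_bytes / nr_cpus

GIB = 1 << 30

# The same assumed 10 GiB/s, with two assumed folio sizes, on 64 CPUs.
small = requests_per_sec_per_cpu(10 * GIB, 4096, 64)
large = requests_per_sec_per_cpu(10 * GIB, 2 * 1024 * 1024, 64)

print(f"4 KiB folios: {small:.0f} requests/s/CPU")   # 40960
print(f"2 MiB folios: {large:.0f} requests/s/CPU")   # 80
```

The same byte rate differs by a factor of 512 in request rate between
these two assumed folio sizes, which is why the per-CPU request rate is
the number to measure.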

And rcutorture really does do forward-progress testing on vanilla RCU
that features tight in-kernel loops doing call_rcu() without bounds on
memory in flight, and RCU routinely survives this in a VM that is given
only 512MB of memory.  In fact, any failure to survive this is considered
a bug to be fixed.

So I suspect RCU is quite capable of handling this load.  But how many
additional kfree_rcu() calls are you anticipating per unit time per CPU?
For example, could we simply measure the rate at which pagecache folios
are currently being freed under heavy load?  Given that information,
we could just try it out.
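
Once such a rate has been measured, the stranded-memory concern can be
estimated in the same rough way.  The sketch below uses Little's law
(memory in flight = freeing rate times average wait).  The 10 GiB/s rate
and the wait times are placeholders, not properties of any particular
RCU configuration, and the helper name is hypothetical.

```python
# Estimate memory waiting on deferred freeing as rate * average wait.
# The rate and the waits are assumed values for illustration only;
# real numbers would have to come from measurement.

def stranded_bytes(free_rate_bytes_per_sec, avg_wait_sec):
    """Little's law: bytes in flight = arrival rate * average wait."""
    if free_rate_bytes_per_sec < 0 or avg_wait_sec < 0:
        raise ValueError("rate and wait must be non-negative")
    return free_rate_bytes_per_sec * avg_wait_sec

GIB = 1 << 30
MIB = 1 << 20

for wait_ms in (10, 50, 200):
    waiting = stranded_bytes(10 * GIB, wait_ms / 1000)
    print(f"{wait_ms:4d} ms average wait -> {waiting / MIB:7.1f} MiB waiting")
```

As long as the average wait stays bounded, so does the estimate, and
plugging in a measured rate and wait would show whether the result is
acceptable for a given system.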

> Do you put hard limits on how long we can go before an RCU grace period
> that will limit the amount of memory stranded to something acceptable?
> Yes or no?

C'mon, Kent, you can do better than this.

Given what you have described, I believe that RCU would have no problem
with it.  However, additional information would be welcome.  As always,
I could well be missing something.

And yes, kernel developers can break RCU (with infinite loops in RCU
readers being a popular approach), and system administrators can break
RCU, for example, by confining the processing of offloaded callbacks to
a single CPU on a 128-CPU system.  Don't do that.

							Thanx, Paul