Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 28 Feb 2024 18:55:12 -0500

On Tue, Feb 27, 2024 at 09:58:48AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:
> > On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> > > > On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > > > > At a ridiculously high level, reclaim is looking for memory to free.
> > > > > Some read-only memory can often be dropped immediately on the grounds
> > > > > that its data can be read back in if needed.  Other memory can only be
> > > > > dropped after being written out, which involves a delay.  There are of
> > > > > course many other complications, but this will do for a start.
> > > > 
> > > > Hi Paul,
> > > > 
> > > > I appreciate the necessity of describing what's going on at a very high
> > > > level, but there's a wrinkle that I'm not sure you're aware of which
> > > > may substantially change your argument.
> > > > 
> > > > For anonymous memory, we do indeed wait until reclaim to start writing it
> > > > to swap.  That may or may not be the right approach given how anonymous
> > > > memory is used (and could be the topic of an interesting discussion
> > > > at LSFMM).
> > > > 
> > > > For file-backed memory, we do not write back memory in reclaim.  If it
> > > > has got to the point of calling ->writepage in vmscan, things have gone
> > > > horribly wrong to the point where calling ->writepage will make things
> > > > worse.  This is why we're currently removing ->writepage from every
> > > > filesystem (only ->writepages will remain).  Instead, the page cache
> > > > is written back much earlier, once we get to balance_dirty_pages().
> > > > That lets us write pages in filesystem-friendly ways instead of in MM
> > > > LRU order.
> > > 
> > > Thank you for the additional details.
> > > 
> > > But please allow me to further summarize the point of my prior email
> > > that seems to be getting lost:
> > > 
> > > 1.	RCU already does significant work prodding grace periods.
> > > 
> > > 2.	There is no reasonable way to provide estimates of the
> > > 	memory sent to RCU via call_rcu(), and in many cases
> > > 	the bulk of the waiting memory will be call_rcu() memory.
> > > 
> > > Therefore, if we cannot come up with a heuristic that does not need to
> > > know the bytes of memory waiting, we are stuck anyway.
> > 
> > That is a completely asinine argument.
> 
> Huh.  Anything else you need to get off your chest?
> 
> On the off-chance it is unclear, I do disagree with your assessment.
> 
> > > So perhaps the proper heuristic for RCU speeding things up is simply
> > > "Hey RCU, we are in reclaim!".
> > 
> > Because that's the wrong heuristic. There are important workloads for
> > which  we're _always_ in reclaim, but as long as RCU grace periods are
> > happening at some steady rate, the amount of memory stranded will be
> > bounded and there's no reason to expedite grace periods.
> 
> RCU is in fact designed to handle heavy load, and contains a number of
> mechanisms to operate more efficiently at higher load than at lower load.
> It also contains mechanisms to expedite grace periods under heavy load.
> Some of which I already described in earlier emails on this thread.

Yeah, synchronize_rcu_expedited() sounds like exactly what we need
here, when memory reclaim notices too much memory is stranded.

> > If we start RCU freeing all pagecache folios we're going to be cycling
> > memory through RCU freeing at the rate of gigabytes per second, tens of
> > gigabytes per second on high end systems.
> 
> The load on RCU would be measured in terms of requests (kfree_rcu() and
> friends) per unit time per CPU, not in terms of gigabytes per unit time.
> Of course, the amount of memory per unit time might be an issue for
> whatever allocators you are using, and as Matthew has often pointed out,
> the deferred reclamation incurs additional cache misses.

So what I'm saying is that in the absence of something noticing
excessive memory being stranded and asking for an expedited grace
period, the only bound on the amount of memory being stranded will be
how often RCU grace periods expire when no one is asking for them.
That was my question to you. I'm not at all knowledgeable on RCU
internals, but I gather it varies with things like dynticks and whether
or not userspace is calling into the kernel?

"gigabytes per second" - if userspace is doing big sequential streaming
reads that don't fit into cache, we'll be evicting pagecache as quickly
as we can read it in, so we should only be limited by your SSD
bandwidth.
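
To put a rough number on that: in steady state, the memory stranded is
roughly the eviction rate multiplied by how long a freed folio waits for
a grace period. The helper below is only back-of-the-envelope arithmetic;
the function name is made up for illustration, and the 100 ms latency
used in the example is an assumed figure, not a measured one.

```c
#include <stdint.h>

/*
 * Rough steady-state estimate (illustrative only, not kernel code):
 * bytes stranded ~= eviction rate * grace-period latency.
 * Both inputs are assumptions supplied by the caller.
 */
static uint64_t stranded_bound_bytes(uint64_t evict_bytes_per_sec,
				     uint64_t gp_latency_ms)
{
	/* Divide first to keep the intermediate value small. */
	return evict_bytes_per_sec / 1000 * gp_latency_ms;
}
```

With an assumed 10 GB/s of pagecache eviction and an assumed 100 ms
wait, that comes to about 1 GB stranded at any moment, and it scales
linearly if grace periods get slower.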

> And rcutorture really does do forward-progress testing on vanilla RCU
> that features tight in-kernel loops doing call_rcu() without bounds on
> memory in flight, and RCU routinely survives this in a VM that is given
> only 512MB of memory.  In fact, any failure to survive this is considered
> a bug to be fixed.

Are you saying there's already feedback between memory reclaim and RCU?

> So I suspect RCU is quite capable of handling this load.  But how many
> additional kfree_rcu() calls are you anticipating per unit time per CPU?
> For example, could we simply measure the rate at which pagecache folios
> are currently being freed under heavy load?  Given that information,
> we could just try it out.

It's not the load I'm concerned about, or the number of call_rcu()
calls; I have no doubt that RCU will cope with that just fine.

But I do think that we need an additional feedback mechanism here. When
we're doing big streaming sequential buffered IO, and lots of memory is
cycled in and out of the pagecache, we have a couple of things we want
to avoid:

The main thing is that we don't want the amount of memory stranded
waiting for RCU to grow unbounded and shove everything out of the
caches. If you're currently just concerned about _deadlock_, that is
likely insufficient here: we're also concerned about maintaining good
steady performance under load.

We don't want memory reclaim to be trying harder and harder when the
correct thing to do is a synchronize_rcu_expedited().

We _also_ don't want to be hammering on RCU asking for expedited grace
periods unnecessarily when the number of pending callbacks is high, but
they're all for unrelated stuff - expedited RCU grace periods aren't
free either!
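
As a minimal sketch of the kind of feedback described above, assuming we
account separately for pagecache bytes handed to RCU: reclaim would look
at that counter, not at the total number of pending callbacks, before
asking for an expedited grace period. Everything here (struct
stranded_acct, the helpers, the threshold) is hypothetical and runs in
userspace; it is not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting of reclaimable memory waiting on RCU. */
struct stranded_acct {
	size_t stranded_bytes;	/* bytes queued for RCU-deferred freeing */
	size_t limit_bytes;	/* assumed threshold for asking RCU to hurry */
};

/* Would be called when a folio is handed to RCU for deferred freeing. */
static void acct_defer_free(struct stranded_acct *a, size_t bytes)
{
	a->stranded_bytes += bytes;
}

/* Would be called from the RCU callback once the memory is really free. */
static void acct_freed(struct stranded_acct *a, size_t bytes)
{
	assert(a->stranded_bytes >= bytes);
	a->stranded_bytes -= bytes;
}

/*
 * Reclaim's question: is enough of *our* memory stranded that an
 * expedited grace period beats scanning harder? Pending callbacks
 * for unrelated users do not count towards this.
 */
static bool reclaim_should_expedite(const struct stranded_acct *a)
{
	return a->stranded_bytes > a->limit_bytes;
}
```

In a real implementation the true branch is where reclaim would call
synchronize_rcu_expedited() (or some asynchronous equivalent) instead of
trying harder, and the counter would need to be per-CPU or atomic.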

Does that help clarify things?