Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU

"NeilBrown" <neilb@xxxxxxx> · Fri, 01 Mar 2024 13:16:18 +1100

On Thu, 29 Feb 2024, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > Hello!
> > >
> > > Recent discussions [1] suggest that greater mutual understanding between
> > > memory reclaim on the one hand and RCU on the other might be in order.
> > >
> > > One possibility would be an open discussion.  If it would help, I would
> > > be happy to describe how RCU reacts and responds to heavy load, along with
> > > some ways that RCU's reactions and responses could be enhanced if needed.
> > >
> > 
> > Adding fsdevel as this should probably be a cross track session.
> 
> Perhaps broaden this slightly.  On the THP Cabal call we just had a
> conversation about the requirements on filesystems in the writeback
> path.  We currently tell filesystem authors that the entire writeback
> path must avoid allocating memory in order to prevent deadlock (or use
> GFP_MEMALLOC).  Is this appropriate?  It's a lot of work to assure that
> writing pagecache back will not allocate memory in, eg, the network stack,
> the device driver, and any other layers the write must traverse.
> 
> With the removal of ->writepage from vmscan, perhaps we can make
> filesystem authors' lives easier by relaxing this requirement as pagecache
> should be cleaned long before we get to reclaiming it.
> 
> I don't think there's anything to be done about swapping anon memory.
> We probably don't want to proactively write anon memory to swap, so by
> the time we're in ->swap_rw we really are low on memory.
> 
> 

While we are considering revising mm rules, I would really like to
revise the rule that GFP_KERNEL allocations are allowed to fail.
I'm not at all sure that they ever do (except for large allocations - so
maybe we could leave that exception in - or warn if large allocations
are tried without a MAY_FAIL flag).

Given that GFP_KERNEL can wait, and that the mm can kill off processes
and clear cache to free memory, there should be no case where failure is
needed or where simply waiting will not eventually result in success.
And if there is, the machine is a goner anyway.

Once upon a time user-space pages could not be ripped out of a process
by the oom killer until the process actually exited, and that meant that
GFP_KERNEL allocations of a process being oom killed should not block
indefinitely in the allocator.  I *think* that isn't the case any more.

Insisting that GFP_KERNEL allocations never return NULL would allow us
to remove a lot of untested error handling code...
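
As a rough illustration of the kind of error handling in question, here
is a user-space sketch, not kernel code.  Everything named sketch_* is
hypothetical: sketch_alloc() merely stands in for a GFP_KERNEL
allocation, and sketch_fail_after is a fault-injection hook added so the
unwind paths can be reached at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical fault-injection hook (not a kernel facility):
 * number of allocations that succeed before one fails; -1 = never fail.
 */
static int sketch_fail_after = -1;

/* Hypothetical stand-in for a GFP_KERNEL allocation, backed by malloc(). */
static void *sketch_alloc(size_t size)
{
	if (sketch_fail_after == 0) {
		sketch_fail_after = -1;	/* one-shot failure */
		return NULL;
	}
	if (sketch_fail_after > 0)
		sketch_fail_after--;
	return malloc(size);
}

struct sketch_req {
	char *name;
	char *buf;
};

/*
 * Three allocations, so three failure points and two unwind labels.
 * Without fault injection the labels below are effectively never run.
 */
static struct sketch_req *sketch_req_create(const char *name, size_t len)
{
	struct sketch_req *req;

	assert(name != NULL);

	req = sketch_alloc(sizeof(*req));
	if (!req)
		return NULL;

	req->name = sketch_alloc(strlen(name) + 1);
	if (!req->name)
		goto err_free_req;
	strcpy(req->name, name);

	req->buf = sketch_alloc(len);
	if (!req->buf)
		goto err_free_name;

	return req;

err_free_name:
	free(req->name);
err_free_req:
	free(req);
	return NULL;
}

static void sketch_req_destroy(struct sketch_req *req)
{
	free(req->buf);
	free(req->name);
	free(req);
}
```

If the allocator could not return NULL, all three checks and both labels
in sketch_req_create() would go away, and callers would not need a NULL
check of their own either.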

NeilBrown