Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Thu, 29 Feb 2024 21:39:17 -0500

On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> On Thu, 29 Feb 2024, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Recent discussions [1] suggest that greater mutual understanding between
> > > > memory reclaim on the one hand and RCU on the other might be in order.
> > > >
> > > > One possibility would be an open discussion.  If it would help, I would
> > > > be happy to describe how RCU reacts and responds to heavy load, along with
> > > > some ways that RCU's reactions and responses could be enhanced if needed.
> > > >
> > > 
> > > Adding fsdevel as this should probably be a cross track session.
> > 
> > Perhaps broaden this slightly.  On the THP Cabal call we just had a
> > conversation about the requirements on filesystems in the writeback
> > path.  We currently tell filesystem authors that the entire writeback
> > path must avoid allocating memory in order to prevent deadlock (or use
> > GFP_MEMALLOC).  Is this appropriate?  It's a lot of work to assure that
> > writing pagecache back will not allocate memory in, eg, the network stack,
> > the device driver, and any other layers the write must traverse.
> > 
> > With the removal of ->writepage from vmscan, perhaps we can make
> > filesystem authors lives easier by relaxing this requirement as pagecache
> > should be cleaned long before we get to reclaiming it.
> > 
> > I don't think there's anything to be done about swapping anon memory.
> > We probably don't want to proactively write anon memory to swap, so by
> > the time we're in ->swap_rw we really are low on memory.
> > 
> > 
> 
> While we are considering revising mm rules, I would really like to
> revised the rule that GFP_KERNEL allocations are allowed to fail.
> I'm not at all sure that they ever do (except for large allocations - so
> maybe we could leave that exception in - or warn if large allocations
> are tried without a MAY_FAIL flag).
> 
> Given that GFP_KERNEL can wait, and that the mm can kill off processes
> and clear cache to free memory, there should be no case where failure is
> needed or when simply waiting will eventually result in success.  And if
> there is, the machine is a gonner anyway.
> 
> Once upon a time user-space pages could not be ripped out of a process
> by the oom killer until the process actually exited, and that meant that
> GFP_KERNEL allocations of a process being oom killed should not block
> indefinitely in the allocator.  I *think* that isn't the case any more.
> 
> Insisting that GFP_KERNEL allocations never returned NULL would allow us
> to remove a lot of untested error handling code....

If memcg ever gets enabled for all kernel side allocations we might
start seeing failures of GFP_KERNEL allocations.

I've got better fault injection code coming, I'll be posting it right
after memory allocation profiling gets merged - that'll help with the
testing situation.

The big blocker on enabling memcg for all kernel allocations is
performance overhead, but I hear that's getting worked on as well.

We'd probably want to add a gfp flag to annotate which allocations we
want to fail because of memcg, though...