On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> On Thu, 29 Feb 2024, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:19:47PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 8:56 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Recent discussions [1] suggest that greater mutual understanding
> > > > between memory reclaim on the one hand and RCU on the other might
> > > > be in order.
> > > >
> > > > One possibility would be an open discussion.  If it would help, I
> > > > would be happy to describe how RCU reacts and responds to heavy
> > > > load, along with some ways that RCU's reactions and responses
> > > > could be enhanced if needed.
> > > >
> > >
> > > Adding fsdevel as this should probably be a cross track session.
> >
> > Perhaps broaden this slightly.  On the THP Cabal call we just had a
> > conversation about the requirements on filesystems in the writeback
> > path.  We currently tell filesystem authors that the entire writeback
> > path must avoid allocating memory in order to prevent deadlock (or use
> > GFP_MEMALLOC).  Is this appropriate?  It's a lot of work to assure that
> > writing pagecache back will not allocate memory in, eg, the network
> > stack, the device driver, and any other layers the write must traverse.
> >
> > With the removal of ->writepage from vmscan, perhaps we can make
> > filesystem authors' lives easier by relaxing this requirement, as
> > pagecache should be cleaned long before we get to reclaiming it.
> >
> > I don't think there's anything to be done about swapping anon memory.
> > We probably don't want to proactively write anon memory to swap, so by
> > the time we're in ->swap_rw we really are low on memory.
> >
>
> While we are considering revising mm rules, I would really like to
> revise the rule that GFP_KERNEL allocations are allowed to fail.
> I'm not at all sure that they ever do (except for large allocations - so
> maybe we could leave that exception in - or warn if large allocations
> are tried without a MAY_FAIL flag).
>
> Given that GFP_KERNEL can wait, and that the mm can kill off processes
> and clear cache to free memory, there should be no case where failure is
> needed or when simply waiting will eventually result in success.  And if
> there is, the machine is a gonner anyway.

Yes, please!

XFS was designed and implemented on an OS that gave this exact
guarantee for kernel allocations back in the early 1990s. Memory
allocation simply blocked until it succeeded unless the caller
indicated it could handle failure. That's what __GFP_NOFAIL does,
and XFS is still heavily dependent on this behaviour.

And before people scream "but that was 30 years ago, Unix OS code
was much simpler", consider that Irix supported machines with
hundreds of NUMA nodes, thousands of CPUs, terabytes of memory and
petabytes of storage. It had variable size high order pages in the
page cache (something we've only just got with folios!), page
migration, page compaction, memory and process locality control,
filesystem block sizes larger than page size (which we don't have
yet!), memory shrinkers for subsystem cache reclaim, page cache
dirty throttling to sustained writeback IO rates, etc.

Lots of the mm technology from that OS has been re-implemented in
Linux over the past two decades, but in several important ways Linux
still falls shy of the bar that Irix set a couple of decades ago.
One of those is the kernel memory allocation guarantee.
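
To make that concrete, here's a minimal sketch of what the guarantee
means for callers. It's illustrative only - not real XFS code, the
structure and function names are made up:

#include <linux/list.h>
#include <linux/slab.h>

struct log_item {
	struct list_head	li_list;
	unsigned int		li_flags;
};

/*
 * Without a no-fail guarantee, every caller grows an unwind path
 * that is almost never exercised in normal testing.
 */
static struct log_item *log_item_alloc_mayfail(void)
{
	struct log_item *lip;

	lip = kzalloc(sizeof(*lip), GFP_KERNEL);
	if (!lip)
		return NULL;	/* rarely travelled error path */
	INIT_LIST_HEAD(&lip->li_list);
	return lip;
}

/*
 * With __GFP_NOFAIL the allocator keeps retrying (reclaiming,
 * waiting) until the allocation succeeds, so the caller never sees
 * NULL and the failure handling disappears entirely.
 */
static struct log_item *log_item_alloc_nofail(void)
{
	struct log_item *lip;

	lip = kzalloc(sizeof(*lip), GFP_KERNEL | __GFP_NOFAIL);
	INIT_LIST_HEAD(&lip->li_list);
	return lip;
}

The second form is effectively what XFS has relied on since the Irix
days; the argument here is simply to make that the default semantics
for small GFP_KERNEL allocations rather than an opt-in flag.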
> Once upon a time user-space pages could not be ripped out of a process
> by the oom killer until the process actually exited, and that meant that
> GFP_KERNEL allocations of a process being oom killed should not block
> indefinitely in the allocator.  I *think* that isn't the case any more.
>
> Insisting that GFP_KERNEL allocations never returned NULL would allow us
> to remove a lot of untested error handling code....

This is the sort of thing I was thinking of in the "remove GFP_NOFS"
discussion thread when I said this to Kent:

"We need to start designing our code in a way that doesn't require
extensive testing to validate it as correct. If the only way to
validate new code is correct is via stochastic coverage via error
injection, then that is a clear sign we've made poor design choices
along the way."

https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@xxxxxxxxxxxxxxxxxxx/

If memory allocation doesn't fail by default, then we can remove the
vast majority of allocation error handling from the kernel. Make the
common case just work - remove the need for all that code to handle
failures that are hard to exercise reliably and so are rarely tested.

A simple change to make long-standing behaviour an actual policy we
can rely on means we can remove both code and test matrix overhead -
it's a win-win IMO.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx