Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU

"NeilBrown" <neilb@xxxxxxx> · Sat, 02 Mar 2024 10:47:59 +1100

On Sat, 02 Mar 2024, Kent Overstreet wrote:
> On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote:
> > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> > > While we are considering revising mm rules, I would really like to
> > > revised the rule that GFP_KERNEL allocations are allowed to fail.
> > > I'm not at all sure that they ever do (except for large allocations - so
> > > maybe we could leave that exception in - or warn if large allocations
> > > are tried without a MAY_FAIL flag).
> > > 
> > > Given that GFP_KERNEL can wait, and that the mm can kill off processes
> > > and clear cache to free memory, there should be no case where failure is
> > > needed or when simply waiting will eventually result in success.  And if
> > > there is, the machine is a gonner anyway.
> > 
> > Yes, please!
> > 
> > XFS was designed and implemented on an OS that gave this exact
> > guarantee for kernel allocations back in the early 1990s.  Memory
> > allocation simply blocked until it succeeded unless the caller
> > indicated they could handle failure. That's what __GFP_NOFAIL does
> > and XFS is still heavily dependent on this behaviour.
> 
> I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say
> let's remove the underscores and get rid of the silly two page limit.
> GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long
> as you don't mind possibly waiting a bit.
> 
> But it can't be the default because, like I mentioned to Neal, there are
> a _lot_ of different places where we allocate memory in the kernel, and
> they have to be able to fail instead of shoving everything else out of
> memory.
> 
> > This is the sort of thing I was thinking of in the "remove
> > GFP_NOFS" discussion thread when I said this to Kent:
> > 
> > 	"We need to start designing our code in a way that doesn't require
> > 	extensive testing to validate it as correct. If the only way to
> > 	validate new code is correct is via stochastic coverage via error
> > 	injection, then that is a clear sign we've made poor design choices
> > 	along the way."
> > 
> > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@xxxxxxxxxxxxxxxxxxx/
> > 
> > If memory allocation doesn't fail by default, then we can remove the
> > vast majority of allocation error handling from the kernel. Make the
> > common case just work - remove the need for all that code to handle
> > failures that is hard to exercise reliably and so are rarely tested.
> > 
> > A simple change to make long standing behaviour an actual policy we
> > can rely on means we can remove both code and test matrix overhead -
> > it's a win-win IMO.
> 
> We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by
> default - a great many of those allocations have mempools in front of
> them to avoid deadlocks, and if you do that you've made the mempools
> useless.
> 

Not strictly true.  mempool_alloc() adds __GFP_NORETRY so the allocation
will certainly fail if that is appropriate.

I suspect that most places where there is a non-error fallback already
use NORETRY or RETRY_MAYFAIL or similar.

But I agree that changing the meaning of GFP_KERNEL has a potential to
cause problems.  I support promoting "GFP_NOFAIL" which should work at
least up to PAGE_ALLOC_COSTLY_ORDER (8 pages).
I'm unsure how it should be have in PF_MEMALLOC_NOFS and
PF_MEMALLOC_NOIO context.  I suspect Dave would tell me it should work in
these contexts, in which case I'm sure it should.

Maybe we could then deprecate GFP_KERNEL.

Thanks,
NeilBrown