Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Reclamation interactions with RCU

"NeilBrown" <neilb@xxxxxxx> · Mon, 04 Mar 2024 09:45:48 +1100

On Sat, 02 Mar 2024, Kent Overstreet wrote:
> On Sat, Mar 02, 2024 at 10:47:59AM +1100, NeilBrown wrote:
> > On Sat, 02 Mar 2024, Kent Overstreet wrote:
> > > On Fri, Mar 01, 2024 at 04:54:55PM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 01, 2024 at 01:16:18PM +1100, NeilBrown wrote:
> > > > > While we are considering revising mm rules, I would really like to
> > > > > revised the rule that GFP_KERNEL allocations are allowed to fail.
> > > > > I'm not at all sure that they ever do (except for large allocations - so
> > > > > maybe we could leave that exception in - or warn if large allocations
> > > > > are tried without a MAY_FAIL flag).
> > > > > 
> > > > > Given that GFP_KERNEL can wait, and that the mm can kill off processes
> > > > > and clear cache to free memory, there should be no case where failure is
> > > > > needed or when simply waiting will eventually result in success.  And if
> > > > > there is, the machine is a gonner anyway.
> > > > 
> > > > Yes, please!
> > > > 
> > > > XFS was designed and implemented on an OS that gave this exact
> > > > guarantee for kernel allocations back in the early 1990s.  Memory
> > > > allocation simply blocked until it succeeded unless the caller
> > > > indicated they could handle failure. That's what __GFP_NOFAIL does
> > > > and XFS is still heavily dependent on this behaviour.
> > > 
> > > I'm not saying we should get rid of __GFP_NOFAIL - actually, I'd say
> > > let's remove the underscores and get rid of the silly two page limit.
> > > GFP_NOFAIL|GFP_KERNEL is perfectly safe for larger allocations, as long
> > > as you don't mind possibly waiting a bit.
> > > 
> > > But it can't be the default because, like I mentioned to Neal, there are
> > > a _lot_ of different places where we allocate memory in the kernel, and
> > > they have to be able to fail instead of shoving everything else out of
> > > memory.
> > > 
> > > > This is the sort of thing I was thinking of in the "remove
> > > > GFP_NOFS" discussion thread when I said this to Kent:
> > > > 
> > > > 	"We need to start designing our code in a way that doesn't require
> > > > 	extensive testing to validate it as correct. If the only way to
> > > > 	validate new code is correct is via stochastic coverage via error
> > > > 	injection, then that is a clear sign we've made poor design choices
> > > > 	along the way."
> > > > 
> > > > https://lore.kernel.org/linux-fsdevel/ZcqWh3OyMGjEsdPz@xxxxxxxxxxxxxxxxxxx/
> > > > 
> > > > If memory allocation doesn't fail by default, then we can remove the
> > > > vast majority of allocation error handling from the kernel. Make the
> > > > common case just work - remove the need for all that code to handle
> > > > failures that is hard to exercise reliably and so are rarely tested.
> > > > 
> > > > A simple change to make long standing behaviour an actual policy we
> > > > can rely on means we can remove both code and test matrix overhead -
> > > > it's a win-win IMO.
> > > 
> > > We definitely don't want to make GFP_NOIO/GFP_NOFS allocations nofail by
> > > default - a great many of those allocations have mempools in front of
> > > them to avoid deadlocks, and if you do that you've made the mempools
> > > useless.
> > > 
> > 
> > Not strictly true.  mempool_alloc() adds __GFP_NORETRY so the allocation
> > will certainly fail if that is appropriate.
> 
> *nod* 
> 
> > I suspect that most places where there is a non-error fallback already
> > use NORETRY or RETRY_MAYFAIL or similar.
> 
> NORETRY and RETRY_MAYFAIL actually weren't on my radar, and I don't see
> _tons_ of uses for either of them - more for NORETRY.
> 
> My go-to is NOWAIT in this scenario though; my common pattern is "try
> nonblocking with locks held, then drop locks and retry GFP_KERNEL".
>  
> > But I agree that changing the meaning of GFP_KERNEL has a potential to
> > cause problems.  I support promoting "GFP_NOFAIL" which should work at
> > least up to PAGE_ALLOC_COSTLY_ORDER (8 pages).
> 
> I'd support this change.
> 
> > I'm unsure how it should be have in PF_MEMALLOC_NOFS and
> > PF_MEMALLOC_NOIO context.  I suspect Dave would tell me it should work in
> > these contexts, in which case I'm sure it should.
> > 
> > Maybe we could then deprecate GFP_KERNEL.
> 
> What do you have in mind?

I have in mind a more explicit statement of how much waiting is
acceptable.

GFP_NOFAIL - wait indefinitely
GFP_KILLABLE - wait indefinitely unless fatal signal is pending.
GFP_RETRY - may retry but deadlock, though unlikely, is possible.  So
            don't wait indefinitely.  May abort more quickly if fatal
            signal is pending.
GFP_NO_RETRY - only try things once.  This may sleep, but will give up
            fairly quickly.  Either deadlock is a significant
            possibility, or alternate strategy is fairly cheap.
GFP_ATOMIC - don't sleep - same as current.

I don't see how "GFP_KERNEL" fits into that spectrum.  The definition of
"this will try really hard, but might fail and we can't really tell you
what circumstances it might fail in" isn't fun to work with.

Thanks,
NeilBrown

> 
> Deprecating GFP_NOFS and GFP_NOIO would be wonderful - those should
> really just be PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO, now that we're
> pushing for memalloc_flags_(save|restore) more.
> 
> Getting rid of those would be a really nice cleanup beacuse then gfp
> flags would mostly just be:
>  - the type of memory to allocate (highmem, zeroed, etc.)
>  - how hard to try (don't block at all, block some, block forever)
>