On Wed, 24 Nov 2021, Andrew Morton wrote:
> On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@xxxxxxx> wrote:
>
> > On Wed, 24 Nov 2021, Andrew Morton wrote:
> > >
> > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > > sites were doing open-coded try-forever loops.  I thought "hey, they
> > > shouldn't be doing that in the first place, but let's at least
> > > centralize the concept to reduce code size, code duplication and so
> > > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > > sites should be reworked to no longer need to do the retry-forever
> > > thing.  In retrospect, this bright idea of mine seems to have added
> > > license for more sites to use retry-forever.  Sigh.
> >
> > One of the costs of not having GFP_NOFAIL (or similar) is lots of
> > untested failure-path code.
>
> Well that's bad of the relevant developers and testers!  It isn't that
> hard to fake up allocation failures.  Either with the formal fault
> injection framework or with ad-hackery.

If we have CONFIG_RANDOM_ALLOC_PAGE_FAILURE then I might agree.
lockdep is AWESOME.  It makes testing for deadlock problems *so* easy.
That is the level of ease-of-use we need if you want people to handle
page-alloc failures reliably.

> > When does an allocation that is allowed to retry and reclaim ever fail
> > anyway?  I think the answer is "only when it has been killed by the oom
> > killer".  That of course cannot happen to kernel threads, so maybe
> > kernel threads should never need GFP_NOFAIL??
> >
> > I'm not sure the above is 100%, but I do think that is the sort of
> > semantic that we want.  We want to know what kmalloc failure *means*.
> > We also need well defined and documented strategies to handle it.
> > mempools are one such strategy, but not always suitable.
>
> Well, mempools aren't "handling" it.  They're just another layer to
> make memory allocation attempts appear to be magical.  The preferred
> answer is "just handle the damned error and return ENOMEM".

No.  mempools are EXACTLY handling it - in a specific context.
When writing out dirty pages so as to free up memory, you often need to
allocate memory.  And you may need a sequence of allocations, typically
at different levels in the stack.  Global water-marks cannot help
reliably, as the reserve might all be used up by top-level requests and
the lower levels can starve indefinitely.  mempools ensure that when
memory is freed, it is returned to the same level it was allocated for.
That ensures you can get at least one request at a time all the way out
to storage.  If swap-out just returned ENOMEM, you'd really be in
trouble.
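As a very rough sketch of that pattern with the existing mempool API
(the wb_request structure and the wb_* helpers below are invented
purely for illustration; only the mempool and kmem_cache calls are real
interfaces):

/*
 * Illustrative only: "wb_request" and the wb_* functions are made-up
 * names; the mempool calls are the existing API.
 */
#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/mm.h>

struct wb_request {
	struct page *page;
	/* ... whatever per-request state the writeout path needs ... */
};

#define WB_MIN_REQS 4	/* reserved elements that guarantee forward progress */

static struct kmem_cache *wb_req_cache;
static mempool_t *wb_req_pool;

static int __init wb_init(void)
{
	wb_req_cache = KMEM_CACHE(wb_request, 0);
	if (!wb_req_cache)
		return -ENOMEM;
	/* The reserve is set aside here, at init time, not during reclaim. */
	wb_req_pool = mempool_create_slab_pool(WB_MIN_REQS, wb_req_cache);
	if (!wb_req_pool) {
		kmem_cache_destroy(wb_req_cache);
		return -ENOMEM;
	}
	return 0;
}

static void wb_submit(struct page *page)
{
	/*
	 * GFP_NOIO because we are on the reclaim/writeout path.  As the
	 * gfp mask allows sleeping, mempool_alloc() waits for an element
	 * to be returned rather than failing.
	 */
	struct wb_request *req = mempool_alloc(wb_req_pool, GFP_NOIO);

	req->page = page;
	/* ... submit the I/O; the completion handler calls wb_end_io() ... */
}

static void wb_end_io(struct wb_request *req)
{
	/* Freeing returns the element to the level it was allocated for. */
	mempool_free(req, wb_req_pool);
}

The WB_MIN_REQS elements are reserved once up front, so even when the
page allocator has nothing to give, one request at a time can still
make it all the way out to storage.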
> Obviously this gets very painful at times (arguably because of
> high-level design shortcomings).  The old radix_tree_preload approach
> was bulletproof, but was quite a lot of fuss.

It would get particularly painful if some system call started returning
-ENOMEM, which had never returned that before.
I note that ext4 uses __GFP_NOFAIL when handling truncate.  I don't
think user-space would be happy with ENOMEM from truncate (or
fallocate(PUNCH_HOLE)), though a recent commit which adds it focuses
more on wanting to avoid the need for fsck.

> > preallocating can also be useful but can be clumsy to implement.  Maybe
> > we should support a process preallocating a bunch of pages which can
> > only be used by the process - and are auto-freed when the process
> > returns to user-space.  That might allow the "error paths" to be simple
> > and early, and subsequent allocations that were GFP_USEPREALLOC would be
> > safe.
>
> Yes, I think something like that would be quite feasible.  Need to
> prevent interrupt code from stealing the task's page store.
>
> It can be a drag calculating (by hand) what the max possible amount of
> allocation will be and one can end up allocating and then not using a
> lot of memory.

CONFIG_DEBUG_PREALLOC could force every GFP_USE_PREALLOC request to use
a different page, and warn if there weren't enough.  Not perfect, but
useful.
Concerning the "lot of memory", having prealloc as an easy option means
people can use it until it becomes too expensive, then find a better
solution.  There will likely always be a better solution, but it might
not often be worth the effort.

> I forget why radix_tree_preload used a cpu-local store rather than a
> per-task one.
>
> Plus "what order pages would you like" and "on which node" and "in
> which zone", etc...

"what order" - only order-0 I hope.  I'd hazard a guess that 90% of
current NOFAIL allocations only need one page (providing slub is used -
slab seems to insist on high-order pages sometimes).
"which node" - whichever.  Unless __GFP_HARDWALL is set, alloc_page()
will fall back to "whichever" anyway, and NOFAIL with HARDWALL is
probably a poor choice.
"which zone" - NORMAL.  I cannot find any NOFAIL allocations that want
DMA.  fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
*require* highmem.

Of course, before designing this interface too precisely we should
check if anyone can use it.  From a quick look through some of the
100-ish users of __GFP_NOFAIL I'd guess that mempools would help - the
preallocation should happen at init-time, not request-time.  Maybe if
we made mempools even more lightweight .... though that risks
allocating a lot of memory that will never get used.

This brings me back to the idea that alloc_page(wait and reclaim
allowed) should only fail on OOM_KILL.  That way kernel threads are
safe, and user-threads are free to return ENOMEM knowing it won't get
to user-space.  If user-thread code really needs NOFAIL, it punts to a
workqueue and waits - aborting the wait if it is killed, while the work
item still runs eventually.

NeilBrown
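For illustration, the punt-to-workqueue pattern in the last paragraph
might look roughly like this, assuming the proposed semantic that an
allocation made from workqueue context (which cannot be OOM-killed)
eventually succeeds; every name below is made up rather than an
existing interface:

/*
 * Purely illustrative sketch: one reference is held by the waiter and
 * one by the work item, so whichever drops the last reference cleans up.
 */
#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/gfp.h>

struct punt_alloc {
	struct work_struct work;
	struct completion done;
	struct page *page;
	refcount_t ref;		/* one for the waiter, one for the worker */
};

static void punt_alloc_put(struct punt_alloc *pa)
{
	if (refcount_dec_and_test(&pa->ref)) {
		if (pa->page)		/* nobody claimed the page */
			__free_page(pa->page);
		kfree(pa);
	}
}

static void punt_alloc_fn(struct work_struct *work)
{
	struct punt_alloc *pa = container_of(work, struct punt_alloc, work);

	/* Workqueue context: per the proposal this eventually succeeds. */
	pa->page = alloc_page(GFP_KERNEL);
	complete(&pa->done);
	punt_alloc_put(pa);
}

/* Called from user-thread context that "really needs NOFAIL". */
static struct page *alloc_page_punted(void)
{
	struct punt_alloc *pa = kzalloc(sizeof(*pa), GFP_KERNEL);
	struct page *page = NULL;

	if (!pa)
		return NULL;
	INIT_WORK(&pa->work, punt_alloc_fn);
	init_completion(&pa->done);
	refcount_set(&pa->ref, 2);
	schedule_work(&pa->work);

	if (wait_for_completion_killable(&pa->done) == 0) {
		page = pa->page;	/* claim the result */
		pa->page = NULL;
	}
	/*
	 * If we were killed we simply stop waiting; the work item still
	 * runs eventually, and the final reference drop frees the
	 * unwanted page and the context.
	 */
	punt_alloc_put(pa);
	return page;
}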