Re: [PATCH 3/5] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails

Barry Song <21cnbao@xxxxxxxxx> · Wed, 24 Jul 2024 22:11:47 +1200

On Wed, Jul 24, 2024 at 10:03 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 7/24/24 10:55 AM, Barry Song wrote:
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > We have cases we still fail though callers might have __GFP_NOFAIL.
> > Since they don't check the return, we are exposed to the security
> > risks for NULL deference.
> >
> > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> > situation.
> >
> > Christoph Hellwig:
> > The whole freaking point of __GFP_NOFAIL is that callers don't handle
> > allocation failures.  So in fact a straight BUG is the right thing
> > here.
> >
> > Vlastimil Babka:
> > It's just not a recoverable situation (WARN_ON is for recoverable
> > situations). The caller cannot handle allocation failure and at the same
> > time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > with stracktrace etc. We don't need to hope for the later NULL pointer
> > dereference (which might if really unlucky happen from a different
> > context where it's no longer obvious what lead to the allocation failing).
>
> Note that quote was meant specifically for the "too large" allocation, which
> is truly impossible. That includes the kvmalloc_array() overflow, order >
> MAX_ORDER etc.

I equally quote this for two cases because non-sleepable is also returning NULL,
in this means, they are currently facing the same problems.

>
> The "can't sleep/reclaim" is a bit more nuanced as there's the alternative
> in just warning and looping and hoping kswapd or some other direct reclaimer
> saves the day. If yes, great, we have a system that still works and a
> warning to repor. If no, there's still a warning, but later soft/hardlockup
> hits. These might be eventually worse than an immediate BUG_ON so it's not a
> clear cut. At least I think these cases should be handled in two different
> patches and not together.

But I fully agree these two can be separated and judged separately.

After more thinking, I am concerned that this issue might be difficult to be
rescued, as the misuse of GFP_ATOMIC | __GFP_NOFAIL typically occurs
in atomic contexts with strict time requirements. Even if some other
components release memory to satisfy the one busy-looping to obtain
memory, it might already be too late?

>
> > Michal Hocko:
> > Linus tends to be against adding new BUG() calls unless the failure is
> > absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> > not sure how he would look at simply incorrect memory allocator usage to
> > blow up the kernel. Now the argument could be made that those failures
> > could cause subtle memory corruptions or even be exploitable which might
> > be a sufficient reason to stop them early.
> >
> > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > Cc: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > Cc: Lorenzo Stoakes <lstoakes@xxxxxxxxx>
> > Cc: Christoph Lameter <cl@xxxxxxxxx>
> > Cc: Pekka Enberg <penberg@xxxxxxxxxx>
> > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > Cc: Vlastimil Babka <vbabka@xxxxxxx>
> > Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> > Cc: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Kees Cook <kees@xxxxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  include/linux/slab.h |  4 +++-
> >  mm/page_alloc.c      | 10 +++++-----
> >  mm/util.c            |  1 +
> >  3 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c9cb42203183..4a4d1fdc2afe 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 45d2f41b4783..4d6af00fccd4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >        */
> >       if (gfp_mask & __GFP_NOFAIL) {
> >               /*
> > -              * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -              * of any new users that actually require GFP_NOWAIT
> > +              * All existing users of the __GFP_NOFAIL are blockable
> > +              * otherwise we introduce a busy loop with inside the page
> > +              * allocator from non-sleepable contexts
> >                */
> > -             if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                     goto fail;
> > +             BUG_ON(!can_direct_reclaim);
> >
> >               /*
> >                * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >               cond_resched();
> >               goto retry;
> >       }
> > -fail:
> > +
> >       warn_alloc(gfp_mask, ac->nodemask,
> >                       "page allocation failure: order:%u", order);
> >  got_pg:
> > diff --git a/mm/util.c b/mm/util.c
> > index 0ff5898cc6de..a1be50c243f1 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >       /* Don't even allow crazy sizes */
> >       if (unlikely(size > INT_MAX)) {
> >               WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> >       }
> >
>