Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL

Barry Song <21cnbao@xxxxxxxxx> · Sun, 18 Aug 2024 19:25:40 +1200

On Sun, Aug 18, 2024 at 7:07 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > > >
> > > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@xxxxxxxx>
> > > > > > >
> > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > choices:
> > > > > > >
> > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > >
> > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > > design of __GFP_NOFAIL itself.
> > > > >
> > > > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > > > the return value. without direct_reclamation, it is sometimes impossible.
> > > > > but with direct reclamation, users constantly wait and finally they can
> > > >
> > > > So, what exactly is the difference between 'constantly waiting' and
> > > > 'busy looping'? Could you please clarify? Also, why can't we
> > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
> > >
> > > I list two options in changelog
> > > 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> > > better than the existing code. but returning null in the current code
> > > is definitely wrong.
> > >
> > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> > > legal. so it is a bit suspicious going in the wrong direction.
> > >
> > > busy-loop is that you are not reclaiming memory you are not sleeping.
> > > cpu is constantly working and busy, so it might result in a lockup, either
> > > soft lockup or hard lockup.
>
> Thanks for the clarification.
> That can be avoided by a simple cond_resched() if the hard lockup or
> softlockup is the main issue ;)

Is your point to legitimize the combination of gfp_nofail and non-lockable?
I don't think you can simply fix the lockup. A direct example is
single-core, and preemptible kernel. you are calling __gfp_nofail without
direct reclamation in a spinlock,  cond_resched() will do nothing.

it doesn't have to be a single core, for example on phones, cpu hotplug is
used to save power. Sometimes, phones unplug lots of cpus to save
power. and even we are multiple cores, we have cpuset cgroups things
which can prevent async oom from running on other cores.

>
> > >
> > > with direct_reclamation, wait is the case you can sleep. it is not holding
> > > cpu, not a busy loop. in rare case, users might end in endless wait,
> > > but it matches the doc of __GFP_NOFAIL, never return till memory
> > > is gotten (the current code is implemented in this way unless users
> > > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.)
> > >
> >
> > and the essential difference between "w/ and w/o direct_reclaim": with
> > direct reclaim, the user is actively reclaiming memory to rescue itself
> > by all kinds of possible ways(compact, oom, reclamation), while without
> > direct reclamation, it can do nothing and just loop (busy-loop).
>
> It can wake up kswapd, which can then reclaim memory. If kswapd can't
> keep up, the system is likely under heavy memory pressure. In such a
> case, it makes little difference whether __GFP_DIRECT_RECLAIM is set
> or not. For reference, see the old issue:
> https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@xxxxxxx/.
>
> I believe the core issue persists, and the design of __GFP_NOFAIL
> exacerbates it.
>
> By the way, I believe we could trigger an asynchronous OOM kill in the
> case without direct reclaim to avoid busy looping.

This async oom is sometimes impossible with the same reason as
the above example.

>
> >
> > > note, long-term we won't expose __GFP_NOFAIL any more. we
> > > will only expose GFP_NOFAIL which enforces Blockable.  I am
> > > quite busy on other issues, so this won't happen in a short time.
> > >
> > > >
> > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > > > it is absolutely clearly documented.
> > > > >
> > > > > >
> > > > > > > which might not
> > > > > > >     be well supported by some architectures.
> > > > > > >
> > > > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > > > >     exposing NULL dereference.
> > > > > > >
> > > > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > > > This patch selects the second option because, with the introduction of
> > > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > > > no longer be possible.
> > > > > > >
> > > > > > > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > > > > > > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > > > > > > Cc: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > > > > > > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > > > > > > Cc: Christoph Lameter <cl@xxxxxxxxx>
> > > > > > > Cc: Pekka Enberg <penberg@xxxxxxxxxx>
> > > > > > > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > > > > > > Cc: Vlastimil Babka <vbabka@xxxxxxx>
> > > > > > > Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> > > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> > > > > > > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > > > > > > Cc: Kees Cook <kees@xxxxxxxxxx>
> > > > > > > Cc: "Eugenio Pérez" <eperezma@xxxxxxxxxx>
> > > > > > > Cc: Hailong.Liu <hailong.liu@xxxxxxxx>
> > > > > > > Cc: Jason Wang <jasowang@xxxxxxxxxx>
> > > > > > > Cc: Maxime Coquelin <maxime.coquelin@xxxxxxxxxx>
> > > > > > > Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> > > > > > > Cc: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx>
> > > > > > > ---
> > > > > > >  mm/page_alloc.c | 10 +++++-----
> > > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > > > --- a/mm/page_alloc.c
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > > >          */
> > > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > > >                 /*
> > > > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > > > +                * allocator from non-sleepable contexts
> > > > > > >                  */
> > > > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > > -                       goto fail;
> > > > > > > +               BUG_ON(!can_direct_reclaim);
> > > > > >
> > > > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > > > handle the return value of __GFP_NOFAIL.
> > > > > >
> > > > >
> > > > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > > > if you check the ret, don't use __GFP_NOFAIL.
> > > >
> > > > If so, you have many code changes to make in the linux kernel ;)
> > > >
> > >
> > > Please list those code using __GFP_NOFAIL and check the result
> > > might fail, we should get them fixed. This is insane. NOFAIL means
> > > no fail.
>
> You can find some instances with grep commands, but there's no
> reliable way to capture them all with a single command. Here are a few
> examples:
>
>                 // drivers/infiniband/hw/cxgb4/mem.c
>                 skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL);
>                 if (!skb)
>                         return -ENOMEM;
>

Thanks for pointing out this wrong code.

This is absolutely wrong. with GFP_KERNEL | __GFP_NOFAIL,
mm-core is doing endless retry till direct reclamation gives
memory to itself. there is no possibility to return NULL.

the only possibility is __GFP_NOFAIL | GFP_ATOMIC(NOWAIT),
mm-core is returning NULL incorrectly. actually mm-core just ignores
this __GFP_NOFAIL in this case. with GFP_FAIL enforcing
direct_reclamation done, the wrong case __GFP_NOFAIL |
GFP_ATOMIC(NOWAIT) will no longer exist.

>         // fs/xfs/libxfs/xfs_dir2.c
>         args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL);
>         if (!args)
>                 return -ENOMEM;
>
>
> --
> Regards
> Yafang