On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@xxxxxxxxx> wrote: > > On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@xxxxxxxxx> wrote: > > > > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@xxxxxxxxx> wrote: > > > > > > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@xxxxxxxxx> wrote: > > > > > > > > > > > > From: Barry Song <v-songbaohua@xxxxxxxx> > > > > > > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc. This kind of > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless. If we > > > > > > attempt and still fail to allocate memory for these users, we have two > > > > > > choices: > > > > > > > > > > > > 1. We could busy-loop and hope that some other direct reclamation or > > > > > > kswapd rescues the current process. However, this is unreliable > > > > > > and could ultimately lead to hard or soft lockups, > > > > > > > > > > That can occur even if we set both __GFP_NOFAIL and > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed > > > > > design of __GFP_NOFAIL itself. > > > > > > > > the point of GFP_NOFAIL is that it won't fail and its user won't check > > > > the return value. without direct_reclamation, it is sometimes impossible. > > > > but with direct reclamation, users constantly wait and finally they can > > > > > > So, what exactly is the difference between 'constantly waiting' and > > > 'busy looping'? Could you please clarify? Also, why can't we > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set? > > > > I list two options in changelog > > 1: busy loop 2. bug_on. I am actually fine with either one. either one is > > better than the existing code. but returning null in the current code > > is definitely wrong. > > > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation > > legal. so it is a bit suspicious going in the wrong direction. > > > > busy-loop is that you are not reclaiming memory you are not sleeping. > > cpu is constantly working and busy, so it might result in a lockup, either > > soft lockup or hard lockup. Thanks for the clarification. That can be avoided by a simple cond_resched() if the hard lockup or softlockup is the main issue ;) > > > > with direct_reclamation, wait is the case you can sleep. it is not holding > > cpu, not a busy loop. in rare case, users might end in endless wait, > > but it matches the doc of __GFP_NOFAIL, never return till memory > > is gotten (the current code is implemented in this way unless users > > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.) > > > > and the essential difference between "w/ and w/o direct_reclaim": with > direct reclaim, the user is actively reclaiming memory to rescue itself > by all kinds of possible ways(compact, oom, reclamation), while without > direct reclamation, it can do nothing and just loop (busy-loop). It can wake up kswapd, which can then reclaim memory. If kswapd can't keep up, the system is likely under heavy memory pressure. In such a case, it makes little difference whether __GFP_DIRECT_RECLAIM is set or not. For reference, see the old issue: https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@xxxxxxx/. I believe the core issue persists, and the design of __GFP_NOFAIL exacerbates it. By the way, I believe we could trigger an asynchronous OOM kill in the case without direct reclaim to avoid busy looping. > > > note, long-term we won't expose __GFP_NOFAIL any more. we > > will only expose GFP_NOFAIL which enforces Blockable. I am > > quite busy on other issues, so this won't happen in a short time. > > > > > > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it. > > > > it is absolutely clearly documented. > > > > > > > > > > > > > > > which might not > > > > > > be well supported by some architectures. > > > > > > > > > > > > 2. We could use BUG_ON to trigger a reliable system crash, avoiding > > > > > > exposing NULL dereference. > > > > > > > > > > > > Neither option is ideal, but both are improvements over the existing code. > > > > > > This patch selects the second option because, with the introduction of > > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for > > > > > > nofail users(which is in my plan), non-blockable nofail allocations will > > > > > > no longer be possible. > > > > > > > > > > > > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx> > > > > > > Cc: Michal Hocko <mhocko@xxxxxxxx> > > > > > > Cc: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx> > > > > > > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > > > > Cc: Christoph Lameter <cl@xxxxxxxxx> > > > > > > Cc: Pekka Enberg <penberg@xxxxxxxxxx> > > > > > > Cc: David Rientjes <rientjes@xxxxxxxxxx> > > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> > > > > > > Cc: Vlastimil Babka <vbabka@xxxxxxx> > > > > > > Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx> > > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> > > > > > > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > > > > > > Cc: Kees Cook <kees@xxxxxxxxxx> > > > > > > Cc: "Eugenio Pérez" <eperezma@xxxxxxxxxx> > > > > > > Cc: Hailong.Liu <hailong.liu@xxxxxxxx> > > > > > > Cc: Jason Wang <jasowang@xxxxxxxxxx> > > > > > > Cc: Maxime Coquelin <maxime.coquelin@xxxxxxxxxx> > > > > > > Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx> > > > > > > Cc: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> > > > > > > --- > > > > > > mm/page_alloc.c | 10 +++++----- > > > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644 > > > > > > --- a/mm/page_alloc.c > > > > > > +++ b/mm/page_alloc.c > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > > > > > */ > > > > > > if (gfp_mask & __GFP_NOFAIL) { > > > > > > /* > > > > > > - * All existing users of the __GFP_NOFAIL are blockable, so warn > > > > > > - * of any new users that actually require GFP_NOWAIT > > > > > > + * All existing users of the __GFP_NOFAIL are blockable > > > > > > + * otherwise we introduce a busy loop with inside the page > > > > > > + * allocator from non-sleepable contexts > > > > > > */ > > > > > > - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) > > > > > > - goto fail; > > > > > > + BUG_ON(!can_direct_reclaim); > > > > > > > > > > I'm not in favor of using BUG_ON() here, as many call sites already > > > > > handle the return value of __GFP_NOFAIL. > > > > > > > > > > > > > it is not correct to handle the return value of __GFP_NOFAIL. > > > > if you check the ret, don't use __GFP_NOFAIL. > > > > > > If so, you have many code changes to make in the linux kernel ;) > > > > > > > Please list those code using __GFP_NOFAIL and check the result > > might fail, we should get them fixed. This is insane. NOFAIL means > > no fail. You can find some instances with grep commands, but there's no reliable way to capture them all with a single command. Here are a few examples: // drivers/infiniband/hw/cxgb4/mem.c skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL); if (!skb) return -ENOMEM; // fs/xfs/libxfs/xfs_dir2.c args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL); if (!args) return -ENOMEM; -- Regards Yafang