Re: [PATCH v3 0/4] mm: clarify nofail memory allocation

Vlastimil Babka <vbabka@xxxxxxx> · Tue, 27 Aug 2024 09:38:50 +0200

On 8/27/24 09:15, Barry Song wrote:
> On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>
>> On 8/22/24 11:34, Linus Torvalds wrote:
>> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@xxxxxxxxxx> wrote:
>> >>
>> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> >> I don't care. It could effectively happen with MAX_ORDER as well, as
>> >> stated. But certainly not BUG_ON.
>> >
>> > No BUG_ON(), but also no endless loop.
>> >
>> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
>> > make it easy to find offenders, and then let them deal with it.
>>
>> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
>> we're about to actually return NULL, so the memory has to be depleted
>> already. To make it easier to find the offenders much more reliably, we
>> should consider doing it sooner, but also not add unnecessary overhead to
>> allocator fastpaths just because of the potentially buggy users. So either
>> always in __alloc_pages_slowpath(), which should be often enough (unless the
>> system never needs to wake up kswapd to reclaim) but with negligible enough
>> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
> 
> We already have a WARN_ON for order > 1 in rmqueue. we might extend
> the condition there to include checking flags as well?

Ugh, I wasn't aware of that, well spotted. So at least there shouldn't be
any existing users of __GFP_NOFAIL with order > 1 :)

But the check is also in the hot path, even before we try the pcplists, so
we could move it to __alloc_pages_slowpath() while extending it?
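
To make the extended condition concrete, here is a rough userspace model of
the predicate only. It is not kernel code: the MODEL_GFP_* values are
placeholders (the real gfp bits differ) and nofail_misuse() is a hypothetical
name used just for this illustration. In the kernel, the result would feed a
WARN_ON_ONCE(), wherever the check ends up living.

```c
#include <stdbool.h>

/*
 * Placeholder bit values for this model only. They are assumptions and
 * do not match the kernel's real __GFP_DIRECT_RECLAIM / __GFP_NOFAIL bits.
 */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_NOFAIL         0x2u

/*
 * Hypothetical helper: returns true when a nofail request is one we do
 * not want to support, i.e. it asks for more than an order-1 allocation
 * or it does not allow direct reclaim.
 */
bool nofail_misuse(unsigned int gfp_flags, unsigned int order)
{
	/* Requests without the nofail bit are never flagged by this check. */
	if (!(gfp_flags & MODEL_GFP_NOFAIL))
		return false;

	return order > 1 || !(gfp_flags & MODEL_GFP_DIRECT_RECLAIM);
}
```

With this model, a nofail request that allows direct reclaim is accepted at
order 0 and order 1 and flagged from order 2 up, while a nofail request
without direct reclaim is flagged at any order.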

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7dcb0713eb57..b5717c6569f9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
>   /*
>   * We most definitely don't want callers attempting to
>   * allocate greater than order-1 page units with __GFP_NOFAIL.
> + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> + * which can result in a lockup
>   */
> - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
> +     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));
> 
>   if (likely(pcp_allowed_order(order))) {
>   page = rmqueue_pcplist(preferred_zone, zone, order,
> 
>>
>> > Don't take it upon yourself to say "we have to deal with any amount of
>> > stupidity".
>> >
>> > The MM layer is not some slave to users. The MM layer is one of the
>> > most core pieces of code in the kernel, and as such the MM layer is
>> > damn well in charge.
>> >
>> > Nobody has the right to say "I will not deal with allocation
>> > failures". The MM should not bend over backwards over something like
>> > that.
>> >
>> > Seriously. Get a spine already, people. Tell random drivers that claim
>> > that they cannot deal with errors to just f-ck off.
>> >
>> > And you don't do it by looping forever, and you don't do it by killing
>> > the kernel. You do it by ignoring their bullying tactics.
>> >
>> > Then you document the *LIMITED* cases where you actually will try forever.
>> >
>> > This discussion has gone on for too damn long.
>> >
>> >               Linus
>>