On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> > rmqueue_pcplist() will spin_trylock to grab the page from percpu
> > free list. If it fails (due to re-entrancy or list being empty)
> > then rmqueue_bulk()/rmqueue_buddy() will attempt to
> > spin_trylock zone->lock and grab the page from there.
> > spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> > Bailout early in such case.
> >
> > The support for gfpflags_allow_spinning() mode for free_page and memcg
> > comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Acked-by: Michal Hocko <mhocko@xxxxxxxx>
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
>
> Some nits below:
>
> > ---
> >  include/linux/gfp.h | 22 ++++++++++
> >  mm/internal.h       |  1 +
> >  mm/page_alloc.c     | 98 +++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 118 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index b0fe9f62d15b..b41bb6e01781 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> >          return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> >  }
> >
> > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > +{
> > +        /*
> > +         * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
> > +         * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > +         * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > +         * try_alloc_pages() is the only API that doesn't specify either flag.
> > +         *
> > +         * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> > +         * those are guaranteed to never block on a sleeping lock.
> > +         * Here we are enforcing that the allaaction doesn't ever spin
>
> allocation oops.
> > +         * on any locks (i.e. only trylocks). There is no highlevel
> > +         * GFP_$FOO flag for this use in try_alloc_pages() as the
> > +         * regular page allocator doesn't fully support this
> > +         * allocation mode.
> > +         */
> > +        return !(gfp_flags & __GFP_RECLAIM);
> > +}
>
> This function seems unused, guess the following patches will use.
>
> > @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >
> >          might_alloc(gfp_mask);
> >
> > -        if (should_fail_alloc_page(gfp_mask, order))
> > +        if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> > +            should_fail_alloc_page(gfp_mask, order))
>
> Is it because should_fail_alloc_page() might take some lock or whatnot?
> Maybe comment?

This is mainly because of all the printk-s that the fail* logic does.
We've seen way too many syzbot reports caused by printk from an
unsupported context.
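Can add a comment there along these lines, if that helps (wording is only
a sketch, feel free to adjust):

        /*
         * Don't invoke fault injection for ALLOC_TRYLOCK allocations.
         * The fail_page_alloc machinery can printk() and take regular
         * spinlocks, neither of which is safe when the running context
         * is unknown.
         */
        if (!(*alloc_flags & ALLOC_TRYLOCK) &&
            should_fail_alloc_page(gfp_mask, order))
                return false;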
This part of the comment in try_alloc_pages_noprof() covers the printk side:

+        * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+        * to warn. Also warn would trigger printk() which is unsafe from
+        * various contexts. We cannot use printk_deferred_enter() to mitigate,
+        * since the running context is unknown.

And it's also because of get_random_u32() inside the fail* logic, which
grabs a spin_lock. We've seen syzbot reports about get_random()
deadlocking. Fixing printk and fail*/get_random is necessary too, but
that's orthogonal work to these patches.
Here I'm skipping this path in prepare_alloc_pages() to avoid dealing
with more syzbot reports than are already there.
With try_alloc_pages() callable from bpf programs attached to kprobes
inside the printk core logic it would be too easy for syzbot.
And then people would yell at bpf for causing problems, whereas the
root cause is printk being broken. We see this finger pointing all
too often.
>
> >                  return false;
> >
> >          *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> > @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> >  }
> >
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> > +/**
> > + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> > + * @nid - node to allocate from
> > + * @order - allocation order size
> > + *
> > + * Allocates pages of a given order from the given node. This is safe to
> > + * call from any context (from atomic, NMI, and also reentrant
> > + * allocator -> tracepoint -> try_alloc_pages_noprof).
> > + * Allocation is best effort and to be expected to fail easily so nobody should
> > + * rely on the success. Failures are not reported via warn_alloc().
> > + *
> > + * Return: allocated page or NULL on failure.
> > + */
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > +        /*
> > +         * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> > +         * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> > +         * is not safe in arbitrary context.
> > +         *
> > +         * These two are the conditions for gfpflags_allow_spinning() being true.
> > +         *
> > +         * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> > +         * to warn. Also warn would trigger printk() which is unsafe from
> > +         * various contexts. We cannot use printk_deferred_enter() to mitigate,
> > +         * since the running context is unknown.
> > +         *
> > +         * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> > +         * is safe in any context. Also zeroing the page is mandatory for
> > +         * BPF use cases.
> > +         *
> > +         * Though __GFP_NOMEMALLOC is not checked in the code path below,
> > +         * specify it here to highlight that try_alloc_pages()
> > +         * doesn't want to deplete reserves.
> > +         */
> > +        gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> > +        unsigned int alloc_flags = ALLOC_TRYLOCK;
> > +        struct alloc_context ac = { };
> > +        struct page *page;
> > +
> > +        /*
> > +         * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> > +         * If spin_trylock() is called from hard IRQ the current task may be
> > +         * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> > +         * task as the owner of another rt_spin_lock which will confuse PI
> > +         * logic, so return immediately if called form hard IRQ or NMI.
> > +         *
> > +         * Note, irqs_disabled() case is ok. This function can be called
> > +         * from raw_spin_lock_irqsave region.
> > +         */
> > +        if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +                return NULL;
> > +        if (!pcp_allowed_order(order))
> > +                return NULL;
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +        /* Bailout, since try_to_accept_memory_one() needs to take a lock */
> > +        if (has_unaccepted_memory())
> > +                return NULL;
> > +#endif
> > +        /* Bailout, since _deferred_grow_zone() needs to take a lock */
> > +        if (deferred_pages_enabled())
> > +                return NULL;
>
> Is it fine for BPF that things will fail to allocate until all memory is
> deferred-initialized and accepted? I guess it's easy to teach those places
> later to evaluate if they can take the lock.

Exactly. If it becomes an issue it can be addressed.
I didn't want to complicate the code just because.

> > +
> > +        if (nid == NUMA_NO_NODE)
> > +                nid = numa_node_id();
> > +
> > +        prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > +                            &alloc_gfp, &alloc_flags);
> > +
> > +        /*
> > +         * Best effort allocation from percpu free list.
> > +         * If it's empty attempt to spin_trylock zone->lock.
> > +         */
> > +        page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
>
> What about set_page_owner() from post_alloc_hook() and it's stackdepot
> saving. I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?

set_page_owner() should be fine.
save_stack() has an in_page_owner recursion protection mechanism.

stack_depot_save_flags() may be problematic if there is another path to it.
I guess I can do:

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..61772bc4b811 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -630,7 +630,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
                        prealloc = page_address(page);
        }

-       if (in_nmi()) {
+       if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
                /* We can never allocate in NMI context. */
                WARN_ON_ONCE(can_alloc);
                /* Best effort; bail if we fail to take the lock. */
                if (!raw_spin_trylock_irqsave(&pool_lock, flags))
                        goto exit;

as part of this patch, but I'm not convinced it's necessary.
stack_depot* is effectively noinstr.
kprobe-bpf cannot be placed in there and afaict it doesn't call any
tracepoints. So in_nmi() is the only way to reenter and that's
already covered.
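And re the "This function seems unused" note above: right, the users of
gfpflags_allow_spinning() come with the follow-up free_page and memcg
patches. The intended consumer pattern is roughly the following
(an illustrative sketch only, not code from those patches; "lock",
"flags" and the bail-out on contention are placeholders for whatever
the real call site needs):

        if (!gfpflags_allow_spinning(gfp_mask)) {
                /*
                 * Running context is unknown (e.g. BPF from tracing),
                 * so only trylock and treat contention as a best effort
                 * failure instead of spinning.
                 */
                if (!spin_trylock_irqsave(&lock, flags))
                        return NULL;
        } else {
                spin_lock_irqsave(&lock, flags);
        }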