Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation

On Tue, Mar 11, 2025 at 11:04 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> > On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory. The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > >
> > > The "prior workarounds" sound entirely appropriate.  Because the
> > > performance and maintainability of Linux's page allocator is about
> > > 1,000,040 times more important than relieving BPF of having to carry a
> > > "workaround".
> >
> > Please explain where performance and maintainability is affected?
> >
>
> I have some related questions below. Note I'm a bystander, not claiming
> to have any (N)ACK power.
>
> A small bit before that:
>        if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                        return NULL;
>                spin_lock_irqsave(&zone->lock, flags);
>        }
>
> This is going to perform worse when contested due to an extra access to
> the lock. I presume it was done this way to avoid suffering another
> branch, with the assumption the trylock is normally going to succeed.

Steven already explained that the extra trylock is noise from a performance pov.
Also notice that it's indeed not written as

if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
        if (!spin_trylock_irqsave(&zone->lock, flags))
                return NULL;
} else {
        spin_lock_irqsave(&zone->lock, flags);
}

because written that way every caller would suffer the extra branch
up front. Even that extra branch would have been in the noise
considering all the prior branches rmqueue*() does.
With the unconditional trylock first the performance noise is even less
noticeable.

> I think it would help to outline why these are doing any memory
> allocation from something like NMI to begin with. Perhaps you could have
> carved out a very small piece of memory as a reserve just for that? It
> would be refilled as needed from process context.

It's not about NMI. It's about an unknown context.
bpf and other subsystems will not be calling it from NMI on purpose.
It can happen by accident.
See netpoll_send_udp().
It's using alloc_skb(GFP_ATOMIC) which is the same as GFP_WISH_ME_LUCK.
netpoll is called from an arbitrary context where accessing slab is not ok.

> If non-task memory allocs got beaten to the curb, or at least got heavily
> limited, then a small allocator just for that purpose would do the
> trick and the two variants would likely be simpler than one thing which
> supports everyone.

Compare the diffstat of this patch with the diff of bpf_mem_alloc and
the other hacks to see what it actually takes to build a "small allocator".
Also notice that a "small allocator" doesn't work for netpoll.
It needs an skb in an unknown context and currently pretends
that GFP_ATOMIC isn't going to crash.
An any-context kmalloc will finally fix it.




