On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory. The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> >
> > The "prior workarounds" sound entirely appropriate. Because the
> > performance and maintainability of Linux's page allocator is about
> > 1,000,040 times more important than relieving BPF of having to carry a
> > "workaround".
>
> Please explain where performance and maintainability is affected?
>

I have some related questions below. Note I'm a bystander, not claiming
to have any (N)ACK power.

A small bit before that:

	if (!spin_trylock_irqsave(&zone->lock, flags)) {
		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
			return NULL;
		spin_lock_irqsave(&zone->lock, flags);
	}

This is going to perform worse when contested due to an extra access to
the lock. I presume it was done this way to avoid suffering another
branch, with the assumption that the trylock is normally going to
succeed.

As it happens, I'm looking at parallel exec on the side, and while there
is currently bigger fish to fry, contention on zone->lock is very much a
factor. The majority of it comes from RCU freeing (in
free_pcppages_bulk()), but I also see several rmqueue calls below it. As
they trylock, they are going to make it more expensive for
free_pcppages_bulk() to even get the lock. So this *is* contested, but
at the moment it is largely overshadowed by bigger problems (which
someone(tm) hopefully will sort out sooner rather than later).

So should this land, I expect someone is going to hoist the trylock at
some point in the future (a sketch of what I mean is at the end of this
mail). If it was my patch I would just do it now, but I understand this
may result in new people showing up and complaining.

> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
>
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.
>

I suspect the worry is that the added behavior may complicate things
down the road (or even prevent some optimizations) -- there is another
context to worry about.

I think it would help to outline why these are doing any memory
allocation from something like NMI context to begin with. Perhaps you
could have carved out a very small piece of memory as a reserve just for
that? It would be refilled as needed from process context (a rough
sketch is below).

A general remark is that support for an arbitrary running context in
core primitives artificially limits what can be done to optimize them
for their most common users. IMO the sheaves patchset is a little bit of
an admission of that (also see below). It may be that the get pages
routine will get there too.
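To make the reserve suggestion a bit more concrete, below is roughly
what I have in mind. This is untested, all the names are made up, and
real-life details (sizing, boot-time init of the irq_work/work structs,
NUMA, accounting) are ignored -- the point is only that the NMI/any-
context side can get away with a handful of lockless per-CPU slots,
while all the actual allocation happens in process context:

#include <linux/gfp.h>
#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

#define RESERVE_SLOTS	8

struct emergency_reserve {
	struct page *slot[RESERVE_SLOTS];
	struct irq_work kick;		/* NMI-safe trampoline... */
	struct work_struct refill;	/* ...to process context */
};

/* init_irq_work()/INIT_WORK() wiring at boot omitted for brevity */
static DEFINE_PER_CPU(struct emergency_reserve, emergency_reserve);

/*
 * Pop from a context where preemption is already off (NMI, tracing,
 * hard irq): xchg() on the slot is NMI-safe.
 */
static struct page *reserve_pop(void)
{
	struct emergency_reserve *r = this_cpu_ptr(&emergency_reserve);
	int i;

	for (i = 0; i < RESERVE_SLOTS; i++) {
		struct page *page = xchg(&r->slot[i], NULL);

		if (page) {
			/* ask for a refill from a sane context */
			irq_work_queue(&r->kick);
			return page;
		}
	}
	return NULL;	/* reserve exhausted, the caller has to cope */
}

/* irq_work handler: merely bounce to a workqueue where sleeping is fine */
static void reserve_kick(struct irq_work *work)
{
	struct emergency_reserve *r =
		container_of(work, struct emergency_reserve, kick);

	schedule_work(&r->refill);
}

/* Process context: top the reserve back up with plain GFP_KERNEL allocs */
static void reserve_refill(struct work_struct *work)
{
	struct emergency_reserve *r =
		container_of(work, struct emergency_reserve, refill);
	int i;

	for (i = 0; i < RESERVE_SLOTS; i++) {
		struct page *page;

		if (r->slot[i])
			continue;
		page = alloc_page(GFP_KERNEL);
		if (!page)
			break;
		if (cmpxchg(&r->slot[i], NULL, page) != NULL)
			__free_page(page);	/* lost the race, slot refilled */
	}
}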
If non-task memory allocs got kicked to the curb, or at least heavily
limited, then a small allocator just for that purpose would do the
trick, and the two variants would likely be simpler than one thing which
supports everyone. This patchset is a step in the opposite direction,
but it may be that there is a good reason.

To my understanding ebpf can be used to run "real" code to do something
or "merely" collect data. I presume the former case is already running
from a context where it can allocate memory no problem. For the latter I
presume ebpf has conflicting goals:

1. not disrupt the workload under observation (cpu and ram) -- to that
   end small memory usage limits are in place. Otherwise a carelessly
   written aggregation can OOM the box (e.g., say someone wants to know
   which files get opened the most and aggregates on names, while a
   malicious user opens and unlinks autogenerated names, endlessly
   growing the list if you let it).

2. actually log stuff even if resources are scarce. To that end I would
   expect that a small area is pre-allocated and periodically drained.

Which for me further puts a question mark over general allocation from
NMI context.

All that said, the cover letter:

> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.

does not justify why ebpf performs allocations in a manner which
warrants any of this, which I suspect is what Andrew asked about.

I never put any effort into ebpf -- it may be that all of the above
makes excellent sense. But then you need to make a case to the people
maintaining the code.
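PS. To be concrete about hoisting the trylock mentioned earlier: I mean
branching on alloc_flags first, so the regular path never pays for a
failed trylock. Untested, same names as the snippet quoted above:

	if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
		/* opportunistic callers give up instead of spinning */
		if (!spin_trylock_irqsave(&zone->lock, flags))
			return NULL;
	} else {
		/* everyone else takes the lock as before */
		spin_lock_irqsave(&zone->lock, flags);
	}

The extra branch is on alloc_flags, which is already at hand, not on the
contended zone->lock cacheline, so regular callers only touch the lock
once.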