Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 31 Aug 2022 20:55:51 -0700

On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> Given that tracing programs can't really maintain their own freelists safely (I think
> they're missing the building blocks - you can't cmpxchg kptrs),

Today? Yes, but soon we will have linked lists supported natively.

> I do feel like
> isolated allocators are a requirement here. Without them, allocations can fail and
> there's no way to write a reliable program.

Completely agree that there should be a way for programs
to guarantee availability of the element.
Inline allocation can fail regardless of whether the allocation pool
is shared by multiple programs or a single program owns an allocator.
In that sense, allowing multiple programs to create an instance
of an allocator doesn't solve this problem.
The short free list inside bpf_mem_cache is an implementation detail.
"Prefill to guarantee successful alloc" is a bit out of scope
for an allocator.
"Allocate a set and stash it" should be a separate building block
available to bpf progs: the "allocate" step can fail, and an
efficient "stash it" step can probably be built on top of the linked list.
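
To make "allocate a set and stash it" concrete, here is a rough userspace
sketch of the two steps. Everything in it is an assumption for illustration:
malloc()/calloc() stand in for the real allocator, and struct stash,
stash_reserve(), stash_push() and stash_pop() are hypothetical names, not an
existing or proposed bpf API. It also ignores nmi and concurrency entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element; the list node is embedded in the object itself. */
struct elem {
	struct elem *next;
	long payload;
};

/* Hypothetical per-program stash: a plain singly linked list. */
struct stash {
	struct elem *head;
	unsigned int count;
};

/* "allocate": this step may fail. calloc() stands in for the allocator. */
static struct elem *elem_alloc(void)
{
	return calloc(1, sizeof(struct elem));
}

/* "stash it": push an already allocated element onto the list. */
static void stash_push(struct stash *s, struct elem *e)
{
	e->next = s->head;
	s->head = e;
	s->count++;
}

/* Take a reserved element. Returns NULL only when the stash is empty. */
static struct elem *stash_pop(struct stash *s)
{
	struct elem *e = s->head;

	if (!e)
		return NULL;
	s->head = e->next;
	e->next = NULL;
	s->count--;
	return e;
}

/* Reserve up to 'want' elements; returns how many were actually stashed. */
static unsigned int stash_reserve(struct stash *s, unsigned int want)
{
	unsigned int got = 0;

	while (got < want) {
		struct elem *e = elem_alloc();

		if (!e)
			break;	/* the caller sees the shortfall up front */
		stash_push(s, e);
		got++;
	}
	return got;
}

/* Give every stashed element back to the backing allocator. */
static void stash_drain(struct stash *s)
{
	struct elem *e;

	while ((e = stash_pop(s)))
		free(e);
	assert(s->count == 0);
}
```

The point of the split is that the failure is reported by stash_reserve(),
in a context that can handle it, while stash_pop() on the hot path only
fails once the reservation has been used up.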

> *If* we ensure that you can build a usable freelist out of allocator-backed memory
> for (a set of) nmi programs, then I can maybe get behind this (but there's other
> reasons not to do this).

Agree that nmi adds another quirk to "stash it" step.
If the native linked list is not going to work then something
else would have to be designed.

> > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > more than we need. Ideally it would be easy to specify one single
> > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > In other words one bpf allocator per bpf "namespace" is more than enough.
>
> _Potentially_. Programs need to know that when they reserved X objects, they'll have
> them available at a later time and any sharing with other programs can remove this
> property.

Agree.

> A _set_ of programs can in theory determine the right prefill levels, but
> this is certainly easier to reason about on a per-program basis, given that programs
> will run at different rates.

Agree as well.

> Why does it require a global allocator? For example, you can have each program have
> its own internal allocator and with runtime live counts, this API is very achievable.
> Once the program unloads, you can drain the freelists, so most allocator memory does
> not have to live as long as the longest-lived object from that allocator. In
> addition, all allocators can share a global freelist too, so chunks released after
> the program unloads get a chance to be reused.

All makes sense to me except that the kernel can provide that
global allocator and per-program "allocators" can hopefully be
implemented as native bpf code.

> How is having one allocator per program different from having one allocator per set
> of programs, with per-program bpf-side freelists? The requirement that some (most?)
> programs need deterministic access to their freelists is still there, no matter the
> number of allocators. If we fear that the default freelist behavior will waste
> memory, then the defaults need to be aggressively conservative, with programs being
> able to adjust them.

I think the disagreement here is that per-prog allocator based
on bpf_mem_alloc isn't going to be any more deterministic than
one global bpf_mem_alloc for all progs.
Per-prog short free list of ~64 elements vs
global free list of ~64 elements.
In both cases these lists will have to do irq_work and refill
out of global slabs.
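
A toy userspace model of that behavior, to show why the owner of the list
doesn't matter. This is not the bpf_mem_alloc code: the watermark values,
struct cache and the function names are made up for illustration, malloc()
stands in for the global slabs, and a flag stands in for queueing irq_work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_HIGH 64	/* illustrative size of the short free list */
#define CACHE_LOW  32	/* illustrative low watermark */

struct node {
	struct node *next;
};

/* One short free list; it could be per-prog or global, the code is the same. */
struct cache {
	struct node *free_list;
	int free_cnt;
	bool refill_pending;	/* stands in for a queued irq_work */
};

/* Deferred refill out of the backing allocator, up to the high watermark. */
static void cache_refill(struct cache *c)
{
	while (c->free_cnt < CACHE_HIGH) {
		struct node *n = malloc(sizeof(*n));

		if (!n)
			break;
		n->next = c->free_list;
		c->free_list = n;
		c->free_cnt++;
	}
	c->refill_pending = false;
}

/* Inline allocation: only pops the free list, never calls malloc(). */
static struct node *cache_alloc(struct cache *c)
{
	struct node *n = c->free_list;

	if (n) {
		c->free_list = n->next;
		c->free_cnt--;
	}
	if (c->free_cnt < CACHE_LOW)
		c->refill_pending = true;	/* refill happens later */
	return n;
}

/* Count how many of 'want' back-to-back allocations succeed. */
static int cache_burst(struct cache *c, int want)
{
	int ok = 0;

	for (int i = 0; i < want; i++) {
		struct node *n = cache_alloc(c);

		if (!n)
			break;	/* list ran dry before the refill could run */
		free(n);	/* the caller consumes the element */
		ok++;
	}
	return ok;
}
```

In this model a burst of 100 allocations gets 64 elements and then fails
until cache_refill() runs, whether the struct cache belongs to one program
or to all of them.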

> Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> memory usage, which is strictly worse for us (and worse developer experience on top).

I don't understand this point.
All allocations are still coming out of bpf_mem_alloc.
We can have debug mode with memleak support and other debug
mechanisms.

> > (The proliferation of kmem_cache-s in the past
> > forced merging of them). By restricting bpf program choices with allocator-per-map
> > (this option 3) we're not only making the kernel side do less work
> > (no run-time ref counts, no merging is required today), we're also pushing
> > bpf progs to make memory-conscious choices.
>
> This is conflating "there needs to be a limit on memory stuck in freelists" with "you
> can only store kptrs from one allocator in each map." The former practically
> advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
> with the latter - it's coming strictly from the desire to have static mappings
> between object storage and allocators; it's not coming from a memory usage need, it
> only avoids runtime live object counts.
>
> > Having said all that maybe one global allocator is not such a bad idea.
>
> It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
> think there is value in having precise and conservative freelist policies, along with
> a global freelist for overflow and draining after program unload. The latter would
> allow us to share memory between allocators without sacrificing per-allocator
> freelist determinism, especially if paired with very static (but configurable)
> freelist thresholds.

What is 'freelist determinism'?
Are you talking about some other freelist on top of bpf_mem_alloc's
free lists?