Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

Delyan Kratunov <delyank@xxxxxx> · Thu, 1 Sep 2022 22:46:09 +0000

On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> !-------------------------------------------------------------------|
>   This Message Is From an External Sender
> 
> > -------------------------------------------------------------------!
> 
> On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > Given that tracing programs can't really maintain their own freelists safely (I think
> > they're missing the building blocks - you can't cmpxchg kptrs),
> 
> Today? yes, but soon we will have link lists supported natively.
> 
> > I do feel like
> > isolated allocators are a requirement here. Without them, allocations can fail and
> > there's no way to write a reliable program.
> 
> Completely agree that there should be a way for programs
> to guarantee availability of the element.
> Inline allocation can fail regardless whether allocation pool
> is shared by multiple programs or a single program owns an allocator.

I'm not sure I understand this point. 
If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
for struct foo to succeed. In what situations can this fail if I'm the only program
using it _and_ I've calculated the prefill amount correctly?

Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
case, I think that's a mistake - prefill should effectively _set_ the freelist
limits.

> In that sense, allowing multiple programs to create an instance
> of an allocator doesn't solve this problem.
> Short free list inside bpf_mem_cache is an implementation detail.
> "prefill to guarantee successful alloc" is a bit out of scope
> of an allocator.

I disagree that it's out of scope. This is the only access to dynamic memory from a
bpf program, it makes sense that it covers the requirements of bpf programs,
including prefill/freelist behavior, so all programs can safely use it.

> "allocate a set and stash it" should be a separate building block
> available to bpf progs when step "allocate" can fail and
> efficient "stash it" can probably be done on top of the link list.

Do you imagine a BPF object that every user has to link into their programs (yuck),
or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
this way.

If it's distributed separately, I think that's an unexpected burden on developers
(I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
skels). There are no other bpf features that require a userspace support library like
this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
without a library)
> 
> > *If* we ensure that you can build a usable freelist out of allocator-backed memory
> > for (a set of) nmi programs, then I can maybe get behind this (but there's other
> > reasons not to do this).
> 
> Agree that nmi adds another quirk to "stash it" step.
> If native link list is not going to work then something
> else would have to be designed.

Given that I'll be making the deferred work mechanism on top of this allocator, we
definitely need to cover nmi usage cleanly. It would be almost unusable to have your
deferred work fail to submit randomly.

> 
> > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> > > more than we need. Ideally it would be easy to specifiy one single
> > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> > > In other words one bpf allocator per bpf "namespace" is more than enough.
> > 
> > _Potentially_. Programs need to know that when they reserved X objects, they'll have
> > them available at a later time and any sharing with other programs can remove this
> > property.
> 
> Agree.
> 
> > A _set_ of programs can in theory determine the right prefill levels, but
> > this is certainly easier to reason about on a per-program basis, given that programs
> > will run at different rates.
> 
> Agree as well.
> 
> > Why does it require a global allocator? For example, you can have each program have
> > its own internal allocator and with runtime live counts, this API is very achievable.
> > Once the program unloads, you can drain the freelists, so most allocator memory does
> > not have to live as long as the longest-lived object from that allocator. In
> > addition, all allocators can share a global freelist too, so chunks released after
> > the program unloads get a chance to be reused.
> 
> All makes sense to me except that the kernel can provide that
> global allocator and per-program "allocators" can hopefully be
> implemented as native bpf code.
> 
> > How is having one allocator per program different from having one allocator per set
> > of programs, with per-program bpf-side freelists? The requirement that some (most?)
> > programs need deterministic access to their freelists is still there, no matter the
> > number of allocators. If we fear that the default freelist behavior will waste
> > memory, then the defaults need to be aggressively conservative, with programs being
> > able to adjust them.
> 
> I think the disagreement here is that per-prog allocator based
> on bpf_mem_alloc isn't going to be any more deterministic than
> one global bpf_mem_alloc for all progs.
> Per-prog short free list of ~64 elements vs
> global free list of ~64 elements.

Right, I think I had a hidden assumption here that we've exposed. 
Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
1000 struct foos on the freelist and the freelist thresholds would be adjusted
accordingly (i.e., you can free all of them and allocate them again, all from the
freelist).

Ultimately, that's what nmi programs actually need but I see why that's not an
obvious behavior.

> In both cases these lists will have to do irq_work and refill
> out of global slabs.

If a tracing program needs irq_work to refill, then hasn't the API already failed the
program writer? I'd have to remind myself how irq_work actually works but given that
it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
irq_work has a chance to refill it. I don't see how you'd write reliable programs
this way.

> 
> > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > memory usage, which is strictly worse for us (and worse developer experience on top).
> 
> I don't understand this point.
> All allocations are still coming out of bpf_mem_alloc.
> We can have debug mode with memleak support and other debug
> mechanisms.

I mostly mean accounting here. If we segment the allocated objects by finer-grained
allocators, we can attribute them to individual programs better. But, I agree, this
can be implemented in other ways, it can just be counts/tables on bpf_prog.

> 
> > > (The profileration of kmem_cache-s in the past
> > > forced merging of them). By restricting bpf program choices with allocator-per-map
> > > (this option 3) we're not only making the kernel side to do less work
> > > (no run-time ref counts, no merging is required today), we're also pushing
> > > bpf progs to use memory concious choices.
> > 
> > This is conflating "there needs to be a limit on memory stuck in freelists" with "you
> > can only store kptrs from one allocator in each map." The former practically
> > advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
> > with the latter - it's coming strictly from the desire to have static mappings
> > between object storage and allocators; it's not coming from a memory usage need, it
> > only avoids runtime live object counts.
> > 
> > > Having said all that maybe one global allocator is not such a bad idea.
> > 
> > It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
> > think there is value in having precise and conservative freelist policies, along with
> > a global freelist for overflow and draining after program unload. The latter would
> > allow us to share memory between allocators without sacrificing per-allocator
> > freelist determinism, especially if paired with very static (but configurable)
> > freelist thresholds.
> 
> What is 'freelist determinism' ?

The property that prefill keeps all objects on the freelist, so the following
sequence doesn't observe allocation failures:

bpf_mem_prefill(1000, struct foo);
run_1000_times { alloc(struct foo); }
run_1000_times { free(struct foo); }
run_1000_times { alloc(struct foo); }
alloc(struct foo) // this can observe a failure

> Are you talking about some other freelist on top of bpf_mem_alloc's
> free lists ?

Well, that's the question, isn't it? I think it should be part of the bpf kernel
ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
instantiated per-program, that's easiest to reason about.

-- Delyan