Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote:
> On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote:
> > 
> > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > > Given that tracing programs can't really maintain their own freelists safely (I think
> > > they're missing the building blocks - you can't cmpxchg kptrs),
> > 
> > Today? Yes, but soon we will have linked lists supported natively.
> > 
> > > I do feel like
> > > isolated allocators are a requirement here. Without them, allocations can fail and
> > > there's no way to write a reliable program.
> > 
> > Completely agree that there should be a way for programs
> > to guarantee availability of the element.
> > Inline allocation can fail regardless of whether the allocation pool
> > is shared by multiple programs or a single program owns the allocator.
> 
> I'm not sure I understand this point. 
> If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs
> for struct foo to succeed. In what situations can this fail if I'm the only program
> using it _and_ I've calculated the prefill amount correctly?
> 
> Unless you're saying that the prefill wouldn't adjust the freelist limits, in which
> case, I think that's a mistake - prefill should effectively _set_ the freelist
> limits.

There is no prefill implementation today, so we're just guessing, but let's try.
prefill would probably have to adjust the high-watermark limit.
That makes sense, but for how long? Should the watermark go back down after
some time, or after N objects were consumed?
And what is prefill going to do? Prefill on the current cpu only?
That doesn't help the prog consume the objects deterministically.
Prefill on all cpus? That's possible,
but for_each_possible_cpu() {irq_work_queue_on(cpu);}
looks like significant memory and run-time overhead.
When the freelist is managed by the program it may contain just the N elements
that the prog needs.
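
To make the overhead concrete, the on-all-cpus variant would have to do
something like the following. A sketch only; the per-cpu layout and the
refill_work field are stand-ins for whatever a real implementation would use:

static void prefill_on_all_cpus(struct bpf_mem_alloc *ma)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct bpf_mem_cache *c = per_cpu_ptr(ma->cache, cpu);

                /* one irq_work per possible cpu, and every cpu ends up
                 * holding its own copy of the prefilled objects
                 */
                irq_work_queue_on(&c->refill_work, cpu);
        }
}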

> > In that sense, allowing multiple programs to create an instance
> > of an allocator doesn't solve this problem.
> > Short free list inside bpf_mem_cache is an implementation detail.
> > "prefill to guarantee successful alloc" is a bit out of scope
> > of an allocator.
> 
> I disagree that it's out of scope. This is the only access to dynamic memory from a
> bpf program, it makes sense that it covers the requirements of bpf programs,
> including prefill/freelist behavior, so all programs can safely use it.
> 
> > "allocate a set and stash it" should be a separate building block
> > available to bpf progs when step "allocate" can fail and
> > efficient "stash it" can probably be done on top of the link list.
> 
> Do you imagine a BPF object that every user has to link into their programs (yuck),
> or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things
> this way.

I'm assuming Kumar's proposed list API:
struct bpf_list_head head;
struct bpf_list_node node;
bpf_list_insert(&node, &head);

will work out.
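
If it does, "allocate a set and stash it" is a tiny amount of bpf code on top
of it. A sketch, with the allocation call as a stand-in and error handling
simplified:

struct foo {
        struct bpf_list_node node;
        u64 payload;
};

struct bpf_list_head head; /* e.g. in a map value */

/* runs in a context where allocation is allowed to fail and be retried */
static int stash_n(int n)
{
        for (int i = 0; i < n; i++) {
                struct foo *f = bpf_mem_alloc(...); /* stand-in */

                if (!f)
                        return -ENOMEM; /* caller retries later */
                bpf_list_insert(&f->node, &head);
        }
        return 0;
}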

> If it's distributed separately, I think that's an unexpected burden on developers
> (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool
> skels). There are no other bpf features that require a userspace support library like
> this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful
> without a library)

bpf progs must not pay for what they don't use. Hence all building blocks should be
small. We will have libc-like bpf libraries with bigger blocks eventually. 

> > I think the disagreement here is that per-prog allocator based
> > on bpf_mem_alloc isn't going to be any more deterministic than
> > one global bpf_mem_alloc for all progs.
> > Per-prog short free list of ~64 elements vs
> > global free list of ~64 elements.
> 
> Right, I think I had a hidden assumption here that we've exposed. 
> Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be
> 1000 struct foos on the freelist and the freelist thresholds would be adjusted
> accordingly (i.e., you can free all of them and allocate them again, all from the
> freelist).
> 
> Ultimately, that's what nmi programs actually need but I see why that's not an
> obvious behavior.

How prefill is going to work is still to-be-designed.
In addition to the current-cpu vs on-all-cpus question above...
Will the prefill() helper just do irq_work?
If so then it doesn't help nmi and irq-disabled progs at all.
A prefill helper working asynchronously doesn't guarantee availability of
objects to the program later on.
prefill() becomes a hint, and is probably useless as such.
So it should probably be synchronous and fail when in-nmi or in-irq?
But a bpf prog cannot know its own context, so the only safe synchronous
prefill() would be from sleepable progs.
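
In other words the only shape that gives a real guarantee is roughly the
following. A sketch, not a proposal; the kfunc internals are hypothetical:

/* synchronous prefill: the verifier would have to restrict it to
 * sleepable progs, since GFP_KERNEL may sleep and the prog itself
 * cannot check what context it is running in
 */
int bpf_mem_prefill(struct bpf_mem_alloc *ma, int cnt, size_t size)
{
        while (cnt--) {
                void *obj = kmalloc(size, GFP_KERNEL);

                if (!obj)
                        return -ENOMEM;
                free_to_cache(ma, obj); /* stand-in for the freelist push */
        }
        return 0;
}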

> > In both cases these lists will have to do irq_work and refill
> > out of global slabs.
> 
> If a tracing program needs irq_work to refill, then hasn't the API already failed the
> program writer? I'd have to remind myself how irq_work actually works but given that
> it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before
> irq_work has a chance to refill it. I don't see how you'd write reliable programs
> this way.

The only way an nmi prog can guarantee availability is if the objects are
allocated and reserved by a different non-nmi program.
If everything runs in nmi there is nothing that can be done.
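
Roughly like this (a sketch; list helpers are from Kumar's proposal, while
bpf_list_pop and the section names are made up for illustration):

/* task-context prog: reserves objects, can tolerate alloc failure */
SEC("tp_btf/sched_switch")
int reserve(void *ctx)
{
        struct foo *f = bpf_mem_alloc(...); /* may fail, retried next run */

        if (f)
                bpf_list_insert(&f->node, &shared_head);
        return 0;
}

/* nmi prog: never allocates, only consumes what was reserved */
SEC("perf_event")
int consume(void *ctx)
{
        struct foo *f = bpf_list_pop(&shared_head); /* hypothetical */

        if (!f)
                return 0; /* reservation exhausted */
        /* ... use f ... */
        return 0;
}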

> > 
> > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the
> > > memory usage, which is strictly worse for us (and worse developer experience on top).
> > 
> > I don't understand this point.
> > All allocations are still coming out of bpf_mem_alloc.
> > We can have debug mode with memleak support and other debug
> > mechanisms.
> 
> I mostly mean accounting here. If we segment the allocated objects by finer-grained
> allocators, we can attribute them to individual programs better. But, I agree, this
> can be implemented in other ways, it can just be counts/tables on bpf_prog.

Memory accounting is a whole different, huge and largely unsolved topic.
The thread about memcg and bpf is still ongoing.
Fine-grained bpf_mem_alloc isn't going to magically solve it.

> > 
> > What is 'freelist determinism' ?
> 
> The property that prefill keeps all objects on the freelist, so the following
> sequence doesn't observe allocation failures:
> 
> bpf_mem_prefill(1000, struct foo);
> run_1000_times { alloc(struct foo); }
> run_1000_times { free(struct foo); }
> run_1000_times { alloc(struct foo); }
> alloc(struct foo) // this can observe a failure

We cannot evaluate the above until we answer the current-cpu vs on-all-cpus
question and decide whether bpf_mem_prefill is sync or async.

I still think designing prefill and guaranteed availability is out of scope
for the allocator.

> > Are you talking about some other freelist on top of bpf_mem_alloc's
> > free lists ?
> 
> Well, that's the question, isn't it? I think it should be part of the bpf kernel
> ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's
> instantiated per-program, that's easiest to reason about.

There should be a way. For sure. Whether it's a helper/kfunc or trivial stuff
on top of the bpf linked list is still a question. Bundling this feature
together with the allocator feels artificial.
In user space C you wouldn't combine tcmalloc with a custom free list.
During the early days of bpf, bundling would have made total sense, but right
now we're able to create much smaller building blocks than in the past. I don't
think we'll be adding any more new map types. We probably won't be adding any
new program types either.


