Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

On Mon, 2022-08-29 at 18:05 -0700, Alexei Starovoitov wrote:
> 
> On Mon, Aug 29, 2022 at 5:45 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> > 
> > On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > 
> > > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi
> > > <memxor@xxxxxxxxx> wrote:
> > > > 
> > > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > > 
> > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@xxxxxx> wrote:
> > > > > > 
> > > > > > > 
> > > > > > > It is not very precise, but until those maps are gone it delays
> > > > > > > release of the allocator (we can empty all percpu caches to save
> > > > > > > memory once bpf_map pinning the allocator is gone, because allocations
> > > > > > > are not going to be served). But it allows unit_free to be relatively
> > > > > > > less costly as long as those 'candidate' maps are around.
> > > > > > 
> > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by
> > > > > > just loading and unloading programs that can access an allocator in a loop. The
> > > > > > freelists being empty help but it's still quite easy to hold a lot of memory for
> > > > > > nothing.
> > > > > > 
> > > > > > The pointer walk was proposed to prune most such pathological cases while still being
> > > > > > conservative enough to be easy to implement. Only races with the pointer walk can
> > > > > > extend the lifetime unnecessarily.
> > > > > 
> > > > > I'm getting lost in this thread.
> > > > > 
> > > > > Here is my understanding so far:
> > > > > We don't free kernel kptrs from map in release_uref,
> > > > > but we should for local kptrs, since such objs are
> > > > > not much different from timers.
> > > > > So release_uref will xchg all such kptrs and free them
> > > > > into the allocator without touching allocator's refcnt.
> > > > > So there is no concurrency issue that Kumar was concerned about.
Yes, if we populate used_allocators at load time and copy them into inner maps, this
might work. It requires the most conservative approach, though: loading and unloading
a program in a loop adds its allocators to the visible pinned maps, accumulating
allocators we can't release until the map itself is gone.

However, I thought you wanted to walk the values instead to prevent this abuse. At
least that's the understanding I was operating under.

If a program has the maximum number of possible allocators and we just load/unload it
in a loop against a visible pinned map, that map's used_allocators list can easily
skyrocket. This is a problem in itself: the list needs to grow dynamically, and while
program load/unload is a fine context to grow it from, inner map insert/delete can
also grow the list, and that's potentially a bad context to allocate in.

> > > > 
> > > > Haven't really thought through whether this will fix the concurrent
> > > > kptr swap problem, but then with this I think you need:
> > > > - New helper bpf_local_kptr_xchg(map, map_value, kptr)
> > > 
> > > no. why?
> > > current bpf_kptr_xchg(void *map_value, void *ptr) should work.
> > > The verifier knows map ptr from map_value.
> > > 
> > > > - Associating map_uid of map, map_value
> > > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to
> > > > local_kptr_xchg
> > > > 1 and 2 because of inner_maps, 3 because of release_uref.
> > > > But maybe not a deal breaker?
> > > 
> > > No run-time refcnts.
> > 
> > How is future kptr_xchg prevented for the map after its usercnt drops to 0?
> > If we don't check it at runtime we can xchg in non-NULL kptr after
> > release_uref callback.
> > For timer you are taking timer spinlock and reading map->usercnt in
> > timer_set_callback.
> 
> Sorry I confused myself and others with release_uref.
> I meant map_poke_untrack-like call.
> When we drop refs from used maps in __bpf_free_used_maps
> we walk all elements.
> Similar idea here.
> When prog is unloaded it cleans up all objects it allocated
> and stored into maps before dropping refcnt-s
> in prog->used_allocators.

For an allocator that's visible from multiple programs (say, it's in a map-of-maps),
how would we know which values were allocated by which program? Do we forbid shared
allocators outright? I can imagine container agent-like software wanting to reuse its
allocator caches from a previous run.

Besides, cleaning up the values is the easy part: so long as the map is extending
the allocator's lifetime, this is safe; we could even use the kptr destructor
mechanism (though I'd rather not).

There are three main cases of lifetime extension:
1) Directly visible maps - normal used_allocators with or without a walk works here.
2) Inner maps. I plan to straight up disallow their insertion if they have a
kptr_local field. If someone needs this, we can solve it then; it's too hard to do
the union-of-lists logic cleanly for a v1.
3) Stack of another program. I'm not sure how to require that programs with
kptr_local do not use the same btf structs, but if that's possible, this can
naturally be disallowed via the btf comparisons failing (it's another case of btf
domain crossover). Maybe we tag the btf struct as containing local kptr type tags
and disallow its use by more than one program?

Separately, I think I just won't allow allocators as inner maps; that's for another
day too (though it may work just fine).

Perfect, enemy of the good, something-something.

-- Delyan



