Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

On Tue, 2022-08-30 at 18:52 -0700, Alexei Starovoitov wrote:
> > [SNIP]
> > This would _maybe_ work. The hole I can see in this plan is that once a slot is
> > claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the
> > local kptr fields useless (i.e. the allocator is around but no allocator map for it
> > exists and thus no program can use these fields again, they can only be free()).
> 
> That's correct if we think of allocators as property of programs, but see below.

Allocators are not properties of programs, allocator _access_ is. Allocators are
properties of the objects allocated from them, programs and maps only participate in
the lifecycle of the objects and thus extend the lifetime of the allocator.

> 
> > The reason it can't be unclaimed is that programs were verified with a specific
> > allocator value in mind and we can't change that after they're loaded. To be able to
> > unclaim a slot, we'd need to remove everything using that system (i.e. btf object)
> > and load it again, which is not great.
> 
> That's also correct.
> 
> > 
> > > When the verifier sees bpf_mem_alloc in P2 it will add
> > > {allocator, btf_id} pair to BTF.
> > > 
> > > If P1 loads first and the verifier see:
> > > p = kptr_xchg(M1.value, NULL);
> > > it will add {unique_allocator_placeholder, btf_id} to BTF.
> > > Then
> > > kptr_xchg(M2.value, p); does nothing.
> > > The verifier checks that M1's BTF == M2's BTF and id-s are same.
> > 
> > Note to self: don't allow setting base_btf from userspace without auditing all these
> > checks.
> > 
> > > 
> > > Then P3 loads with:
> > > p = kptr_xchg(M2.value, NULL);
> > > unit_free(p);
> > > since unique_allocator_placeholder is already there for that btf_id
> > > the verifier does nothing.
> > > 
> > > Eventually it will see bpf_mem_alloc for that btf_id and will
> > > replace the placeholder with the actual allocator.
> > > We can even allow P1 and P3 to be runnable after load right away.
> > > Since nothing can allocate into that kptr local those
> > > kptr_xchg() in P1 and P3 will be returning NULL.
> > > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.
> > > 
> > > Ideally for unit_free(p); in P3 the verifier would add a hidden
> > > 'ma' argument, so that allocator doesn't need to be stored dynamically.
> > > We can either patch insns of P3 after P2 was verified or
> > > pass a pointer to a place in BTF->used_allocator list of pairs
> > > where actual allocator pointer will be written later.
> > > Then no patching is needed.
> > > If P2 never loads, the unit_free(*addr /* here it will load the
> > > value of unique_allocator_placeholder */, ...) is still fine,
> > > since unit_free() will never execute with a valid obj to be freed.
> > > 
> > > "At map free time walk all elements and free kptrs" step stays the same.
> > > but we decrement allocator refcnt only when BTF is freed
> > > which should be after map free time.
> > > So btf_put(map->btf); would need to move after ops->map_free.
> > > 
> > > Maybe unique_allocator_placeholder approach can be used
> > > without moving the list into BTF. Need to think more about it tomorrow.
> > 
> > I don't think we have to resort to storing the type-allocator mappings in the BTF at
> > all.
> > 
> > In fact, we can encode them where you wanted to encode them with tags - on the fields
> > themselves. We can put the mem_alloc reference in the kptr field descriptors and have
> > it transition to a specific allocator irreversibly. We would need to record any
> > equivalences between allocators required by the currently loaded programs, but
> > it's not impossible.
> > 
> > Making the transition reversible is the hard part of course.
> > 
> > Taking a step back, we're trying to convert a runtime property (this object came from
> > this allocator) into a static property. The _cleanest_ way to do this would be to
> > store the dependencies explicitly and propagate/dissolve them on program load/unload.
> 
> Agree with above.
> 
> > Note that only programs introduce new dependency edges, maps on their own do not
> > mandate dependencies but stored values can extend the lifetime of a dependency chain.
> 
> I think we're getting to the bottom of the issues with this api.
> The api is designed around the concept of an allocator being another map type.
> That felt correct in the beginning, but see below.
> 
> > There are only a couple of types of dependencies:
> > Program X stores allocation from Y into map Z, field A
> > Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for
> > inner maps)
> 
> What does it mean for two allocators to be equivalent?
> Like mergeable slabs? If so the kernel gave the answer to that question long ago:
> merge them whenever possible.
> bpf-speak if two allocators have the same type they should be mergeable (== equivalent).

The requirement to track (allocator, type) tuples came from the desire to manage
allocator lifetime statically; it's not inherent in the problem. It lets us narrow
down the problem space while retaining some flexibility, in that multiple
allocators can still participate in the same map.

> Now let's come up with a use case when bpf core would need to recognize two
> equivalent allocators.
> Take networking service, like katran. It has a bunch of progs that are using
> pinned maps. The maps are pinned because progs can be reloaded due to
> live-upgrade or due to user space restart. The map contents are preserved, but
> the progs are reloaded. That's a common scenario.
> If we design allocator api in a way that allocator is another map. The prog
> reload case has two options:
> - create new allocator map and use it in reloaded progs
> - pin allocator map along with other maps at the very beginning and
>   make reloaded progs use it
> 
> In the later case there is no new allocator. No need to do equivalence checks.
> In the former case there is a new allocator and bpf core needs to recognize
> equivalence otherwise pinned maps with kptrs won't be usable out of reloaded
> progs. But what are the benefits of creating new allocators on reload?
> I don't see any. Old allocator had warm caches populated with cache friendly
> objects. New allocator has none of it, but it's going to be used for exactly
> the same objects. What's worse. New allocator will likely bring a new BTF object
> making equivalence work even harder.

From an optimal behavior point of view, I agree that starting a new allocator is not
ideal. However, if a design allows suboptimal behavior, it _will_ be used by users,
so it must be correct.

> 
> I think it all points to the design issue where allocator is another map and
> dependency between maps is a run-time property. As you pointed out earlier the
> cleanest way is to make this dependency static. Turned out we have such static
> dependency already. Existing bpf maps depend on kernel slab allocator. After
> bpf_mem_alloc patches each hash map will have a hidden dependency on that
> allocator.
> 
> What we've designed so far is:
> 
> struct foo {
>   int var;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   __type(value, struct foo);
> } ma SEC(".maps");
> 
> struct map_val {
>   struct foo * ptr __kptr __local;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> bpf_kptr_xchg(&map_val->ptr, p);
> 
> /* this is a copy-paste of my earlier example. no changes */
> 
> but what wasn't obvious that we have two allocators here.
> One explicit 'ma' and another hidden allocator in 'hash'.
> Both are based on 'struct bpf_mem_alloc'.
> One will be allocating 'struct foo'. Another 'struct map_val'.
> But we missed opportunity to make them mergeable and
> bpf prog cannot customize them.
> 
> I think the way to fix the api is to recognize:
> - every map has an allocator

Array maps don't, or at least not in a way that matters. But sure.

> - expose that allocator to progs

Good idea on its own.

> - allow sharing of allocators between maps

Hidden assumption here (see below).

> 
> so an allocator is a mandatory part of the map.
> If it's not specified an implicit one will be used.
> Thinking along these lines we probably need map-in-map-like
> specification of explicit allocator(s) for maps:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
>   __array(elem_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps");
> 
> That would be an explicit and static declaration of the allocators that the
> hash map should use.

Adding a kptr_allocator customization mechanism to all maps is fine, though pretty
heavy-handed in terms of abstraction leakage. elem_allocator doesn't make sense for
all maps.

> The benefits:
> - the prog writers can share the same allocator across multiple
> hash maps by specifying the same 'elem_allocator'.
> (struct bpf_mem_alloc already supports more than one size)
> The most memory-conscious authors can use the same allocator
> across all of their maps achieving the best memory savings.
> Not talking about kptr yet. This is just plain hash maps.
> 
> - the program can influence hash map allocator.
> Once we have bpf_mem_prefill() helper the bpf program can add
> reserved elements to a hash map and guarantee that
> bpf_map_update_elem() will succeed later.

I like these extensions, they make sense.

> 
> - kptr allocator is specified statically. At map creation time
> the map will take reference of allocator and will keep it until
> destruction. The verifier will make sure that kptr_xchg-ed objects
> come only from that allocator.
> So all of the refcnt issues discussed in this thread are gone.
> The prog will look like:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
> } ma SEC(".maps");
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps") = {
>         .kptr_allocator = { (void *)&ma },
> };

This part is going to be painful to implement but plausible. I also have to verify
that this will emit relocations and not global initializers (I don't recall the exact
rules).

> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> 
> If allocated kptr-s don't need to be in multiple maps the prog can be:
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
>   __array(kptr_allocator, struct {
>       __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   });
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&hash->kptr_allocator, type_id_local(struct foo));
> 
> The same hash->kptr_allocator can be used to allocate different types.
> We can also allow the same allocator to be specified in hash->elem_allocator
> and hash->kptr_allocator.

The need to extract the allocator into its own map in order to have kptrs of the same
type in more than one map is definitely a convenience downgrade. 

We're now at a point where the onus of dependency tracking falls not on the verifier
but entirely on the developer, while the tracking itself becomes less granular. On
the spectrum from runtime live-object counting (where allocators can be freely mixed
in the same map), through verifier smarts with _some_ restrictions (per field), to
this design, this feels like the most restrictive version from a developer-experience
point of view.

I personally also don't like APIs where easy things are easy but the moment you want
to do something slightly out of the blessed path, you hit a cliff and have to learn
about a bunch of things you don't care about. "I just want to move a pointer from
here to over there, why do I have to refactor all my maps?"

> 
> We can allow progs to skip declaring explicit BPF_MAP_TYPE_ALLOCATOR.
> Like:
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&hash->elem_allocator, type_id_local(struct foo));
> 
> In this case both kptr objects and map elements come from the same allocator
> that was implicitly created at hash map creation time.

Overall, this design (or maybe the way it's presented here) conflates a few ideas.

1) The extensions to expose and customize map's internal element allocator are fine
independently of even this patchset.

2) The idea that kptrs in a map need to have a statically identifiable allocator is
taken as an axiom, and then expanded to its extreme (a single allocator per map, as
opposed to the smarter verifier schemes). I still contend that this is not the case
and that the runtime overhead it avoids is paid back in bad developer experience
many times over.

3) The idea that allocators can be merged between elements and kptrs is independent
of the static requirements. If the map's internal allocator is exposed per 1), we can
still use it to allocate kptrs but not require that all kptrs in a map are from the
same allocator.

Going this coarse in the API is easy for us but fundamentally more limiting for
developers. It's not hard to imagine situations where verifier dependency tracking
or runtime lifetime tracking would allow pinned maps to be retained, but this scheme
would require entirely new maps (any situation where you've just refactored the
implicit allocator out in order to share it, for example).

I also don't think that simplicity for us (a one-time implementation cost plus a
continuous maintenance cost) trumps long-term developer experience (a much bigger
cost spread over a much longer time span).

> 
> The prog reload use case is naturally solved.
> By pinning map the user space pinned allocators and prog reload will reuse
> the same maps with the same allocators.
> 
> If this design revision makes sense the bpf_mem_alloc needs a bit of work
> to support the above cleanly. It should be straightforward.
> 
> > If there's a non-NULL value in the map, we can't do much - we need a program to load
> > that uses this map and on that program's unload, we can check again. On map free, we
> > can free the values, of course, but we can't remove the dependency edges from the
> > table, since values may have propagated to other tables (this depends on the concrete
> > implementation - we might be able to have the map remove all edges that reference
> > it).
> ...
> > I don't think it's worth the complexity, explicit or not.
> 
> The edge tracking dependency graph sounds quite complex and I agree that it's not worth it.

So far, my ranked-choice vote is:

1) maximum flexibility and runtime live object counts (with exposed allocators, I
like the merging)
2) medium flexibility with per-field allocator tracking in the verifier and the
ability to lose the association once programs are unloaded and values are gone. This
also works better with exposed allocators since they are implicitly pinned and would
be usable to store values in another map.
3) minimum flexibility with static whole-map kptr allocators

-- Delyan




