Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

Delyan Kratunov <delyank@xxxxxx> · Tue, 30 Aug 2022 20:31:55 +0000

On Mon, 2022-08-29 at 23:03 -0700, Alexei Starovoitov wrote:
> 
> On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> > > 
> > > So here is new proposal:
> > > 
> > 
> > Thanks for the proposal, Alexei. I think we're getting close to a
> > solution, but still some comments below.
> > 
> > > At load time the verifier walks all kptr_xchg(map_value, obj)
> > > and adds obj's allocator to
> > > map->used_allocators <- {kptr_offset, allocator};
> > > If kptr_offset already exists -> failure to load.
> > > Allocator can probably be a part of struct bpf_map_value_off_desc.
> > > 
> > > In other words the pairs of {kptr_offset, allocator}
> > > say 'there could be an object from that allocator in
> > > that kptr in some map values'.
> > > 
> > > Do nothing at prog unload.
> > > 
> > > At map free time walk all elements and free kptrs.
> > > Finally drop allocator refcnts.
> > > 
> > 
> > Yes, this should be possible.
> > It's quite easy to capture the map_ptr for the allocated local kptr.
> > Limiting each local kptr to one allocator is probably fine, at least for a v1.
> > 
> > One problem I see is how it works when the allocator map is an inner map.
> > Then, it is not possible to find the backing allocator instance at
> > verification time, hence not possible to take the reference to it in
> > map->used_allocators.
> > But let's just assume that is disallowed for now.
> > 
> > The other problem I see is that when the program just does
> > kptr_xchg(map_value, NULL), we may not have allocator info from
> > kptr_offset at that moment. Allocating prog which fills
> > used_allocators may be verified later. We _can_ reject this, but it
> > makes everything fragile (dependent on which order you load programs
> > in), which won't be great. You can then use this lost info to make
> > kptr disjoint from allocator lifetime.
> > 
> > Let me explain through an example.
> > 
> > Consider this order to set up the programs:
> > One allocator map A.
> > Two hashmaps M1, M2.
> > Three programs P1, P2, P3.
> > 
> > P1 uses M1, M2.
> > P2 uses A, M1.
> > P3 uses M2.
> > 
> > Sequence:
> > map_create A, M1, M2.
> > 
> > Load P1, uses M1, M2. What this P1 does is:
> > p = kptr_xchg(M1.value, NULL);
> > kptr_xchg(M2.value, p);
> > 
> > So it moves the kptr in M1 into M2. The problem is at this point
> > kptr_offset is not populated, so we cannot fill used_allocators of M2
> > as we cannot track which allocator is used to fill M1.value. We saw
> > nothing filling it yet.
> > 
> > Next, load P3. It does:
> > p = kptr_xchg(M2.value, NULL);
> > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so
> > this is ok if allocator is alive.
> > 
> > Again, M2.used_allocators is empty. Nothing is filled into it.
> > 
> > Next, load P2.
> > p = alloc(&A, ...);
> > kptr_xchg(M1.value, p);
> > 
> > Now, M1.used_allocators is filled with allocator ref and kptr_offset.
> > But M2.used_allocators will remain unfilled.
> > 
> > Now, run programs in sequence of P2, then P1. This will allocate from
> > A, and move the ref to M1, then to M2. But only P1 and P2 have
> > references to M1 so it keeps the allocator alive. However, now unload
> > both P1 and P2.
> > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2
> > is still held by loaded P3.
> > 
> > Now, M2.used_allocators is empty. P3 is using it, and it is holding
> > allocation from allocator A. Both M1 and A are freed.
> > When P3 runs now, it can kptr_xchg and try to free it, and the same
> > uaf happens again.
> > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr.
> > 
> > --
> > 
> > Will this case be covered by your approach? Did I miss something?
> > 
> > The main issue is that this allocator info can be lost depending on
> > how you verify a set of programs. It would not be lost if we verified
> > in order P2, P1, P3 instead of the current P1, P3, P2.
> > 
> > So we might have to teach the verifier to identify kptr_xchg edges
> > between maps, and propagate any used_allocators to the other map? But
> > it's becoming too complicated.
> > 
> > You _can_ reject loads of programs when you don't find kptr_offset
> > populated on seeing kptr_xchg(..., NULL), but I don't think this is
> > practical either. It makes the things sensitive to program
> > verification order, which would be confusing for users.
> 
> Right. Thanks for brainstorming and coming up with the case
> where it breaks.
> 
> Let me explain the thought process behind the proposal.
> The way the progs will be written will be something like:
> 
> struct foo {
>   int var;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_ALLOCATOR);
>   __type(value, struct foo);
> } ma SEC(".maps");
> 
> struct map_val {
>   struct foo * ptr __kptr __local;
> };
> 
> struct {
>   __uint(type, BPF_MAP_TYPE_HASH);
>   __uint(max_entries, 123);
>   __type(key, __u32);
>   __type(value, struct map_val);
> } hash SEC(".maps");
> 
> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo));
> bpf_kptr_xchg(&map_val->ptr, p);
> 
> Even if prog doesn't allocate and only does kptr_xchg like
> your P1 and P3 do the C code has to have a full
> definition 'struct foo' to compile P1 and P3.
> P1 and P3 don't need to see definition of 'ma' to be compiled,
> but 'struct foo' has to be seen.
> BTF reference will be taken by both 'ma' and by 'hash'.
> The btf_id will come from that BTF.
> 
> The type is tied to BTF and tied to kptr in map value.
> The type is also tied to the allocator.
> The type creates a dependency chain between allocator and map.
> So the restriction of one allocator per kptr feels
> reasonable and doesn't feel restrictive at all.
> That dependency chain is there in the C code of the program.
> Hence the proposal to discover this dependency in the verifier
> through tracking of allocator from mem_alloc into kptr_xchg.
> But you're correct that it's not working for P1 and P3.

Encoding the allocator in the runtime type system is fine but it comes with its own
set of tradeoffs.

> 
> I can imagine a few ways to solve it.
> 1. Ask users to annotate kptr local with the allocator
> that will be used.
> It's easy for progs P1 and P3. All definitions are likely available.
> It's only an extra tag of some form.

This would introduce maps referring to other maps and would thus require substantial
work in libbpf. In order to encode the link specific to field instead of the whole
map object, we'd have to resort to something like map names as type tags, which is
not a great design (arbitrary strings etc). 

> 2. move 'used_allocator' from map further into BTF,
>   since BTF is the root of this dependency chain.

This would _maybe_ work. The hole I can see in this plan is that once a slot is
claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the
local kptr fields useless (i.e. the allocator is around but no allocator map for it
exists and thus no program can use these fields again, they can only be free()).

The reason it can't be unclaimed is that programs were verified with a specific
allocator value in mind and we can't change that after they're loaded. To be able to
unclaim a slot, we'd need to remove everything using that system (i.e. btf object)
and load it again, which is not great.

> When the verifier sees bpf_mem_alloc in P2 it will add
> {allocator, btf_id} pair to BTF.
> 
> If P1 loads first and the verifier see:
> p = kptr_xchg(M1.value, NULL);
> it will add {unique_allocator_placeholder, btf_id} to BTF.
> Then
> kptr_xchg(M2.value, p); does nothing.
> The verifier checks that M1's BTF == M2's BTF and id-s are same.

Note to self: don't allow setting base_btf from userspace without auditing all these
checks.

> 
> Then P3 loads with:
> p = kptr_xchg(M2.value, NULL);
> unit_free(p);
> since unique_allocator_placholder is already there for that btf_id
> the verifier does nothing.
> 
> Eventually it will see bpf_mem_alloc for that btf_id and will
> replace the placeholder with the actual allocator.
> We can even allow P1 and P3 to be runnable after load right away.
> Since nothing can allocate into that kptr local those
> kptr_xchg() in P1 and P3 will be returning NULL.
> If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue.
> 
> Ideally for unit_free(p); in P3 the verifier would add a hidden
> 'ma' argument, so that allocator doesn't need to be stored dynamically.
> We can either insns of P3 after P2 was verified or
> pass a pointer to a place in BTF->used_allocator list of pairs
> where actual allocator pointer will be written later.
> Then no patching is needed.
> If P2 never loads the unit_free(*addr /* here it will load the
> value of unique_allocator_placeholder */, ...)
> but since unit_free() will never execute with valid obj to be freed.
> 
> "At map free time walk all elements and free kptrs" step stays the same.
> but we decrement allocator refcnt only when BTF is freed
> which should be after map free time.
> So btf_put(map->btf); would need to move after ops->map_free.
> 
> Maybe unique_allocator_placeholder approach can be used
> without moving the list into BTF. Need to think more about it tomorrow.

I don't think we have to resort to storing the type-allocator mappings in the BTF at
all.

In fact, we can encode them where you wanted to encode them with tags - on the fields
themselves. We can put the mem_alloc reference in the kptr field descriptors and have
it transition to a specific allocator irreversibly. We would need to record where any
equivalences between allocators we need for the currently loaded programs but it's
not impossible.

Making the transition reversible is the hard part of course.

Taking a step back, we're trying to convert a runtime property (this object came from
this allocator) into a static property. The _cleanest_ way to do this would be to
store the dependencies explicitly and propagate/dissolve them on program load/unload.
Note that only programs introduce new dependency edges, maps on their own do not
mandate dependencies but stored values can extend the lifetime of a dependency chain.

There are only a couple of types of dependencies:
Program X stores allocation from Y into map Z, field A
Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for
inner maps)
Program X uses field Z.A (therefore Z.A values may live on stack, so you can't walk
the map yet)

On program load, we materialize these in a table. They can have placeholder values or
take values from existing maps.

When a program loads that makes a placeholder a concrete allocator instance, we
propagate that to all related dependencies (it's kinda like constant propagation).
Calls to bpf_obj_free can receive a bpf_mem_alloc** to a field in that program's
dependency in this table.

On program unload, we can tag the relationships the program introduced as ready to
discard. Once the entire chain is ready to discard, no programs require this
relationship, so we can potentially walk the map and if all the values are NULL,
dissolve the relationship. The map field can now be used with a different allocator.

If there's a non-NULL value in the map, we can't do much - we need a program to load
that uses this map and on that program's unload, we can check again. On map free, we
can free the values, of course, but we can't remove the dependency edges from the
table, since values may have propagated to other tables (this depends on the concrete
implementation - we might be able to have the map remove all edges that reference
it).

Once all relationships for allocator A are gone, we can destroy it safely.

This scheme allows for reuse of maps so long as the values get cleared before
attempts to store in that field from a second allocator. It does not allow mixing
allocators for the same field.

I may have missed relationships but they too can follow this pattern - you store them
explicitly, then reason about them at program load/unload or map free.

Anything less explicit is doing this anyway, either less precisely or less clearly.

All this machinery _does_ create the closest approximation of the runtime property
(per field instead of per value) but it's also approximating something that the
allocator can easily keep track of itself, precisely, at runtime.

I don't think it's worth the complexity, explicit or not.

-- Delyan