Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

On Tue, 2022-08-30 at 00:07 +0200, Kumar Kartikeya Dwivedi wrote:
> 
> On Mon, 29 Aug 2022 at 23:29, Delyan Kratunov <delyank@xxxxxx> wrote:
> > 
> > Thanks for taking a look, Kumar!
> > 
> > On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote:
> > > > 
> > > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@xxxxxx> wrote:
> > > > > > 
> > > > > > Alexei and I spent some time today going back and forth on what the uapi to this
> > > > > > allocator should look like in a BPF program. To our mutual surprise, the problem
> > > > > > space turned out to be far more complicated than we anticipated.
> > > > > > 
> > > > > > There are three primary problems we have to solve:
> > > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when
> > > > > > necessary (e.g., freeing a map).
> > > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating
> > > > > > objects of kernel types and passing them to helpers/kfuncs that expect genuinely
> > > > > > kernel-allocated objects.) This is especially important because the existing kptr
> > > > > > mechanism assumes kernel types everywhere.
> > > > 
> > > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish
> > > > local vs kernel kptr?
> > 
> > Answered below.
> > 
> > > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now.
> > > > 
> > > > Fun fact: I added a similar check on purpose in map_kptr_match_type,
> > > > since Alexei mentioned back then he was working on a local type
> > > > allocator, so forgetting to add it later would have been a problem.
> > > > 
> > > > > > 3) Allocated objects' lifetimes, allocator refcounting, etc. It all gets very
> > > > > > hairy when you allow allocated objects in pinned maps.
> > > > > > 
> > > > > > This is the proposed design that we landed on:
> > > > > > 
> > > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity
> > > > > > at creation time. Value_size > 0 takes the kmem_cache path, probably with
> > > > > > btf_value_type_id enforcement on that path.
> > > > > > 
> > > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct
> > > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer.
> > > > > > 
> > > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
> > > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID |
> > > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves
> > > > > > against program-local btf (in the verifier; at runtime it only receives an
> > > > > > allocation size).
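
For concreteness, here is roughly what the program-side usage from items 1-3 would
look like. Everything below is provisional naming from the proposal above
(BPF_MAP_TYPE_ALLOCATOR, bpf_obj_alloc/bpf_obj_free, the __kptr_local tag), not an
existing API, so treat it as a sketch:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* provisional tag, defined by analogy with today's __kptr */
#define __kptr_local __attribute__((btf_type_tag("kptr_local")))

/* proposed helper signatures from item 2; purely illustrative */
void *bpf_obj_alloc(void *map, __u64 local_type_id);
void bpf_obj_free(void *obj);

struct foo {
    long counter;
};

struct map_value {
    struct foo __kptr_local *f; /* verifier type PTR_TO_BTF_ID | BTF_ID_LOCAL */
};

struct {
    __uint(type, BPF_MAP_TYPE_ALLOCATOR); /* proposed map type, item 1 */
    __uint(max_entries, 256);             /* initial capacity */
    __type(value, struct foo);            /* value_size > 0: kmem_cache path */
} allocator SEC(".maps");

SEC("tc")
int example(struct __sk_buff *skb)
{
    struct foo *f;

    /* type id resolves against program-local BTF, per item 3 */
    f = bpf_obj_alloc(&allocator, bpf_core_type_id_local(struct foo));
    if (!f)
        return 0;
    f->counter = 1;
    /* frees take only the object pointer; the owning allocator is
     * recovered from the hidden header (item 4.1 below) */
    bpf_obj_free(f);
    return 0;
}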
> > > > 
> > > > This is ok too, but I think just gating everywhere with btf_is_kernel
> > > > would be fine as well.
> > 
> > 
> > Yeah, I can get behind not using BTF_ID_LOCAL as a type flag and just encoding that
> > in the btf field of the register/stack slot/kptr/helper proto. That said, we still
> > need the new type tag to tell the map btf parsing code to use the local btf in the
> > kptr descriptor.
> > 
> 
> Agreed, the new __kptr_local type tag looks necessary to make the kptr
> descriptor parsing search the map's own BTF instead.
> 
> > > > 
> > > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce
> > > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the
> > > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This
> > > > 
> > > > It would be fine to allow passing if it is just plain data (e.g. what the
> > > > scalar_struct check does for kfuncs). There we had the issue that an argument
> > > > can take PTR_TO_MEM, PTR_TO_BTF_ID, etc., so it was necessary to restrict the
> > > > kind of type to the lowest common denominator (LCD).
> > > > 
> > > > But we don't have to do it from day 1, just listing what should be ok.
> > 
> > That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic.
> > 
> > > > 
> > > > > > would be the only place these objects can leak out of bpf land. If there's no runtime
> > > > > > aspect (and there likely wouldn't be), we might consider doing this transparently,
> > > > > > still against an allowlist of types.
> > > > > > 
> > > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be
> > > > > > able to identify which allocator each __kptr_local pointer came from, and we must
> > > > > > keep the refcount up while any such values are alive. One concern here is that doing
> > > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution
> > > > > > is to:
> > > > > > 4.1. Keep a struct bpf_mem_alloc* in a header before the object pointer returned
> > > > > > from bpf_mem_alloc(). This way we never lose track of which bpf_mem_alloc to return
> > > > > > the object to, and bpf_obj_free() can take just the object pointer.
> > > > > > 4.2. Track used_allocators in each bpf_map. When unloading a program, we would
> > > > > > walk all maps that the program has access to (that have kptr_local fields), walk each
> > > > > > value, and ensure that any allocators not already in the map's used_allocators are
> > > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by
> > > > > > their bpf_map wrapper, but once that's gone, used_allocators is the main mechanism.
> > > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects; we
> > > > > > can only return objects to it.
> > > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
> > > > > > refcount_dec all the used_allocators.
> > > > > > 
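
To make 4.1 concrete, here is a minimal user-space sketch of the header scheme,
with malloc/free standing in for the real percpu freelists (obj_header and the
function names are illustrative, not from the patchset):

#include <stdlib.h>

struct bpf_mem_alloc; /* opaque for the purposes of the sketch */

struct obj_header {
    struct bpf_mem_alloc *ma; /* owning allocator, written once at alloc time */
};

static void *obj_alloc(struct bpf_mem_alloc *ma, size_t size)
{
    struct obj_header *hdr = malloc(sizeof(*hdr) + size);

    if (!hdr)
        return NULL;
    hdr->ma = ma;   /* 4.1: record the owner in the hidden header */
    return hdr + 1; /* the program only ever sees the payload */
}

static void obj_free(void *obj)
{
    struct obj_header *hdr = (struct obj_header *)obj - 1;

    /* hdr->ma is everything the free path needs to locate, which is
     * why bpf_obj_free() can take just the object pointer */
    free(hdr);
}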
> > > > 
> > > > So to summarize your approach:
> > > > Each allocation has a bpf_mem_alloc pointer before it to track its
> > > > owner allocator.
> > > > We know the used_maps of each prog, so during program unload, walk all
> > > > local kptrs in each used_map's values, and have that map take a
> > > > reference to the allocator, stashing it in its used_allocators list.
> > > > This is because the prog is going to relinquish its ref to the
> > > > allocator_map (which, if it were the last one, would otherwise release
> > > > the allocator even though local kptrs held by those maps still point
> > > > into it).
> > > > Once prog is gone, the allocator is kept alive by other maps holding
> > > > objects allocated from it. References to the allocator are taken
> > > > lazily when required.
> > > > Did I get it right?
> > 
> > That's correct!
> > 
> > > > 
> > > > I see two problems: the first is concurrency. When walking each value,
> > > > it is going to be hard to ensure the kptr field remains stable while
> > > > you load and take a ref to its allocator. Some other programs may also
> > > > have access to the map value and may concurrently change the kptr
> > > > field (xchg and even release it). How do we safely do a refcount_inc
> > > > of its allocator?
> > 
> > Fair question. You can think of that pointer as immutable for the entire time that
> > the allocator is able to interact with the object. Once the object makes it onto a
> > freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time
> > that value can change - when we return the object to the global kmalloc pool - it has
> > provably no bpf-side concurrent observers.
> > 
> 
> I don't think that assumption will hold. Requiring RCU protection for
> all local kptrs means allocator cache reuse becomes harder, as
> elements are stuck in freelist until the next grace period. It
> necessitates use of larger caches.
> For some use cases where they can tolerate reuse, it might not be
> optimal. IMO the allocator should be independent of how the lifetime
> of elements is managed.

All maps and allocations are already rcu-protected, we're not adding anything new
here. We're only relying on this rcu protection (cf. the chain of
call_rcu_tasks_trace and call_rcu in the patchset) to prove that no program can
observe an allocator pointer that is corrupted or stale.
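
In code terms, the shape of what we rely on is the usual deferred-free ordering.
A sketch with illustrative names (the patchset actually chains
call_rcu_tasks_trace into call_rcu, but the argument is the same), extending the
header from 4.1 with an rcu_head:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj_header {
    struct bpf_mem_alloc *ma;
    struct rcu_head rcu;
};

static void obj_reclaim_rcu(struct rcu_head *rcu)
{
    struct obj_header *hdr = container_of(rcu, struct obj_header, rcu);

    kfree(hdr); /* only here can hdr->ma become invalid */
}

static void obj_return_to_kmalloc(struct obj_header *hdr)
{
    /* every bpf-side reader of hdr->ma runs under RCU, so by the time
     * the callback fires there are provably no concurrent observers */
    call_rcu(&hdr->rcu, obj_reclaim_rcu);
}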

> 
> That said, even if you assume RCU protection, that still doesn't
> address the real problem. Yes, you can access the value without
> worrying about it moving to another map, but consider this case:
> Prog unloading:
> populate_used_allocators -> the map walks its values and tries to take
> a reference to the allocator backing each local kptr, here allocator A.
> It loads the kptr; meanwhile some other prog sharing access to the map
> value exchanges (kptr_xchg) another pointer into that field.
> Now you take a reference to allocator A in used_allocators, while the
> actual value in the map is from allocator B.

This is fine. The algorithm is conservative; it may keep allocators around longer
than they're needed. Eventually this map will no longer be accessible to any
program, at which point both allocators A and B will be released.

It is possible to make a more precise algorithm, but given that this behavior is
only really a problem with (likely pinned) long-lived maps, it's imo not worth it
for v1.
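
In other words, the unload-time walk can be as simple as the sketch below, where
for_each_local_kptr(), allocator_tracked() and the refcnt/node fields are
hypothetical stand-ins:

static void map_collect_used_allocators(struct bpf_map *map)
{
    struct obj_header *hdr;
    void *obj;

    for_each_local_kptr(map, obj) { /* hypothetical iterator */
        if (!obj)
            continue;
        hdr = (struct obj_header *)obj - 1;
        /* a racing kptr_xchg may have already moved this object
         * elsewhere; we then pin an allocator the map no longer needs,
         * which is safe, merely conservative */
        if (!allocator_tracked(map, hdr->ma)) {
            refcount_inc(&hdr->ma->refcnt);
            list_add(&hdr->ma->node, &map->used_allocators);
        }
    }
}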

> 
> So you either have to cmpxchg and then retry if it fails (which is not
> a wait-free operation, and honestly not great imo), or if you don't
> handle this:
> Someone moved an allocated local kptr backed by B into your map, but
> you don't hold a reference to it.

You don't need a reference while something else is holding the allocator alive. The
references exist to extend the lifetime past the lifetimes of programs that can
directly reference the allocator.

> That other program may release
> allocator map -> allocator, 

The allocator map cannot be removed while it's in used_maps of the program. On
program unload, we'll add B to the used_allocators list, as a value from B is in the
map. Only then will the allocator map be releasable.

> and then when this map is destroyed, unit_free will be a
> use-after-free on the bpf_mem_alloc *.
> 
> I don't see an easy way around these kinds of problems. And this is
> just one specific scenario.
> 
> > Alexei, please correct me if I misunderstood how the design is supposed to work.
> > 
> > > > 
> > > > For the second problem, consider this:
> > > > obj = bpf_obj_alloc(&alloc_map, ...);
> > > > inner_map = bpf_map_lookup_elem(&map_in_map, ...);
> > > > map_val = bpf_map_lookup_elem(inner_map, ...);
> > > > kptr_xchg(&map_val->kptr, obj);
> > > > 
> > > > Now delete the entry holding that inner_map, but keep its fd open.
> > > > Unload the program; since it is map-in-map, there is no way to fill used_allocators.
> > > > alloc_map is freed and releases its reference on the allocator; the allocator is freed.
> > > > Now close(inner_map_fd) and inner_map is freed: either a bad unit_free or a memory leak.
> > > > Is there a way to prevent this in your scheme?
> > 
> > This is fair, inner maps not being tracked in used_maps is a wrench in that plan.
> > However, we can have the parent map propagate its used_allocators on inner map
> > removal.
> > 
> 
> But used_allocators are not populated until unload? inner_map removal
> can happen while the program is loaded/attached.
> Or will you populate it before unloading, every time during inner_map
> removal? Then that would be very costly for a bpf_map_delete_elem...

It's not free, granted, but it only kicks in if the map-of-maps has already
acquired a used_allocators list.
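
Concretely, propagation on inner map removal could look something like the sketch
below; allocator_ref and inner_map_track_allocator are hypothetical:

static void outer_map_propagate_allocators(struct bpf_map *outer,
                                           struct bpf_map *inner)
{
    struct allocator_ref *ref; /* hypothetical used_allocators entry */

    /* a no-op until the outer map has tracked any allocators at all */
    list_for_each_entry(ref, &outer->used_allocators, node)
        inner_map_track_allocator(inner, ref->ma); /* takes its own ref */
}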

I'd be okay with handling map-of-map via liveness counts too.

> 
> > > > -
> > > > 
> > > > I had another idea, but it's not _completely_ zero overhead. This is
> > > > heavy prototyping, so I might be missing corner cases.
> > > > It is to take a reference on each allocation and deallocation. Yes,
> > > > naive and slow if using atomics, but we can use percpu_ref instead of
> > > > an atomic refcount for the allocator: percpu_ref_get/put on each
> > > > unit_alloc/unit_free.
> > > > The problem though is that once the initial reference is killed,
> > > > percpu_ref downgrades to atomic mode, which will kill performance. So
> > > > we need to be smart about how that initial reference is managed.
> > > > My idea is that the initial ref is taken and killed by the allocator
> > > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot
> > > > do any more allocations anyway (since you need to pass the map pointer
> > > > to bpf_obj_alloc), so when it downgrades to atomics at that point, we
> > > > will only be releasing references while freeing the remaining
> > > > allocated objects. Yes, the free path then becomes a bit costly after
> > > > the allocator map is gone.
> > > > 
> > > > We might be able to remove the cost on the free path as well using the
> > > > used_allocators scheme from above (to delay percpu_ref_kill), but it
> > > > is not clear how to safely increment the ref of the allocator from a
> > > > map value...
> > 
> > As explained above, the values are already rcu-protected, so we can use that to
> > coordinate refcounting of the allocator. That said, percpu_ref could work (I was
> > considering something similar within the allocator itself), but I'm not convinced the
> > cost is right. My main concern would be that once it becomes atomic_t, it erases the
> > benefits of all the work in the allocator that maintains percpu data structures.
> > 
> > If we want to go down this path, the allocator can maintain percpu live counts (with
> > underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then,
> > we can have explicit "sum up all the counts to discover if you should be destroyed"
> > calls. If we keep the used_allocators scheme, these calls can be inserted at program
> > unload for maps in used_maps and at map free time for maps that escape that
> > mechanism.
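
Roughly like the sketch below; the live_cnt field, its exact home in
bpf_mem_cache, and ma->cache being the percpu pointer are all assumptions:

struct bpf_mem_cache {
    /* ... existing percpu freelist state ... */
    long live_cnt; /* net allocs minus frees on this cpu; may go
                    * negative on a cpu that mostly frees */
};

static bool allocator_has_live_objects(struct bpf_mem_alloc *ma)
{
    long sum = 0;
    int cpu;

    /* only meaningful once the allocator map is gone, i.e. no new
     * allocations can race with the summation */
    for_each_possible_cpu(cpu)
        sum += per_cpu_ptr(ma->cache, cpu)->live_cnt;
    return sum != 0;
}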
> 
> Yes, it would be a good idea to put the percpu refcount for this case
> inside the already percpu bpf_mem_cache struct, since that will be
> much better than allocating it separately. The increment will then be
> a 100% cache hit.
> 
> The main question is how this "sum up all the counts" operation needs
> to be done. Once that initial reference from the bpf_map is gone, you need
> to track the final owner who will be responsible to release the
> allocator. You will need to do something similar to percpu_ref's
> atomic count upgrade unless I'm missing something.

It's actually easier, since we can limit the checks to just the case where there's
no reference from an allocator map left. We can also postpone the work to the rcu
gp to make it even easier to sequence with the individual elements' free()s.

> 
> Once you establish that used_allocators cannot be safely populated on
> unload (which you can correct me on), the only utility I see for it is
> delaying the atomic upgrade for this idea.
> 
> So another approach (though I don't like this too much):
> One solution to delay the upgrade can be that when you have allocator
> maps and other normal maps in used_maps, it always incs ref on the
> allocator pinned by allocator map on unload for maps that have local
> kptr, so each map having local kptrs speculatively takes ref on
> allocator maps of this prog, assuming allocations from that allocator
> map are more likely to be in these maps.
> 
> Same with other progs having different allocator map but sharing these maps.
> 
> It is not very precise, but until those maps are gone it delays the
> release of the allocator (we can empty all percpu caches to save
> memory once the bpf_map pinning the allocator is gone, because
> allocations are not going to be served). But it allows unit_free to
> remain relatively cheap as long as those 'candidate' maps are around.

Yes, we considered this, but with it, it's much easier to get to pathological
behaviors just by loading and unloading programs that can access an allocator in a
loop. Emptying the freelists helps, but it's still quite easy to hold a lot of
memory for nothing.

The pointer walk was proposed to prune most such pathological cases while still being
conservative enough to be easy to implement. Only races with the pointer walk can
extend the lifetime unnecessarily.

> 
> 
> 
> > 
> > Or, we just extend the map-in-map mechanism to propagate used_allocators as needed.
> > There are nice debug properties of the allocator knowing the liveness counts but we
> > don't have to put them on the path to correctness.
> > 
> > > > 
> > > > wdyt?
> > > > 
> > > > > > Overall, we think this handles all the nasty corners - objects escaping into
> > > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations,
> > > > > > programs accessing multiple allocators having deterministic freelist behaviors -
> > > > > > while keeping the API and complexity sane. The used_allocators approach can
> > > > > > certainly be less conservative (or even fully precise), but for a v1 that's
> > > > > > probably overkill.
> > > > > > 
> > > > > > Please feel free to shoot holes in this design! We tried to capture everything,
> > > > > > but I'd love confirmation that we didn't miss anything.
> > > > > > 
> > > > > > --Delyan
> > 
> > 





