Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure

On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
> 
> I don't mind using active_lock.id for invalidation, but using reg->id to
> associate it with reg is a bad idea IMO, it's already preserved and set when the
> object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
> with that non-owning ref if it has a spin lock, essentially unlocking a different
> spin lock if the reg->btf of already locked spin lock reg is same due to same
> active_lock.id.

Right. Overwriting reg->id was a bad idea.

> Even if you prevent it somehow it's more confusing to overload reg->id again for
> this purpose.
> 
> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
> purpose, to associate it back to the reg->id of the collection it is coming from.

The nonref_obj_id name sounds too generic, and I'm not sure that it shouldn't be
connected to reg->id the way we do it for ref_obj_id.

> Also, there are two cases of invalidation, one is on remove from rbtree, which
> should only invalidate non-owning references into the rbtree, and one is on
> unlock, which should invalidate all non-owning references.

Two cases only if we're going to do invalidation on rbtree_remove.

> bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected by the same
> lock, but unlocking should do it for both rbtree and list non-owning refs it is
> protecting.
> 
> So it seems you will have to maintain two IDs for non-owning references, one for
> the collection it comes from, and one for the lock region it is obtained in.

Right. Like this?
collection_id = rbroot->reg->id;            // to track the collection it came from
active_lock_id = cur_state->active_lock.id; // to track the lock region

But before we proceed, let me demonstrate an example where
cleanup on rbtree_remove is not user friendly:

bpf_spin_lock
x = bpf_list_first(); if (!x) ..   // x: non-owning ref
y = bpf_list_last(); if (!y) ..    // y: non-owning ref

n = bpf_list_remove(x); if (!n) .. // n: owning ref

bpf_list_add_after(n, y); // we should allow this
bpf_spin_unlock

We don't have such APIs right now.
The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will
destroy all regs that point somewhere in the collection ('y' above included).
This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.

x and y could be pointing to the same thing.
In such a case bpf_list_add_after() should fail at run-time after discovering
that 'y' is unlinked.
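
That run-time discovery can be cheap. Roughly (hypothetical kfunc;
assuming an unlinked bpf_list_node has its prev/next pointing at itself,
the way INIT_LIST_HEAD() leaves them):

int bpf_list_add_after(struct bpf_list_node *node, struct bpf_list_node *after)
{
        struct list_head *n = (void *)node, *a = (void *)after;

        if (list_empty(a))      /* 'after' was already unlinked */
                return -EINVAL;
        list_add(n, a);
        return 0;
}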

Similarly with bpf_rbtree_add().
Currently it cannot fail. It takes an owning ref and will release it.
We can mark it as KF_RELEASE and no extra verifier changes are necessary.

But in the future we might have failing add/insert operations on lists and rbtrees.
If they can fail, we'd need to struggle with 'conditional release' verifier additions,
the bpf prog would need to check the return value, etc.

I think we'd better deal with it at run-time.
The verifier could supply bpf_list_add_after() with two hidden args:
- the container_of offset (delta between the rb_node and the beginning of the prog's struct)
- struct btf_struct_meta *meta
Then inside bpf_list_add_after or any failing KF_RELEASE kfunc
it can call bpf_obj_drop_impl() on that element.
Then from the verifier's pov the KF_RELEASE function did the release
and the 'owning ref' became a 'non-owning ref'.
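
Extending the sketch above with the two hidden args (bpf_obj_drop_impl()
exists today; the kfunc itself and its 'off'/'meta' args are hypothetical):

int bpf_list_add_after_impl(struct bpf_list_node *node,
                            struct bpf_list_node *after,
                            u64 off, struct btf_struct_meta *meta)
{
        struct list_head *n = (void *)node, *a = (void *)after;

        if (list_empty(a)) {
                /* cannot link: drop the element ourselves, so from the
                 * verifier pov the KF_RELEASE still happened
                 */
                bpf_obj_drop_impl((void *)n - off, meta);
                return -EINVAL;
        }
        list_add(n, a);
        return 0;
}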

> > >> And you're also adding 'untrusted' here, mainly as a result of
> > >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> > >> instead of becoming a non-owning ref. 'untrusted' would have state like:
> > >>
> > >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> > >> PTR_UNTRUSTED
> > >> ref_obj_id == 0?
> > >
> > > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> > > or doing 'non-owning' is enough.
> > > If it's full untrusted it will be:
> > > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> > >
> >
> > Yeah, I don't see what this "full untrusted" is giving us either. Let's have
> > "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
> > instead of converting to "full untrusted"?
> >
> 
> +1, I prefer invalidating completely on unlock.

fine by me.

> 
> I think it's better to clean by invalidating. We have better tools to form
> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
> such an escape hatch for some reason. It's also easier to review where an
> untrusted pointer is being used in a program, and has zero cost at runtime.

ok. Since it's more strict we can relax to untrusted later if necessary.

> So far I'm leaning towards:
> 
> bpf_rbtree_add(node) : node becomes a non-owning ref
> bpf_spin_unlock(lock) : node is invalidated

ok

> > > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> > > and non-owning after bpf_rbtree_add.
> > >
> > > Walking the example from previous email:
> > >
> > > struct bpf_rbtree_iter it;
> > > struct bpf_rb_node * node;
> > > struct bpf_rb_node *n, *m;
> > >
> > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
> > > while ((node = bpf_rbtree_iter_next(&it))) {
> > >   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> > >   if (node && node->field == condition) {
> > >
> > >     n = bpf_rbtree_remove(rb_root, node);
> > >     if (!n) ...;
> > >     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> > >     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> > >     if (!m) ...;
> > >     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> > >
> 
> This second remove I would simply disallow as Dave is suggesting during
> verification, by invalidating non-owning refs for rb_root.

Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our
only remaining disagreement.
I feel run-time checks will be fast enough and will improve usability.

Also, it feels like not doing cleanup on rbtree_remove is simpler to
implement and reason about.

Here is the proposal with one new field 'active_lock_id':

first = bpf_rbtree_first(root) KF_RET_NULL
  check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
  R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
  R0->active_lock_id = root->reg->id
  R0->id = ++env->id_gen; which will be cleared after the !NULL check inside the prog.

The same way we can add rb_find, rb_find_first,
but not rb_next, rb_prev, since they don't have a 'root' argument.

bpf_rbtree_add(root, node, cb); KF_RELEASE.
  needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
  check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
  calls release_reference(node->ref_obj_id)
  converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
  node->active_lock_id = root->reg->id

'node' is equivalent to 'first'. They both point to some element
inside the rbtree and are valid inside the spin_locked region.
It's ok to read|write through both under the lock.
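
For example (hypothetical prog following the pseudocode here; glock, groot,
node_data and the less() callback are made-up names, and the
private()/__contains() annotations on the lock and root are omitted):

struct node_data {
        struct bpf_rb_node node;
        long key;
};

struct bpf_spin_lock glock;
struct bpf_rb_root groot;

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
        return container_of(a, struct node_data, node)->key <
               container_of(b, struct node_data, node)->key;
}

SEC("tc")
int rbtree_example(void *ctx)
{
        struct node_data *n = bpf_obj_new(typeof(*n));
        struct bpf_rb_node *first;

        if (!n)
                return 0;
        n->key = 1;

        bpf_spin_lock(&glock);
        bpf_rbtree_add(&groot, &n->node, less); // n: owning -> non-owning
        first = bpf_rbtree_first(&groot);       // first: non-owning, maybe NULL
        if (first)
                container_of(first, struct node_data, node)->key = 42;
        bpf_spin_unlock(&glock);                // n and first invalidated here
        return 0;
}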

removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
  needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id == 0, and the
  usual check_reg_allocation_locked(root)
  R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL
  R0->ref_obj_id = R0->id = acquire_reference_state();
  R0->active_lock_id should stay 0
  mark_reg_unknown(node)

bpf_spin_unlock(lock);
  checks lock->id == cur->active_lock.id
  for all regs in state 
    if (reg->active_lock_id == lock->id)
       mark_reg_unknown(reg)
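
On the verifier side that loop can reuse the existing machinery. Roughly
(assuming the new active_lock_id field in bpf_reg_state;
bpf_for_each_reg_in_vstate() and __mark_reg_unknown() already exist):

static void invalidate_lock_scoped_refs(struct bpf_verifier_env *env, u32 lock_id)
{
        struct bpf_func_state *unused;
        struct bpf_reg_state *reg;

        bpf_for_each_reg_in_vstate(env->cur_state, unused, reg, ({
                if (reg->active_lock_id == lock_id)
                        __mark_reg_unknown(env, reg);
        }));
}

called from process_spin_lock() on the unlock path.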


