On Sat, Apr 22, 2023 at 08:42:47PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 22, 2023 at 05:21:36AM CEST, Alexei Starovoitov wrote:
> > On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > > > This series adds support for refcounted local kptrs to the verifier. A local
> > > > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> > > >
> > > > struct refcounted_node {
> > > >         long data;
> > > >         struct bpf_list_node ll;
> > > >         struct bpf_refcount ref;
> > > > };
> > > >
> > > > bpf_refcount is used to implement shared ownership for local kptrs.
> > > >
> > > > Motivating usecase
> > > > ==================
> > > >
> > > > If a struct has two collection node fields, e.g.:
> > > >
> > > > struct node {
> > > >         long key;
> > > >         long val;
> > > >         struct bpf_rb_node rb;
> > > >         struct bpf_list_node ll;
> > > > };
> > > >
> > > > It's not currently possible to add a node to both the list and rbtree:
> > > >
> > > > long bpf_prog(void *ctx)
> > > > {
> > > >         struct node *n = bpf_obj_new(typeof(*n));
> > > >         if (!n) { /* ... */ }
> > > >
> > > >         bpf_spin_lock(&lock);
> > > >
> > > >         bpf_list_push_back(&head, &n->ll);
> > > >         bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
> > > >         bpf_spin_unlock(&lock);
> > > > }
> > > >
> > > > The above program will fail verification due to current owning / non-owning ref
> > > > logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
> > > > passed to bpf_rbtree_add. The only way to get an owning reference for the node
> > > > that was added is to bpf_list_pop_{front,back} it.
> > > >
> > > > More generally, verifier ownership semantics expect that a node has one
> > > > owner (program, collection, or stashed in map) with exclusive ownership
> > > > of the node's lifetime. The owner frees the node's underlying memory when it
> > > > itself goes away.
> > > >
> > > > Without a shared ownership concept it's impossible to express many real-world
> > > > usecases such that they pass verification.
> > > >
> > > > Semantic Changes
> > > > ================
> > > >
> > > > Before this series, the verifier could make this statement: "whoever has the
> > > > owning reference has exclusive ownership of the referent's lifetime". As
> > > > demonstrated in the previous section, this implies that a BPF program can't
> > > > have an owning reference to some node if that node is in a collection. If
> > > > such a state were possible, the node would have multiple owners, each thinking
> > > > they have exclusive ownership. In order to support shared ownership it's
> > > > necessary to modify the exclusive ownership semantic.
> > > >
> > > > After this series' changes, an owning reference has ownership of the referent's
> > > > lifetime, but it's not necessarily exclusive. The referent's underlying memory
> > > > is guaranteed to be valid (i.e. not freed) until the reference is dropped or
> > > > used for collection insert.
> > > >
> > > > This change doesn't affect UX of owning or non-owning references much:
> > > >
> > > >   * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
> > > >     an owning reference arg, as ownership still must be passed to the
> > > >     collection in a shared-ownership world.
> > > >
> > > >   * non-owning references still refer to valid memory without claiming
> > > >     any ownership.
> > > > [...]
> > >
> > > I think there are a series of major issues right now.
> > > I am not sure everything can be addressed using bug fixes.
> > >
> > > If I had to summarize the main problems in one-liners:
> > > The mutation of the node fields of an object can be racy.
> > > Lack of collection identity either at runtime or during verification.
> > >
> > > --
> > >
> > > It is possible for multiple CPUs to get owned references to an object containing
> > > an rbtree or list node, and they can attempt to modify those fields in parallel
> > > without any synchronization.
> > >
> > > CPU 0                                       CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > >                                             m = kptr_xchg(map, NULL)
> > >                                             // m == n
> > > bpf_spin_lock(lock1)                        bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)                  bpf_rbtree_add(rbtree2, n)
> > > if (!RB_EMPTY_NODE)                         if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)                      bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> > > sufficient. Consider this:
> > >
> > > Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> > >
> > > CPU 0                                       CPU 1
> > > n = bpf_obj_new(...)
> > > bpf_spin_lock(lock1)
> > > bpf_rbtree_add(rbtree1, n)
> > > m = bpf_refcount_acquire(n)
> > > bpf_spin_unlock(lock1)
> > >                                             bpf_spin_lock(lock1)
> > >                                             n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> > >                                             bpf_spin_unlock(lock1)
> > >                                             // let m == n
> > > bpf_spin_lock(lock1)                        bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)                  bpf_rbtree_add(rbtree2, n)
> > > if (!RB_EMPTY_NODE)                         if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)                      bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > There's also no discussion on the problem with collection identities we
> > > discussed before (maybe you plan to address it later):
> > > https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@xxxxxxxx
> > >
> > > But static tracking won't be sufficient any longer. Consider another case
> > > where the runtime will be confused about which rbtree a node belongs to.
> > >
> > > CPU 0                                       CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > >                                             p = kptr_xchg(map, NULL)
> > >                                             lock(lock2)
> > >                                             bpf_rbtree_add(rbtree2, p->rnode)
> > >                                             unlock(lock2)
> > > lock(lock1)
> > > bpf_list_push_back(head1, n->lnode)
> > > // make n non-owning ref
> > > bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> > > unlock(lock1)
> > Thanks for describing the races.
> > > --
> > >
> > > I can come up with multiple other examples. The point I'm trying to drive home
> > > is that it's a more fundamental issue in the way things are set up right now,
> > > not something that was overlooked during the implementation.
> > >
> > > The root cause is that access to a node's fields is going to be racy once
> > > multiple CPUs have *mutable* references to it. The lack of ownership information
> > > mapped to the collection either at runtime or during verification is also a
> > > separate problem.
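
To make the window concrete: today's add kfuncs are essentially a non-atomic
check-then-link. A simplified sketch of that pattern (illustrative only, not
the exact kernel code; the function name is made up for this example):

  #include <linux/rbtree.h>

  static int rbtree_add_locked(struct rb_root_cached *root, struct rb_node *node,
                               bool (*less)(struct rb_node *, const struct rb_node *))
  {
          if (!RB_EMPTY_NODE(node))
                  return -EINVAL;         /* "already linked", caller drops the ref */
          /*
           * Window: another CPU holding a *different* lock can pass the same
           * RB_EMPTY_NODE() check for the same node right here, and both CPUs
           * proceed to link the one embedded rb_node into two different trees.
           */
          rb_add_cached(node, root, less);
          return 0;
  }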
> > >
> > > When we were discussing this a few months ago, we tried to form consensus on
> > > synchronizing updates over a node using an 'owner' pointer to eliminate similar
> > > races. Every node field has an associated owner field, which each updater
> > > modifying that node synchronizes over.
> > >
> > > In short:
> > > Node's owner describes its state at runtime.
> > > node.owner == ptr_of_ds // part of DS
> >
> > what is DS?
> >
>
> The graph data structure.
>
> > > node.owner == NULL // not part of DS
> > > node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> > >
> > > cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> > > bpf_rbtree_add and such will do this to claim node ownership before trying to
> > > link it in, and then store owner to ptr_of_ds. The _store_ will be the
> > > *linearization* point of bpf_rbtree_add, not cmpxchg.
> > >
> > > READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> > > remain in this state until the end of CS, since ptr_of_ds to NULL only happens
> > > in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> > > before WRITE_ONCE of NULL to unlink a node.
> > >
> > > Of course, this is slower, and requires extra space in the object, but unless
> > > this or something else is used to eliminate races, there will be bugs.
> > Makes sense to me. Where do you propose to store the owner? Another explicit
> > field that bpf prog has to provide inside a node?
> > A hidden field in bpf_refcount ?
> > A hidden field in bpf_list_node / bpf_rb_node ?
>
> It will have to be with each bpf_list_node / bpf_rb_node.

yeah. realized after pressing 'send'.

> Pseudocode-wise, this is how things will work:
>
> bpf_rbtree_add(rbtree, node):
>         // Move a node from free to busy state
>         if (!cmpxchg(node.owner, NULL, BPF_PTR_POISON))
>                 return -EINVAL

should be bpf_obj_drop here instead of plain return.
In other words, the current if (!RB_EMPTY_NODE()) bpf_obj_drop
will be replaced by the above cmpxchg.

>         __rb_add(...)
>         WRITE_ONCE(node.owner, rbtree)
>
> bpf_rbtree_remove(rbtree, node):
>         // Check if node is part of rbtree:
>         // If != rbtree, it could be changing concurrently
>         // If == rbtree, changing it requires the lock we are holding
>         if (READ_ONCE(node.owner) != rbtree)
>                 return -EINVAL

should be 'return NULL'.
The above check will replace the current RB_EMPTY_NODE check.

>         __rb_remove(...)
>         WRITE_ONCE(node.owner, NULL)

compared to existing code it's only an extra WRITE_ONCE and a cmpxchg instead of
load+cmp. imo the perf cost is roughly the same.
Because of that there is no need to have two bpf_rbtree_add/remove for
shared/exclusive nodes.

> Likewise for the linked list helpers.
>
> My opinion would be to have a different bpf_list_node_shared and
> bpf_rb_node_shared for the refcount_off >= 0 case, which encode both bpf_list_node
> and bpf_rb_node and their owner together, since the space and runtime check
> tradeoff is not required if you're using exclusive ownership.

true, but runtime performance of bpf_rbtree_add/remove is the same for both cases.
With bpf_rb_node_shared the bpf prog will save 8 bytes of memory at the expense
of plenty of verifier complexity that would need to have two checks
if (btf_id == bpf_rb_node_shared || btf_id == bpf_rb_node) in a bunch of places
and one of the two in another set of places.
I'd rather waste 8 bytes than complicate the verifier.
Having valid node->owner for exclusive rb-tree and linked list is a nice debug
feature too. Not saying we screwed up non-shared rb-tree, but I won't be
surprised if somebody figures out how to construct a similar race there
in the future.
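
Putting the pseudocode and the two corrections together, the kfuncs would look
roughly like this. Sketch only: 'owner' is the proposed hidden field inside
bpf_rb_node, __rb_add()/__rb_remove() stand in for the existing link/unlink
internals, and the signatures are simplified (less() etc. elided):

  int bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node)
  {
          /* free -> busy: claim the node before touching its links */
          if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON) != NULL) {
                  bpf_obj_drop(node);     /* replaces !RB_EMPTY_NODE() + drop */
                  return -EINVAL;
          }
          __rb_add(root, node);
          /* busy -> part of DS: this store is the linearization point */
          WRITE_ONCE(node->owner, root);
          return 0;
  }

  struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
                                        struct bpf_rb_node *node)
  {
          /*
           * Node must be owned by this root, whose lock we hold; owner only
           * goes from root to NULL under that same lock, so the check stays
           * true for the rest of the critical section.
           */
          if (READ_ONCE(node->owner) != root)
                  return NULL;            /* replaces the RB_EMPTY_NODE() check */
          __rb_remove(root, node);
          WRITE_ONCE(node->owner, NULL);
          return node;
  }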

> One more point is that you can minimize cost of cmpxchg by adding compound
> operation helpers.
>
> For instance (under appropriate locking):
> bpf_rbtree_move(rbtree1, rbtree2, node):
>         if (READ_ONCE(node.owner) != rbtree1)
>                 return -EINVAL
>         __rb_remove(rbtree1, node)
>         __rb_add(rbtree2, node)
>         WRITE_ONCE(node.owner, rbtree2)
>
> Instead of:
> bpf_rbtree_remove(rbtree1, node)
> bpf_rbtree_add(rbtree2, node)

That makes sense. Not urgent. Distant follow up.
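
For completeness, this is roughly what the motivating example from the cover
letter is expected to look like on the program side with bpf_refcount: take a
second owning reference and hand one reference to each collection. Sketch only,
reusing the cover letter's lock/head/root/less() names and eliding error
handling:

  struct node {
          long key;
          long val;
          struct bpf_rb_node rb;
          struct bpf_list_node ll;
          struct bpf_refcount ref;        /* makes the local kptr refcounted */
  };

  long bpf_prog(void *ctx)
  {
          struct node *n, *m;

          n = bpf_obj_new(typeof(*n));
          if (!n)
                  return 0;

          m = bpf_refcount_acquire(n);    /* second owning ref to the same node */

          bpf_spin_lock(&lock);
          bpf_list_push_back(&head, &n->ll);      /* one ref owned by the list   */
          bpf_rbtree_add(&root, &m->rb, less);    /* one ref owned by the rbtree */
          bpf_spin_unlock(&lock);
          return 0;
  }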