On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote: > On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote: >> This series adds a rbtree datastructure following the "next-gen >> datastructure" precedent set by recently-added linked-list [0]. This is >> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr >> instead of adding a new map type. This series adds a smaller set of API >> functions than that RFC - just the minimum needed to support current >> cgfifo example scheduler in ongoing sched_ext effort [2], namely: >> >> bpf_rbtree_add >> bpf_rbtree_remove >> bpf_rbtree_first >> >> [...] >> >> Future work: >> Enabling writes to release_on_unlock refs should be done before the >> functionality of BPF rbtree can truly be considered complete. >> Implementing this proved more complex than expected so it's been >> pushed off to a future patch. >> > > TBH, I think we need to revisit whether there's a strong need for this. I would > even argue that we should simply make the release semantics of rbtree_add, > list_push helpers stronger and remove release_on_unlock logic entirely, > releasing the node immediately. I don't see why it is so critical to have read, > and more importantly, write access to nodes after losing their ownership. And > that too is only available until the lock is unlocked. > Moved the next paragraph here to ease reply, it was the last paragraph in your response. > > Can you elaborate on actual use cases where immediate release or not having > write support makes it hard or impossible to support a certain use case, so that > it is easier to understand the requirements and design things accordingly? > Sure, the main usecase and impetus behind this for me is the sched_ext work Tejun and others are doing (https://lwn.net/Articles/916291/). One of the things they'd like to be able to do is implement a CFS-like scheduler using rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to implement complicated scheduling logic. If we can implement such complicated scheduling logic, but it has so much BPF-specific twisting of program logic that it's incomprehensible to scheduler folks, that's not great. The overlap between "BPF experts" and "scheduler experts" is small, and we want the latter group to be able to read BPF scheduling logic without too much struggle. Lower learning curve makes folks more likely to experiment with sched_ext. When 'rbtree map' was in brainstorming / prototyping, non-owning reference semantics were called out as moving BPF datastructures closer to their kernel equivalents from a UX perspective. If the "it makes BPF code better resemble normal kernel code" argumentwas the only reason to do this I wouldn't feel so strongly, but there are practical concerns as well: If we could only read / write from rbtree node if it isn't in a tree, the common operation of "find this node and update its data" would require removing and re-adding it. For rbtree, these unnecessary remove and add operations could result in unnecessary rebalancing. Going back to the sched_ext usecase, if we have a rbtree with task or cgroup stats that need to be updated often, unnecessary rebalancing would make this update slower than if non-owning refs allowed in-place read/write of node data. Also, we eventually want to be able to have a node that's part of both a list and rbtree. Likely adding such a node to both would require calling kfunc for adding to list, and separate kfunc call for adding to rbtree. Once the node has been added to list, we need some way to represent a reference to that node so that we can pass it to rbtree add kfunc. Sounds like a non-owning reference to me, albeit with different semantics than current release_on_unlock. > I think this relaxed release logic and write support is the wrong direction to > take, as it has a direct bearing on what can be done with a node inside the > critical section. There's already the problem with not being able to do > bpf_obj_drop easily inside the critical section with this. That might be useful > for draining operations while holding the lock. > The bpf_obj_drop case is similar to your "can't pass non-owning reference to bpf_rbtree_remove" concern from patch 1's thread. If we have: n = bpf_obj_new(...); // n is owning ref bpf_rbtree_add(&tree, &n->node); // n is non-owning ref res = bpf_rbtree_first(&tree); if (!res) {...} m = container_of(res, struct node_data, node); // m is non-owning ref res = bpf_rbtree_remove(&tree, &n->node); n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory bpf_obj_drop(n); // Not safe to use m anymore Datastructures which support bpf_obj_drop in the critical section can do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning references after bpf_obj_drop. Then there's no potential use-after-free. (For the above example, pretend bpf_rbtree_remove didn't already invalidate 'm', or that there's some other way to obtain non-owning ref to 'n''s node after rbtree_remove) I think that, in practice, operations where the BPF program wants to remove / delete nodes will be distinct from operations where program just wants to obtain some non-owning refs and do read / write. At least for sched_ext usecase this is true. So all the additional clobbers won't require program writer to do special workarounds to deal with verifier in the common case. > Semantically in other languages, once you move an object, accessing it is > usually a bug, and in most of the cases it is sufficient to prepare it before > insertion. We are certainly in the same territory here with these APIs. Sure, but 'add'/'remove' for these intrusive linked datastructures is _not_ a 'move'. Obscuring this from the user and forcing them to use less performant patterns for the sake of some verifier complexity, or desire to mimic semantics of languages w/o reference stability, doesn't make sense to me. If we were to add some datastructures without reference stability, sure, let's not do non-owning references for those. So let's make this non-owning reference stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags, which will coincidentally make it very easy to remove if we later decide that the complexity isn't worth it.