On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote: > On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote: > > On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote: > >> This series adds a rbtree datastructure following the "next-gen > >> datastructure" precedent set by recently-added linked-list [0]. This is > >> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr > >> instead of adding a new map type. This series adds a smaller set of API > >> functions than that RFC - just the minimum needed to support current > >> cgfifo example scheduler in ongoing sched_ext effort [2], namely: > >> > >> bpf_rbtree_add > >> bpf_rbtree_remove > >> bpf_rbtree_first > >> > >> [...] > >> > >> Future work: > >> Enabling writes to release_on_unlock refs should be done before the > >> functionality of BPF rbtree can truly be considered complete. > >> Implementing this proved more complex than expected so it's been > >> pushed off to a future patch. > >> > > > > > TBH, I think we need to revisit whether there's a strong need for this. I would > > even argue that we should simply make the release semantics of rbtree_add, > > list_push helpers stronger and remove release_on_unlock logic entirely, > > releasing the node immediately. I don't see why it is so critical to have read, > > and more importantly, write access to nodes after losing their ownership. And > > that too is only available until the lock is unlocked. > > > > Moved the next paragraph here to ease reply, it was the last paragraph > in your response. > > > > > Can you elaborate on actual use cases where immediate release or not having > > write support makes it hard or impossible to support a certain use case, so that > > it is easier to understand the requirements and design things accordingly? > > > > Sure, the main usecase and impetus behind this for me is the sched_ext work > Tejun and others are doing (https://lwn.net/Articles/916291/). One of the > things they'd like to be able to do is implement a CFS-like scheduler using > rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to > implement complicated scheduling logic. > > If we can implement such complicated scheduling logic, but it has so much > BPF-specific twisting of program logic that it's incomprehensible to scheduler > folks, that's not great. The overlap between "BPF experts" and "scheduler > experts" is small, and we want the latter group to be able to read BPF > scheduling logic without too much struggle. Lower learning curve makes folks > more likely to experiment with sched_ext. > > When 'rbtree map' was in brainstorming / prototyping, non-owning reference > semantics were called out as moving BPF datastructures closer to their kernel > equivalents from a UX perspective. Our emails crossed. See my previous email. Agree on the above. > If the "it makes BPF code better resemble normal kernel code" argumentwas the > only reason to do this I wouldn't feel so strongly, but there are practical > concerns as well: > > If we could only read / write from rbtree node if it isn't in a tree, the common > operation of "find this node and update its data" would require removing and > re-adding it. For rbtree, these unnecessary remove and add operations could Not really. See my previous email. > result in unnecessary rebalancing. Going back to the sched_ext usecase, > if we have a rbtree with task or cgroup stats that need to be updated often, > unnecessary rebalancing would make this update slower than if non-owning refs > allowed in-place read/write of node data. Agree. Read/write from non-owning refs is necessary. In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0 (your non-owning ref) should not be mixed with release_on_unlock logic. KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0. > > Also, we eventually want to be able to have a node that's part of both a > list and rbtree. Likely adding such a node to both would require calling > kfunc for adding to list, and separate kfunc call for adding to rbtree. > Once the node has been added to list, we need some way to represent a reference > to that node so that we can pass it to rbtree add kfunc. Sounds like a > non-owning reference to me, albeit with different semantics than current > release_on_unlock. A node with both link list and rbtree would be a new concept. We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing. That's a future discussion. > > > I think this relaxed release logic and write support is the wrong direction to > > take, as it has a direct bearing on what can be done with a node inside the > > critical section. There's already the problem with not being able to do > > bpf_obj_drop easily inside the critical section with this. That might be useful > > for draining operations while holding the lock. > > > > The bpf_obj_drop case is similar to your "can't pass non-owning reference > to bpf_rbtree_remove" concern from patch 1's thread. If we have: > > n = bpf_obj_new(...); // n is owning ref > bpf_rbtree_add(&tree, &n->node); // n is non-owning ref what I proposed in the other email... n should be untrusted here. That's != 'n is non-owning ref' > res = bpf_rbtree_first(&tree); > if (!res) {...} > m = container_of(res, struct node_data, node); // m is non-owning ref agree. m == PTR_TRUSTED with ref_obj_id == 0. > res = bpf_rbtree_remove(&tree, &n->node); a typo here? Did you mean 'm->node' ? and after 'if (res)' ... > n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory agree. n -> ref_obj_id > 0 > bpf_obj_drop(n); above is ok to do. 'n' becomes UNTRUSTED or invalid. > // Not safe to use m anymore 'm' should have become UNTRUSTED after bpf_rbtree_remove. > Datastructures which support bpf_obj_drop in the critical section can > do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning > references after bpf_obj_drop. 'invalidate all' sounds suspicious. I don't think we need to do sweaping search after bpf_obj_drop. > Then there's no potential use-after-free. > (For the above example, pretend bpf_rbtree_remove didn't already invalidate > 'm', or that there's some other way to obtain non-owning ref to 'n''s node > after rbtree_remove) > > I think that, in practice, operations where the BPF program wants to remove > / delete nodes will be distinct from operations where program just wants to > obtain some non-owning refs and do read / write. At least for sched_ext usecase > this is true. So all the additional clobbers won't require program writer > to do special workarounds to deal with verifier in the common case. > > > Semantically in other languages, once you move an object, accessing it is > > usually a bug, and in most of the cases it is sufficient to prepare it before > > insertion. We are certainly in the same territory here with these APIs. > > Sure, but 'add'/'remove' for these intrusive linked datastructures is > _not_ a 'move'. Obscuring this from the user and forcing them to use > less performant patterns for the sake of some verifier complexity, or desire > to mimic semantics of languages w/o reference stability, doesn't make sense to > me. I agree, but everything we discuss in the above looks orthogonal to release_on_unlock that myself and Kumar are proposing to drop. > If we were to add some datastructures without reference stability, sure, let's > not do non-owning references for those. So let's make this non-owning reference > stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags, > which will coincidentally make it very easy to remove if we later decide that > the complexity isn't worth it. You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ? So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ? If so then I agree. The 'release' part of the name was confusing. It's also not clear which arg it applies to. bpf_rbtree_remove has two args. Both are PTR_TRUSTED. I wouldn't introduce a new flag for this just yet. We can hard code bpf_rbtree_remove, bpf_list_pop for now or use our name suffix hack.