On Fri, Aug 19, 2022 at 06:00:22PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, 19 Aug 2022 at 10:55, Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
> >
> > Hi Kumar,
> >
> > Alexei and I talked about locking and a few other things today in regards
> > to my rbtree work. Some of this isn't a direct response to your ideas/notes
> > here, but hoping to summarize today's discussion inline with your code
> > samples and get your opinion.
> >
> > Also, some inline comments more directly addressing your notes.
>
> Hi Dave, thanks for sharing the notes.
>
> > On 8/17/22 5:04 AM, Kumar Kartikeya Dwivedi wrote:
> > > Alexei and I had a meeting where we discussed some of my ideas related
> > > to BPF linked lists. I am sharing the discussion with everyone to get
> > > wider feedback, and document what we discussed.
> > >
> > > The hard stuff is the shared ownership case, hence we can discuss this
> > > while we work on landing single ownership lists. I will be sharing my
> > > patches for that as an RFC.
> > >
> > > 1. Definition
> > >
> > > We can use BTF declaration tags to annotate a common structure like
> > > struct bpf_list_head, struct bpf_rb_root, etc.
> > >
> > > #define __value(kind, name, node) \
> > >         __attribute__((btf_decl_tag(#kind ":" #name ":" #node)))
> > >
> > > struct foo {
> > >         unsigned long data;
> > >         struct bpf_list_node node;
> > > };
> > >
> > > struct map_value {
> > >         struct bpf_spin_lock lock;
> > >         struct bpf_list_head head __value(struct, foo, node);
> > > };
> > >
> > > This allows us to parameterize any kind of intrusive collection.
> > >
> > > For the map-in-map use case:
> > >
> > > struct bar {
> > >         unsigned long data;
> > >         struct bpf_list_node node;
> > > };
> > >
> > > // Only two levels of types allowed, to ensure no cycles, and to
> > > // prevent ownership cycles
> > > struct foo {
> > >         unsigned long data;
> > >         struct bpf_spin_lock lock;
> > >         struct bpf_list_node node;
> > >         struct bpf_list_head head __value(struct, bar, node);
> > > };
> > >
> > > struct map_value {
> > >         struct bpf_spin_lock lock;
> > >         struct bpf_list_head head __value(struct, foo, node);
> > > };
> > >
> >
> > Will these still be 'bpf maps' under the hood? If the list were to use
>
> Nope, my idea was to get rid of maps for intrusive collections, and
> you always put bpf_list_head, bpf_rb_root in a map value. For global
> map-like use cases, you instead use global variables, which are also
> inside the map value of a BPF_MAP_TYPE_ARRAY. IMO there is no need
> anymore to add more and more map types with this new style of data
> structures. It is much more ergonomic to just use the head structure,
> either as a global variable, or as an allocated object, but again,
> that's my opinion. There will be some pros and cons for either
> approach :).

+1
If we can avoid adding new map types and instead recognize
'struct bpf_list_head' and 'struct bpf_rb_root' in global data and in
map values that would be the best. The code would look the most natural
to developers familiar with the kernel code.

> I am aware of the problem with global bpf_spin_lock (thanks to Alexei
> for spotting it), but as you described it can be solved by moving them
> into a different section.
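
To make that concrete, the global-variable form could look roughly like the
sketch below (reusing the __value() tag from above; it assumes libbpf puts
these globals in a separate non-mmapable data section as Kumar describes,
and the section name here is purely illustrative):

struct foo {
        unsigned long data;
        struct bpf_list_node node;
};

/* the whole section would become one non-mmapable ARRAY map value that
 * holds both the lock and the head it protects */
struct bpf_spin_lock foo_lock SEC(".data.no_mmap");
struct bpf_list_head foo_head __value(struct, foo, node) SEC(".data.no_mmap");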
> > convention similar to the rbtree RFC, the first (non map-in-map) def could
> > be written like:
> >
> > struct foo {
> >         unsigned long data;
> >         struct bpf_list_node node;
> > };
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_LINKED_LIST);
> >         __type(value, struct foo);
> > } list SEC(".maps");
> >
> > I think we're thinking along similar lines with regards to the BTF tag, but
> > I was thinking of tagging the value type instead of the container, something
> > like:
> >
> > struct foo {
> >         unsigned long data;
> >         struct bpf_list_node node __protected_list_node;
> > };
> >
> > 'protected' meaning verifier knows to prevent prog from touching the
> > bpf_list_node. Currently my rbtree RFCv1 is just assuming that value type
> > will have rb_node at offset 0. BTF tag would eliminate this offset
> > requirement and allow value types to be part of multiple data structures:
> >
> > struct foo {
> >         unsigned long data;
> >         struct bpf_list_node node __protected_list_node;
> >         struct bpf_rb_node rb_node __protected_rb_node;
> > };

I'm not sure what '__protected_rb_node' tag buys here.
The verifier can infer the same from 'struct bpf_rb_node' type name.

> >
> > Not a hard requirement for me, but nice-to-have: support for heterogeneous
> > value types in same list / rbtree. Map def's __type(value, struct foo)
> > wouldn't work in this case, I think your __value(struct, foo, node) would
> > have the same issue.

I thought Kumar's proposal to do:
        struct bpf_list_head head __value(struct, foo, node);
was to tell the verifier how to iterate the linked list with the appropriate
container_of(). The same tagging is needed for:
        struct bpf_rb_root tree __value(struct, foo, node)
to tell bpf_rb_find(key, tree, cmp_cb) what btf_id to return and how to
container_of() from rb_node to the bpf program supplied struct, to pass that
btf_id pointer into cmp_cb.

> >
> > But I think this should be possible with BTF tags somehow. List
> > helpers are ostensibly only touching the list_{head,node} - and similar for
> > rbtree, and we're both planning on explicit in-prog typed allocation.
> > If type can be propagated from alloc -> reg state -> helper input ->
> > helper output, helpers could use reg BTF info w/ properly tagged field
> > to manipulate the right field in the value struct.
>
> We can have multiple __value on a list_head and add some way to
> disambiguate what the type will be on list_remove. The problem is not
> during list_add, as you know the type of the node being added. It is
> when you list_remove that you need to be able to disambiguate the
> type, so that we set the reg with the correct btf, btf_id and then set
> the right reg->off (so that container_of can give the entry). Until the
> disambiguation step is done, it is unknown what the type might be.

list_remove and list iterate/find need that btf_id.

> When thinking of heterogeneous values, we should probably look to add
> a way to do generic variant types in BPF, such that e.g. the first
> field is a tag, and then the actual data of that type follows. This
> would allow us to use them not only in intrusive collections, but
> anywhere else, and probably even store them in map values as kptrs.

Not following this idea. Could you give an example?

> I think it is much simpler to start with homogeneous types first, though.

> > In that case the tag would have to be on the value struct's field, not the
> > container. I do like that your __value(struct, foo, node) is teaching the
> > container what named field to manipulate. If value struct were to be part of
> > two lists this would make it possible to disambiguate.
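
Putting Dave's dual-membership example together with the container-side tag,
a value type sitting in both a list and an rbtree could be described roughly
like this (sketch only; field names are illustrative):

struct foo {
        unsigned long data;
        struct bpf_list_node list_node;
        struct bpf_rb_node rb_node;
};

struct map_value {
        struct bpf_spin_lock lock;
        /* each container names the field it links through, so the verifier
         * can container_of() back to struct foo on remove/iterate */
        struct bpf_list_head head __value(struct, foo, list_node);
        struct bpf_rb_root root __value(struct, foo, rb_node);
};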
> > When we discussed this Alexei mentioned existing pointer casting helper
> > pattern (e.g. 'bpf_skc_to_tcp_sock') as potentially being helpful here.
>
> Indeed, but I think you need some bit of info at runtime to be able to do this.
>
> > > 2. Building Blocks - A type safe allocator
> > >
> > > Add bpf_kptr_alloc, bpf_kptr_free.
> > > This will use bpf_mem_alloc infra, allocator maps.
> > > Alexei mentioned that Delyan is working on support for exposing
> > > bpf_mem_alloc using an allocator map.
> > > Allocates referenced PTR_TO_BTF_ID (should we call these local kptr?):
> > >         reg->btf == prog->aux->btf
> > >         reg->btf_id = bpf_core_type_id_local(...)
> > > btf_struct_access allows writing to these objects.
> > > Due to type visibility, we can embed objects with special semantics
> > > inside these user defined types.
> > > Add a concept of constructing/destructing kptr.
> > > constructing -> normal kptr, escapable -> destructing
> > > In constructing and destructing state, pointer cannot escape the
> > > program. Hence, only one CPU is guaranteed to observe the object in
> > > those states. So when we have access to single ownership kptr, we know
> > > nobody else can access it. Hence we can also move its state from
> > > normal to destructing state.
> > > In case of shared ownership, we will have to rely on the result of
> > > bpf_refcount_put for this to work.
> > >
> > > 3. Embedding special fields inside such allocated kptr
> > >
> > > We must allow programmer to compose their own user defined BPF object
> > > out of building blocks provided by BPF.
> > > BPF users may have certain special objects inside this allocated
> > > object. E.g. bpf_list_node, bpf_spin_lock, even bpf_list_head
> > > (map-in-map use case).
> > > btf_struct_access won't allow direct reads/writes to these fields.
> > > Each of them needs to be constructed before the object is considered
> > > fully constructed.
> > > An unconstructed object's kptr cannot escape a program, it can only be
> > > destructed and freed.
> > > This has multiple advantages. We can add fields which have complex
> > > initialization requirements.
> > > This also allows safe recycling of memory without having to do zero
> > > init or inserting constructor calls automatically from verifier.
> > > Allows constructors to have parameters in future, also allows complex
> > > multi-step initialization of fields in future.
> > >
> >
> > I don't fully understand "shared ownership" from 2) and don't have a use case
>
> Shared ownership is explained further later in section 5.
>
> > for complex constructors in 3), but broadly agree with everything else. Will
> > do another pass.
> >
> > > 4. Single Ownership Linked Lists
> > >
> > > The kptr has single ownership.
> > > Program has to release it before BPF_EXIT, either free or move it out
> > > of program.
> > > Once passed to list, the program loses ownership.
> > > But BPF can track that until spin_lock is released, nobody else can
> > > touch it, so we can technically still list_remove a node we added
> > > using list_add, and then we will be owning it after unlock.
> > > list_add marks reference_state as 'release_on_unlock'
> > > list_remove unmarks reference_state
> > > Alexei: Similar to Dave's approach, but different implementation.
> > > bpf_spin_unlock walks acquired_refs and release_reference marked ones.
> > > No other function calls allowed in critical section, hence
> > > reference_state remains the same.
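
As a usage sketch of that single ownership flow (the helper names and the
exact alloc signature are placeholders, not a settled API; 'v' is a map value
holding the lock/head from section 1):

struct foo *f;

f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), ...);
if (!f)
        return 0;                  /* acquired ref, program is sole owner */
f->data = 42;

bpf_spin_lock(&v->lock);
bpf_list_add(&f->node, &v->head);  /* ref marked release_on_unlock */
/* we could still bpf_list_remove(&f->node) here and own it again */
bpf_spin_unlock(&v->lock);         /* walks acquired_refs, releases marked ones */
return 0;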
> > >
> > > ----------
> > >
> > > 5. Shared Ownership
> > >
> > > Idea: Add bpf_refcount as special field embeddable in allocated kptrs.
> > > bpf_refcount_set(const), bpf_refcount_inc(const), bpf_refcount_put(ptr).
> > > If combined with RCU, can allow safe kptr_get operations for such objects.
> > > Each rb_root, list_head requires ownership of node.
> > > Caller will transfer its reference to them.
> > > If having only a single reference, do inc before transfer.
> > > It is a generic concept, and can apply to kernel types as well.
> > > When linking after allocation, it is extremely cheap to set, add, add, add…
> > >
> > > We add 'static_ref' to each reference_state to track incs/decs
> > >         acq     = static_ref = 1
> > >         set     = static_ref = K (must be in [1, …] range)
> > >         inc     = static_ref += K
> > >         rel/put = static_ref -= 1 (may allow K, dunno)
> > >
> > > Alexei suggested that he prefers if helpers did the increment on their
> > > own in case where the bpf_refcount field exists in the object. I.e.
> > > instead of caller incrementing and then passing their reference to
> > > lists or rbtree, the add helpers receive hidden parameter to refcount
> > > field address automatically and bump the refcount when adding. In that
> > > case, they won't be releasing caller's reference_state.
> > > Then this static_ref code is not required.
> > >
> > > Kartikeya: No strong opinions, this is also another way. One advantage
> > > of managing refcount on caller side and just keeping helpers move only
> > > (regardless of single owner or shared owner kptr) is that helpers
> > > themselves have the same semantics. It always moves ownership of a
> > > reference. Also, one inc(K) and multiple add is a little cheaper than
> > > multiple inc(1) on each add.
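
To make the caller-managed accounting concrete, a sketch (helper names, the
node fields and list heads are made up purely for illustration):

f = bpf_kptr_alloc(...);              /* acq: static_ref = 1 */
bpf_refcount_set(&f->refcount, 2);    /* set: static_ref = 2 */

bpf_spin_lock(&lock);
bpf_list_add(&f->node_a, &head_a);    /* moves one ref: static_ref = 1 */
bpf_list_add(&f->node_b, &head_b);    /* moves one ref: static_ref = 0 */
bpf_spin_unlock(&lock);
/* static_ref == 0: the program no longer owns a reference to f */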
> > >
> > > 6. How does the verifier reason about shared kptr we don't know the state of?
> > >
> > > Consider a case where we load a kptr which has shared ownership from a
> > > map using kptr_get.
> > >
> > > Now, it may have a list_node and a rb_node. We don't know whether this
> > > node is already part of some list (so that list_node is occupied),
> > > same for rb_node.
> > >
> > > There can be races like two CPUs having access to the node:
> > >
> > > CPU 0                              CPU 1
> > > lock(&list1_lock)                  lock(&list2_lock)
> > > list_add(&node, &list2)
> > >   next.prev = node;
> > >   node.next = next;                list_remove(&node)
> > >                                      node.next = NULL;
> > >                                      node.prev = NULL;
> > >   node.prev = prev;
> > >   prev.next = node;
> > > unlock(&list1_lock);               unlock(&list2_lock);
> > >
> > > Interleavings can leave nodes in inconsistent states.
> > > We need to ensure that when we are doing list_add or list_remove for
> > > kptr we don't know the state of, it is only in a safe context with
> > > ownership of that operation.
> > >
> > > Remove:
> > >
> > > When inside list_for_each helper, list_remove is safe for nodes since
> > > we are protected by lock.
> > >
> > > Idea: An owner field alongside the list_node and rb_node.
> > > list_add sets it to the address of list_head, list_remove sets it to
> > > NULL. This will be done under spinlock of the list.
> > >
> > > When we get access to the object in an unknown state for these fields,
> > > we first lock the list we want to remove it from, check the owner
> > > field, and only remove it when we see that owner matches locked list.
> > >
> > > Each list_add updates owner, list_remove sets to NULL.
> > >
> > > bpf_spin_lock(&lock);
> > > if (bpf_list_owns_node(&i->node, &list)) { // checks owner
> > >         list_remove(&i->node);
> > > }
> > > bpf_spin_unlock(&lock);
> > >
> > > bpf_list_owns_node poisons pointer in false branch, so user can only
> > > list_remove in true branch.
> > >
> > > If the owner is not a locked list pointer, it will be either NULL or
> > > some other value (because of previous list_remove while holding same
> > > lock, or list_add while holding some other list lock).
> > > If the owner is our list pointer, we can be sure this is safe, as we
> > > have already locked list.
> > > Otherwise, previous critical section must have modified owner.
> > > So one single load (after inlining this helper) allows unlinking a
> > > random kptr we have a reference to, safely.
> > >
> > > Cost: 8 bytes per object. Advantages: Prevents bugs like racy
> > > list_remove and double list_add, doesn't need fallible helpers (the
> > > check that would have been inside has to be done by the user now).
> > > Don't need the abort logic.
> > >
> >
> > I agree, keeping track of owner seems necessary. Seems harder to verify
> > statically than lock as well. Alexei mentioned today that a combination
> > "grab lock and take ownership" helper for the dynamic check might make
> > sense.
> >
> > Tangentially, I've been poking at the ergonomics of libbpf lock definition
> > this week and think I have something reasonable:
> >
> > struct node_data {
> >         struct rb_node node;
> >         __u32 one;
> >         __u32 two;
> > };
> >
> > struct l {
> >         __uint(type, BPF_MAP_TYPE_ARRAY);
> >         __type(key, u32);
> >         __type(value, struct bpf_spin_lock);
> >         __uint(max_entries, 1);
> > } lock_arr SEC(".maps");
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_RBTREE);
> >         __type(value, struct node_data);
> >         __array(lock, struct l);
> > } rbtree1 SEC(".maps") = {
> >         .lock = {
> >                 [0] = &lock_arr,
> >         },
> > };
> >
> > struct {
> >         __uint(type, BPF_MAP_TYPE_RBTREE);
> >         __type(value, struct node_data);
> >         __array(lock, struct l);
> > } rbtree2 SEC(".maps") = {
> >         .lock = {
> >                 [0] = &lock_arr,
> >         },
> > };
> >
> > ... in BPF prog
> >
> > bpf_spin_lock(&lock_arr[0]);
> >
> > // Can safely operate on either tree, move nodes between them, etc.
> >
> > bpf_spin_unlock(&lock_arr[0]);
> >
> > Notes:
> > * Verifier knows which lock is supposed to be used at map creation time
> > * Can reuse bpf_verifier_state's 'active_spin_lock' member, so no addt'l
> >   bookkeeping needed to verify that rbtree_add or similar is happening
> >   in critical section
>
> Yes, this is similar to my approach, except what I'm doing is (suppose
> we fix the bpf_spin_lock in mmap-able map value problem):

We can teach libbpf to support more than 3 hard coded global maps
(bss, rodata, data). So any named section will go into its own array
with max_entries=1 that won't be mmap-able and will allow hosting
bpf_rb_root, bpf_spin_lock, etc.

> The list_head and lock protecting it are global variables, hence in
> the same map value for the global variable's array map (for now only
> one lock is allowed in a map value, but we may allow some guarded_by
> annotation to associate different locks to different containers).
> Now, you can use the same active_spin_lock infra to track whether I
> hold the one in the same map value as the list_head. More on that
> below.
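
Spelling out the "move nodes between them" comment in Dave's rbtree1/rbtree2
example above, usage under the shared lock might look like this (the
remove/add helper names are hypothetical):

struct node_data *n;

bpf_spin_lock(&lock_arr[0]);
n = bpf_rbtree_remove(&rbtree1, ...);   /* hypothetical helpers */
if (n)
        bpf_rbtree_add(&rbtree2, n);
bpf_spin_unlock(&lock_arr[0]);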
> > * Can benefit from relo goodness (e.g. rbtree3 using extern lock in another
> >   file)
> > * If necessary, similar dynamic verification behavior as just keeping lock
> >   internal
> > * Implementation similarities with map_of_map 'inner_map'. Similarly to
> >   inner_map_fd, kernel needs to know about lock_map_fd. Can use map_extra
> >   for this to avoid uapi changes
> >
> > Alexei and I discussed possibly allowing raw 'struct bpf_spin_lock' global
> > var, which would require some additional libbpf changes as bpf_spin_lock
> > can't be mmap'd and libbpf tries to mmap all .data maps currently. Perhaps
> > a separate .data.no_mmap section.
> >
> > This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> > how to statically verify lock in that case. Have you had a chance to think
> > about it further?
>
> You rely on the lock being in the same allocation, and manipulation
> done on an object from the same 'lookup'. See below:
>
> struct foo {
>         struct bpf_spin_lock lock;
>         struct bpf_list_head head __value(...);
> };
>
> struct map_value {
>         struct foo __local_kptr *ptr;
> };
>
> struct {
>         __uint(type, BPF_MAP_TYPE_ARRAY);
>         __type(key, int);
>         __type(value, struct map_value);
>         __uint(max_entries, 8);
> } array_of_lists SEC(".maps");
>
> In my case, the structure is the map, so a pointer to the structure
> inside a map makes it map-in-map (now common, the existing map-in-maps
> just hide this from you, so it's pretty much the same thing
> anyway...).
>
> This is just an example, it can be one more level deep, but anyway.
>
> When I do a map lookup, there is a check in
> verifier.c:reg_may_point_to_spin_lock, this preserves reg->id on NULL
> unmarking. This reg->id is then remembered when you take the lock inside
> this map value, to associate it back to unlock correctly.
>
> Now, suppose you load the kptr. You know the kptr has a lock, you will
> update this check to also consider local kptr with locks. The reg->id
> is preserved after the loaded kptr is NULL checked, but it is unique for
> each load of the kptr. You lock the spin_lock in the kptr, you then add
> to the list, the list_add verifier check goes and sees whether the
> currently held lock and the current list_head come from the same reg->id
> (you know the reg of list_head, right? So you know the id as well, and
> you match that to cur->active_spin_lock_id). If so, it is the correct
> lock, we locked the lock in the same loaded kptr as the one whose
> list_head we are list_add-ing to.
>
> For global variables, the check needs more work. In the normal map
> lookup case, we assign a fresh reg->id whenever you do a map lookup, so
> in the case of an array map spin lock, the same key will set a different
> id in cur->active_spin_lock_id for two different map values from two
> different lookups. This is because we don't know if it is the same map
> value on the second lookup, so both locks in different map values are
> considered different locks. The id is the unique lock id, essentially.
>
> Since global variables are in a direct_value_addr map with 1 max_entry,
> we don't need to assign a fresh reg->id at each pseudo ldimm64 insn. We
> can instead teach it to either track it using id (for the case of
> normal map lookups and local kptr), or map_ptr to accommodate global
> variables' non-unique ids. At once, only one of the two is set, the
> other is zero.
>
> Then everything falls in place. We always match both map_ptr and id.
> For global data and map lookups, the map_ptr is matched, id will be 0
> for global data, non-zero for normal map lookups. There is only one
> map value so the lock protects everything in it. For the other case I
> described above, map_ptr is NULL but id will be different if not from
> the same 'lookup' in case of local kptr (PTR_TO_BTF_ID).
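
Restated against the array_of_lists example above, the association being
checked is roughly the following (list helpers are placeholders, kptr
load/acquire details glossed over; 'n' is a previously allocated node of the
head's value type):

struct map_value *v;
struct foo *f;
int key = 0;

v = bpf_map_lookup_elem(&array_of_lists, &key);
if (!v)
        return 0;

f = v->ptr;                        /* loaded local kptr, gets its own reg->id */
if (!f)
        return 0;

bpf_spin_lock(&f->lock);           /* cur->active_spin_lock_id = f's reg->id */
bpf_list_add(&n->node, &f->head);  /* ok: &f->head carries the same reg->id */
bpf_spin_unlock(&f->lock);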
>
> We also have map_uid, which is assigned to the map_ptr of inner map
> lookups. But remember that we are talking of map values above, so even
> for lookups from two different inner maps of the same map, we get two
> map values whose map_ptr is technically the same (even if the map_uid
> was different), but their reg->id _will_ be different, so the above
> checks are sufficient to disambiguate spin locks for all kinds of cases.
>
> Keeping lock and data in the same allocation thus allows you to
> associate locks statically even for dynamic allocations, enabling the
> map-in-map use case.

All that makes sense, but consider the use case where we need an rb_root
for every cgroup. The 'struct bpf_rb_root' will be created dynamically in
cgroup local storage. We can create a lock in the same cgroup local
storage as well. It's all nice and the verifier can do locking checks
statically, but how can the program transfer an rb_node from one rb tree
to another in a different cgroup?
Either two locks need to be held, and I don't see a way to check that
statically, or one bpf_spin_lock should be used across all cgroups.
Thoughts?
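
To be concrete, the problematic pattern would be something like this
(per-cgroup storage assumed to hold its own lock and root; the remove/add
helpers are hypothetical):

/* storage_a, storage_b: cgroup local storage of two different cgroups,
 * each holding { struct bpf_spin_lock lock; struct bpf_rb_root root; } */

bpf_spin_lock(&storage_a->lock);
bpf_spin_lock(&storage_b->lock);              /* two locks held at once:    */
n = bpf_rbtree_remove(&storage_a->root, ...); /* how to verify statically?  */
bpf_rbtree_add(&storage_b->root, n, ...);
bpf_spin_unlock(&storage_b->lock);
bpf_spin_unlock(&storage_a->lock);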