On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> > On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> >>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >> SNIP
> >>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >>>> the patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >>>> a simple try and encountered some problems. One problem is the immediate
> >>>> reuse of freed objects in the bpf memory allocator. Because qp-trie uses the
> >>>> bpf memory allocator to allocate and free qp_trie_branch, if a qp_trie_branch
> >>>> is reused immediately, the lookup procedure may oops due to the incorrect
> >>>> content in qp_trie_branch. Another problem is that the size limitation in
> >>>> bpf_mem_alloc() is 4096. It may be a little small for the total of key size
> >>>> and value size, but maybe I can use two separate bpf_mem_alloc instances for
> >>>> key and value.
> >>> 4096 limit for key+value size would be an acceptable trade-off.
> >>> With kptrs the user will be able to extend the value to much bigger sizes
> >>> while doing <= 4096 allocations at a time. Larger allocations are failing
> >>> in production more often than not. Any algorithm relying on a successful
> >>> >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> >>> the kernel is using, but we're not there yet in bpf land.
> >>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> >>> qp-trie would work in all contexts including sleepable progs.
> >>> As presented the use cases for qp-trie are quite limited.
> >>> If I understand correctly the concern for not using bpf_mem_alloc
> >>> is that qp_trie_branch can be reused. Can you provide an exact scenario
> >>> that will cause issues?
> >> The usage of a branch node during lookup is as follows:
> >> (1) check the index field of the branch node, which records the position of
> >> the nibble in which the keys of the child nodes differ
> >> (2) calculate the index of the child node by using the nibble value of the
> >> lookup key at that position
> >> (3) get the pointer to the child node by dereferencing the variable-length
> >> pointer array in the branch node
> >>
> >> Because both branch nodes and leaf nodes have variable length, I used one
> >> bpf_mem_alloc for these two node types, so if a leaf node is reused as a
> >> branch node, the pointer obtained in step 3 may be invalid.
> >>
> >> If separate bpf_mem_alloc instances are used for branch nodes and leaf nodes,
> >> it may still be problematic, because the updates to a reused branch node are
> >> not atomic and branch nodes with different numbers of child nodes will reuse
> >> the same object due to size alignment in the allocator, so the lookup
> >> procedure below may get an uninitialized pointer from the pointer array:
> >>
> >> lookup procedure                     update procedure
> >>
> >> // three child nodes, 48-bytes
> >> branch node x
> >>                                      // four child nodes, 56-bytes
> >>                                      reuse branch node x
> >>                                      x->bitmap = 0xf
> >> // got an uninitialized pointer
> >> x->nodes[3]
> >>                                      Initialize x->nodes[0~3]
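
For readers following along, the node layout implied by the three lookup steps
and the diagram above is roughly the sketch below. It is reconstructed only from
the fields named in this thread (parent, index, bitmap, nodes[]); the actual
struct in the posted patch may differ in layout and have additional fields.

#include <linux/types.h>
#include <linux/rcupdate.h>

/* Sketch only: a variable-size interior node as described above. Branches
 * with different numbers of children can still land in the same
 * bpf_mem_alloc size class, which is why a freed branch can be handed back
 * as a differently-sized one.
 */
struct qp_trie_branch {
	struct qp_trie_branch __rcu *parent; /* reverse walk for get_next_key() */
	unsigned int index;   /* position of the nibble where child keys differ */
	unsigned int bitmap;  /* one bit per present child slot */
	void __rcu *nodes[];  /* variable-length array of child pointers:
			       * branch nodes or leaf nodes */
};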
> > Looking at lookup:
> > +	while (is_branch_node(node)) {
> > +		struct qp_trie_branch *br = node;
> > +		unsigned int bitmap;
> > +		unsigned int iip;
> > +
> > +		/* When byte index equals with key len, the target key
> > +		 * may be in twigs->nodes[0].
> > +		 */
> > +		if (index_to_byte_index(br->index) > data_len)
> > +			goto done;
> > +
> > +		bitmap = calc_br_bitmap(br->index, data, data_len);
> > +		if (!(bitmap & br->bitmap))
> > +			goto done;
> > +
> > +		iip = calc_twig_index(br->bitmap, bitmap);
> > +		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> > +	}
> >
> > To be safe the br->index needs to be initialized after br->nodes and br->bitmap.
> > While deleting, br->index can be set to a special value which would mean
> > "restart the lookup from the beginning".
> > As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
> > Also the race is extremely tight, right?
> > After br->nodes[iip] + is_branch_node that memory needs to be deleted on another
> > cpu after spin_lock and reused in an update after another spin_lock.
> > Without an artificially big delay it's hard to imagine how the nodes[iip] pointer
> > would be initialized to some other qp_trie_branch or leaf during delete,
> > then the memory reused and nodes[iip] initialized again with the same address.
> > Theoretically possible, but unlikely, right?
> > And with correct ordering of scrubbing and updates to
> > br->nodes, br->bitmap, br->index it can be made safe.
> The reuse of a node not only introduces a safety problem (e.g. accessing an
> invalid pointer), but also incurs a false-negative problem (e.g. an existing
> element can not be found) as shown below:
>
> lookup A in X on CPU 1               update X on CPU 2
>
> [ branch X v1 ]
> leaf A | leaf B | leaf C
>                                      [ branch X v2 ]
>                                      leaf A | leaf B | leaf C | leaf D
>                                      // free and reuse branch X v1
>                                      [ branch X v1 ]
>                                      leaf O | leaf P | leaf Q
> // leaf A can not be found

Right. That's why I suggested to consider an hlist_nulls-like approach such as
the one htab is using.

> > We can add a sequence number to qp_trie_branch as well and read it before
> > and after. Every reuse would inc the seq.
> > If the seq number differs, re-read the node pointer from the parent.
> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
> consider the starvation of lookup by update/deletion. Maybe we need to fall back
> to the subtree spinlock after some number of rereads.

I think the fallback is overkill. The race is extremely unlikely.

> >> The problem may be solved by zeroing the unused part or the whole of the
> >> allocated object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure that
> >> the update of the node array happens before the update of the bitmap is also
> >> OK, and the cost will be much cheaper on x86 hosts.
> > Something like this, right.
> > We can also consider doing the lookup under spin_lock. For a large branchy trie
> > the cost of spin_lock may be negligible.
> Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
> or taking the subtree spinlock during lookup? IMO the latter will make the
> lookup performance suffer, but I will check it as well.

subtree lock. lookup perf will suffer a bit.
The numbers will tell the true story.

> >
> >> Beside the lookup procedure, get_next_key() from the syscall also looks up the
> >> trie locklessly. If a branch node is reused, the order of the returned keys may
> >> be broken. There is also a parent pointer in the branch node which is used for
> >> reverse lookup during get_next_key; the reuse may lead to unexpected skips in
> >> the iteration.
> > qp_trie_lookup_next_node can be done under spin_lock.
> > Iterating all map elements is a slow operation anyway.
> OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not
> sure whether or not the solution for lockless lookup will work for get_next_key.
> Will check.
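
To make the seq-number idea above concrete, the read side could look roughly
like the sketch below. It is an illustration only, not the patch: reuse_seq is
a hypothetical field (added to the branch layout sketched earlier) that the
free/reuse path would increment with matching write-side ordering, and
calc_br_bitmap()/calc_twig_index() are the helpers from the quoted lookup code.

/* Illustration only: walk one branch node and detect reuse via a
 * hypothetical reuse_seq that is bumped every time the object is recycled.
 * On a mismatch the caller re-reads the child pointer from the parent (or
 * restarts from the root) instead of trusting what was just read.
 */
static void *qp_trie_walk_branch(struct qp_trie_branch *br,
				 const unsigned char *data,
				 unsigned int data_len, bool *retry)
{
	unsigned int seq, bitmap, iip;
	void *child = NULL;

	*retry = false;

	/* Pairs with the increment + barrier done when the node is reused. */
	seq = smp_load_acquire(&br->reuse_seq);

	bitmap = calc_br_bitmap(br->index, data, data_len);
	if (bitmap & br->bitmap) {
		iip = calc_twig_index(br->bitmap, bitmap);
		child = rcu_dereference_check(br->nodes[iip],
					      rcu_read_lock_bh_held());
	}

	/* Ensure the reads above complete before re-checking the seq. */
	smp_rmb();
	if (seq != READ_ONCE(br->reuse_seq))
		*retry = true;  /* branch was recycled under us */

	return child;
}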
What kind of scalability are you concerned about?
get_next is done by user space only. Plenty of overhead already.

> >
> >>> Instead of call_rcu in qp_trie_branch_free (which will work only for
> >>> regular progs and have high overhead as demonstrated by the mem_alloc patches)
> >>> the qp-trie freeing logic can scrub that element, so it's ready to be
> >>> reused as another struct qp_trie_branch.
> >>> I guess I'm missing how rcu protects these internal data structures of qp-trie.
> >>> The rcu_read_lock of a regular bpf prog helps to stay lock-less during lookup?
> >>> Is that it?
> >> Yes. The update is made atomic by copying the parent branch node to a new branch
> >> node and replacing the pointer to the parent branch node by the new branch node,
> >> so the lookup procedure finds either the old branch node or the new branch node.
> >>> So to make qp-trie work in sleepable progs the algo would need to
> >>> be changed to do both call_rcu and call_rcu_tasks_trace everywhere
> >>> to protect these inner structs?
> >>> call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
> >>> around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
> >>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> >>> since the rcu_tasks_trace gp is observed only when freeing into the global
> >>> mem pool.
> >>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> >>> What is the worst that can happen? qp_trie_lookup_elem will go into the wrong
> >>> path, but won't crash, right? Can we do the hlist_nulls trick to address that?
> >>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> >>> Many kernel data structures know how to deal with such object reuse.
> >>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> >>> construct the logic in a way that obj reuse is not problematic.
> >> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> >> don't know how to do it for get_next_key because the iteration result needs to
> >> be ordered and can not skip elements that existed before the iteration began.
> > imo it's fine to spin_lock in get_next_key.
> > We should measure the lock overhead in lookup. It might be acceptable too.
> Will check that.
> >
> >> If the immediate reuse were removed from bpf_mem_alloc, besides the possibly
> >> decreased performance, is there any reason we can not do that?
> > What do you mean?
> > Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free?
> Yes. Does doing call_rcu() + call_rcu_tasks_trace() in a batch help, just like
> free_bulk does?
> > As I said above:
> > " call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
> > around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
> > may very well exhaust memory.
> > "
> > As an exercise try samples/bpf/map_perf_test on the non-prealloc hashmap
> > before the mem_alloc conversion. Just regular call_rcu consumes 100% of all
> > cpus. With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.
> > .
> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
> exhaust memory if __free_rcu_tasks_trace() can not be called in time. Will take
> a closer look at that.

In theory, yes. The batching makes a big difference.
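
For completeness, the per-object variant discussed above, i.e. call_rcu()
chained into call_rcu_tasks_trace() for every freed node, would look roughly
like the sketch below. The node header and function names are hypothetical;
it only illustrates why paying two back-to-back grace periods per object is
so much heavier than the batched free_bulk() path in bpf_mem_alloc.

#include <linux/container_of.h>
#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

/* Hypothetical header embedded in every qp-trie node. */
struct qp_trie_node_hdr {
	struct rcu_head rcu;
	/* node payload follows */
};

static void node_free_after_tasks_trace_gp(struct rcu_head *rcu)
{
	/* Both a regular RCU grace period and an RCU tasks trace grace
	 * period have passed, so neither non-sleepable nor sleepable progs
	 * can still be walking this node.
	 */
	kfree(container_of(rcu, struct qp_trie_node_hdr, rcu));
}

static void node_free_after_rcu_gp(struct rcu_head *rcu)
{
	/* Regular RCU readers are done; now wait for rcu_read_lock_trace()
	 * readers as well. This second grace period is the long one, and
	 * paying it per object is what lets freed nodes pile up under a
	 * quick update/delete load.
	 */
	call_rcu_tasks_trace(rcu, node_free_after_tasks_trace_gp);
}

static void qp_trie_node_free(struct qp_trie_node_hdr *node)
{
	call_rcu(&node->rcu, node_free_after_rcu_gp);
}

bpf_mem_alloc avoids this per-object cost by batching: freed objects go back
to a per-cpu cache immediately (hence the reuse being discussed), and the
tasks-trace grace period is only observed when a batch is returned to the
global pool.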