On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> Hi,
>
> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> > On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> SNIP
> >> For atomic ops and kmalloc overhead, I think I can reuse the idea from the
> >> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >> a simple try and encountered some problems. One problem is the
> >> immediate reuse of freed objects in the bpf memory allocator. Because qp-trie
> >> uses the bpf memory allocator to allocate and free qp_trie_branch, if
> >> qp_trie_branch is reused immediately, the lookup procedure may oops due
> >> to the incorrect content in qp_trie_branch. Another problem is that the
> >> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
> >> the total of key size and value size, but maybe I can use two
> >> separate bpf_mem_allocs for key and value.
> >
> > 4096 limit for key+value size would be an acceptable trade-off.
> > With kptrs the user will be able to extend value to much bigger sizes
> > while doing <= 4096 allocation at a time. Larger allocations are failing
> > in production more often than not. Any algorithm relying on successful
> > >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> > the kernel is using, but we're not there yet in bpf land.
> > The benefits of bpf_mem_alloc in qp-trie would be huge though.
> > qp-trie would work in all contexts including sleepable progs.
> > As presented the use cases for qp-trie are quite limited.
> > If I understand correctly the concern for not using bpf_mem_alloc
> > is that qp_trie_branch can be reused. Can you provide an exact scenario
> > that will cause issues?
>
> The usage of a branch node during lookup is as follows:
> (1) check the index field of the branch node, which records the position of
> the nibble in which the keys of the child nodes differ
> (2) calculate the index of the child node by using the nibble value of the
> lookup key at that position
> (3) get the pointer of the child node by dereferencing the variable-length
> pointer array in the branch node
>
> Because both branch nodes and leaf nodes have variable length, I used one
> bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
> node, the pointer got in step 3 may be invalid.
>
> If using separate bpf_mem_allocs for branch nodes and leaf nodes, it may still
> be problematic, because the updates to a reused branch node are not atomic and
> branch nodes with a different number of child nodes will reuse the same object
> due to size alignment in the allocator, so the lookup procedure below may get
> an uninitialized pointer in the pointer array:
>
> lookup procedure                   update procedure
>
> // three child nodes, 48 bytes
> branch node x
>                                    // four child nodes, 56 bytes
>                                    reuse branch node x
>                                    x->bitmap = 0xf
> // got an uninitialized pointer
> x->nodes[3]
>                                    Initialize x->nodes[0~3]

Looking at lookup:

+	while (is_branch_node(node)) {
+		struct qp_trie_branch *br = node;
+		unsigned int bitmap;
+		unsigned int iip;
+
+		/* When byte index equals with key len, the target key
+		 * may be in twigs->nodes[0].
+		 */
+		if (index_to_byte_index(br->index) > data_len)
+			goto done;
+
+		bitmap = calc_br_bitmap(br->index, data, data_len);
+		if (!(bitmap & br->bitmap))
+			goto done;
+
+		iip = calc_twig_index(br->bitmap, bitmap);
+		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
+	}

To be safe, br->index needs to be initialized after br->nodes and br->bitmap.
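Roughly like the sketch below, just to illustrate the ordering. Not the patch
code: the struct layout is assumed from the snippet above (parent pointer and
the rest omitted) and qp_trie_branch_publish() is a made-up helper name.

/* sketch only: assumed partial layout, br->index acts as the "node is
 * valid" marker that lookup reads first.
 */
struct qp_trie_branch {
	unsigned int index;
	unsigned int bitmap;
	void __rcu *nodes[];
};

/* update side: when (re)initializing a branch node, fill the child array
 * and the bitmap before br->index, so a lookup that sees a valid index
 * never sees stale nodes[]/bitmap.
 */
static void qp_trie_branch_publish(struct qp_trie_branch *br, unsigned int index,
				   unsigned int bitmap, void **children,
				   unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < cnt; i++)
		rcu_assign_pointer(br->nodes[i], children[i]);
	WRITE_ONCE(br->bitmap, bitmap);

	/* pairs with an smp_rmb() in the lookup loop after it reads br->index */
	smp_wmb();
	WRITE_ONCE(br->index, index);
}

On the lookup side that means doing index = READ_ONCE(br->index); smp_rmb();
before touching br->bitmap and br->nodes[].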
While deleting, br->index can be set to a special value which would mean
"restart the lookup from the beginning".
As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.

Also the race is extremely tight, right?
After br->nodes[iip] + is_branch_node that memory needs to be deleted on
another cpu after spin_lock and reused in an update after another spin_lock.
Without an artificially big delay it's hard to imagine how the nodes[iip]
pointer would be initialized to some other qp_trie_branch or leaf during
delete, then the memory reused and nodes[iip] initialized again with the same
address. Theoretically possible, but unlikely, right?
And with correct ordering of scrubbing and updates to br->nodes, br->bitmap,
br->index it can be made safe.

We can add a sequence number to qp_trie_branch as well and read it before and
after. Every reuse would inc the seq. If the seq number differs, re-read the
node pointer from the parent.
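Very roughly like below. A sketch only, not working code: the 'seq' field and
'trie->root' are assumptions, it restarts from the root for simplicity instead
of re-reading from the parent, and a real version would likely want proper
seqcount-style odd/even handling (or bumping the counter both before and after
the node is rewritten) plus the matching smp_wmb()s on the reuse path.

	while (is_branch_node(node)) {
		struct qp_trie_branch *br = node;
		unsigned int seq, bitmap, iip;
		void *next = NULL;

		/* hypothetical per-node counter, bumped on every reuse */
		seq = READ_ONCE(br->seq);
		smp_rmb();

		if (index_to_byte_index(br->index) <= data_len) {
			bitmap = calc_br_bitmap(br->index, data, data_len);
			if (bitmap & br->bitmap) {
				iip = calc_twig_index(br->bitmap, bitmap);
				next = rcu_dereference_check(br->nodes[iip],
							     rcu_read_lock_bh_held());
			}
		}

		/* validate that br wasn't reused while we looked at it */
		smp_rmb();
		if (READ_ONCE(br->seq) != seq) {
			node = rcu_dereference_check(trie->root,
						     rcu_read_lock_bh_held());
			continue;
		}

		if (!next)
			goto done;
		node = next;
	}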
> The problem may be solved by zeroing the unused or whole part of the allocated
> object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
> the node array happens before the update of the bitmap is also OK, and the
> cost will be much cheaper on x86 hosts.

Something like this, right.
We can also consider doing lookup under spin_lock. For a large branchy trie
the cost of spin_lock may be negligible.

> Besides the lookup procedure, get_next_key() from the syscall also looks up
> the trie locklessly. If a branch node is reused, the order of returned keys
> may be broken. There is also a parent pointer in the branch node which is used
> for reverse lookup during get_next_key; the reuse may lead to unexpected skips
> in the iteration.

qp_trie_lookup_next_node can be done under spin_lock.
Iterating all map elements is a slow operation anyway.

> > Instead of call_rcu in qp_trie_branch_free (which will work only for
> > regular progs and have high overhead as demonstrated by mem_alloc patches)
> > the qp-trie freeing logic can scrub that element, so it's ready to be
> > reused as another struct qp_trie_branch.
> > I guess I'm missing how rcu protects these internal data structures of qp-trie.
> > The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> > Is that it?
>
> Yes. The update is made atomic by copying the parent branch node to a new branch
> node and replacing the pointer to the parent branch node by the new branch node,
> so the lookup procedure either finds the old branch node or the new branch node.
>
> > So to make qp-trie work in sleepable progs the algo would need to
> > be changed to do both call_rcu and call_rcu_tasks_trace everywhere
> > to protect these inner structs?
> > call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
> > around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
> > may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> > since the rcu_task_trace gp is observed only when freeing into the global
> > mem pool.
> > Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> > What is the worst that can happen? qp_trie_lookup_elem will go down a wrong
> > path, but won't crash, right? Can we do the hlist_nulls trick to address that?
> > In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> > Many kernel data structures know how to deal with such object reuse.
> > We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> > construct the logic in a way that obj reuse is not problematic.
>
> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> don't know how to do it for get_next_key because the iteration result needs to
> be ordered and cannot skip elements that existed before the iteration begins.

imo it's fine to spin_lock in get_next_key (see the sketch at the end of this
mail).
We should measure the lock overhead in lookup. It might be acceptable too.

> If we remove immediate reuse from bpf_mem_alloc, besides the possibly
> decreased performance, is there any reason we cannot do that?

What do you mean? Always do call_rcu + call_rcu_tasks_trace for every
bpf_mem_free?
As I said above:
"
call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
may very well exhaust memory.
"
As an exercise try samples/bpf/map_perf_test on a non-prealloc hashmap before
the mem_alloc conversion. Just regular call_rcu consumes 100% of all cpus.
With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.
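For the get_next_key part, here is the kind of thing I mean. Just a sketch:
struct qp_trie, trie->lock and the qp_trie_lookup_next_node() signature are
guesses, not taken from your patch.

/* sketch: do the ordered iteration step under the trie lock so that branch
 * node reuse cannot confuse the parent-pointer walk.  All names below
 * (struct qp_trie, trie->lock, qp_trie_lookup_next_node) are assumed.
 */
static int qp_trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
{
	struct qp_trie *trie = container_of(map, struct qp_trie, map);
	unsigned long flags;
	int err;

	spin_lock_irqsave(&trie->lock, flags);
	err = qp_trie_lookup_next_node(trie, key, next_key);
	spin_unlock_irqrestore(&trie->lock, flags);

	return err;
}

get_next_key is a slow syscall path anyway, so holding the lock for the whole
next-node walk shouldn't matter.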