On Mon, Sep 26, 2022 at 8:08 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/27/2022 9:19 AM, Alexei Starovoitov wrote:
> > On Mon, Sep 26, 2022 at 09:18:46PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> >>> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> >>>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >> SNIP
> >>>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
> >>>> the patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
> >>>> a simple try and encountered some problems. One problem is the immediate
> >>>> reuse of freed objects in the bpf memory allocator. Because qp-trie uses the
> >>>> bpf memory allocator to allocate and free qp_trie_branch, if a qp_trie_branch
> >>>> is reused immediately, the lookup procedure may oops due to the incorrect
> >>>> content in qp_trie_branch. Another problem is that the size limitation in
> >>>> bpf_mem_alloc() is 4096. It may be a little small for the total of key size
> >>>> and value size, but maybe I can use two separate bpf_mem_alloc instances for
> >>>> key and value.
> >>> 4096 limit for key+value size would be an acceptable trade-off.
> >>> With kptrs the user will be able to extend the value to much bigger sizes
> >>> while doing <= 4096 allocations at a time. Larger allocations are failing
> >>> in production more often than not. Any algorithm relying on a successful
> >>> >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> >>> the kernel is using, but we're not there yet in bpf land.
> >>> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> >>> qp-trie would work in all contexts including sleepable progs.
> >>> As presented the use cases for qp-trie are quite limited.
> >>> If I understand correctly the concern for not using bpf_mem_alloc
> >>> is that qp_trie_branch can be reused. Can you provide an exact scenario
> >>> that will cause issues?
> >> The usage of a branch node during lookup is as follows:
> >> (1) check the index field of the branch node, which records the position of
> >> the nibble in which the keys of the child nodes differ
> >> (2) calculate the index of the child node by using the nibble value of the
> >> lookup key at that position
> >> (3) get the pointer to the child node by dereferencing the variable-length
> >> pointer array in the branch node
> >>
> >> Because both branch nodes and leaf nodes have variable length, I used one
> >> bpf_mem_alloc for these two node types, so if a leaf node is reused as a
> >> branch node, the pointer obtained in step 3 may be invalid.
> >>
> >> If separate bpf_mem_alloc instances are used for branch nodes and leaf nodes,
> >> it may still be problematic, because the updates to a reused branch node are
> >> not atomic and branch nodes with different numbers of child nodes will reuse
> >> the same object due to size alignment in the allocator, so the lookup
> >> procedure below may get an uninitialized pointer from the pointer array:
> >>
> >> lookup procedure                     update procedure
> >>
> >> // three child nodes, 48-bytes
> >> branch node x
> >>                                      // four child nodes, 56-bytes
> >>                                      reuse branch node x
> >>                                      x->bitmap = 0xf
> >> // got an uninitialized pointer
> >> x->nodes[3]
> >>                                      Initialize x->nodes[0~3]
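
For readers following along, the node layout implied by the three lookup steps
and the diagram above is roughly the sketch below. It is reconstructed only from
the fields named in this thread (parent, index, bitmap, nodes[]); the actual
struct in the posted patch may differ in layout and have additional fields.

#include <linux/types.h>
#include <linux/rcupdate.h>

/* Sketch only: a variable-size interior node as described above. Branches
 * with different numbers of children can still land in the same
 * bpf_mem_alloc size class, which is why a freed branch can be handed back
 * as a differently-sized one.
 */
struct qp_trie_branch {
	struct qp_trie_branch __rcu *parent; /* reverse walk for get_next_key() */
	unsigned int index;   /* position of the nibble where child keys differ */
	unsigned int bitmap;  /* one bit per present child slot */
	void __rcu *nodes[];  /* variable-length array of child pointers:
			       * branch nodes or leaf nodes */
};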
> > Looking at lookup:
> > +	while (is_branch_node(node)) {
> > +		struct qp_trie_branch *br = node;
> > +		unsigned int bitmap;
> > +		unsigned int iip;
> > +
> > +		/* When byte index equals with key len, the target key
> > +		 * may be in twigs->nodes[0].
> > +		 */
> > +		if (index_to_byte_index(br->index) > data_len)
> > +			goto done;
> > +
> > +		bitmap = calc_br_bitmap(br->index, data, data_len);
> > +		if (!(bitmap & br->bitmap))
> > +			goto done;
> > +
> > +		iip = calc_twig_index(br->bitmap, bitmap);
> > +		node = rcu_dereference_check(br->nodes[iip], rcu_read_lock_bh_held());
> > +	}
> >
> > To be safe the br->index needs to be initialized after br->nodes and br->bitmap.
> > While deleting, br->index can be set to a special value which would mean
> > "restart the lookup from the beginning".
> > As you're suggesting, with smp_rmb/wmb pairs the lookup will only see a valid br.
> > Also the race is extremely tight, right?
> > After br->nodes[iip] + is_branch_node that memory needs to be deleted on another
> > cpu after spin_lock and reused in an update after another spin_lock.
> > Without an artificially big delay it's hard to imagine how the nodes[iip] pointer
> > would be initialized to some other qp_trie_branch or leaf during delete,
> > then the memory reused and nodes[iip] initialized again with the same address.
> > Theoretically possible, but unlikely, right?
> > And with correct ordering of scrubbing and updates to
> > br->nodes, br->bitmap, br->index it can be made safe.
> The reuse of a node not only introduces a safety problem (e.g. accessing an
> invalid pointer), but also incurs a false-negative problem (e.g. an existing
> element can not be found) as shown below:
>
> lookup A in X on CPU 1               update X on CPU 2
>
> [ branch X v1 ]
> leaf A | leaf B | leaf C
>                                      [ branch X v2 ]
>                                      leaf A | leaf B | leaf C | leaf D
>                                      // free and reuse branch X v1
>                                      [ branch X v1 ]
>                                      leaf O | leaf P | leaf Q
> // leaf A can not be found

Right. That's why I suggested to consider an hlist_nulls-like approach such as
the one htab is using.

> > We can add a sequence number to qp_trie_branch as well and read it before
> > and after. Every reuse would inc the seq.
> > If the seq number differs, re-read the node pointer from the parent.
> A seq number on qp_trie_branch is a good idea. Will try it. But we also need to
> consider the starvation of lookup by update/deletion. Maybe we need to fall back
> to the subtree spinlock after some number of rereads.

I think the fallback is overkill. The race is extremely unlikely.

> >> The problem may be solved by zeroing the unused part or the whole of the
> >> allocated object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure that
> >> the update of the node array happens before the update of the bitmap is also
> >> OK, and the cost will be much cheaper on x86 hosts.
> > Something like this, right.
> > We can also consider doing the lookup under spin_lock. For a large branchy trie
> > the cost of spin_lock may be negligible.
> Do you mean adding an extra spinlock to qp_trie_branch to protect against reuse,
> or taking the subtree spinlock during lookup? IMO the latter will make the
> lookup performance suffer, but I will check it as well.

subtree lock. lookup perf will suffer a bit.
The numbers will tell the true story.

> >
> >> Beside the lookup procedure, get_next_key() from the syscall also looks up the
> >> trie locklessly. If a branch node is reused, the order of the returned keys may
> >> be broken. There is also a parent pointer in the branch node which is used for
> >> reverse lookup during get_next_key; the reuse may lead to unexpected skips in
> >> the iteration.
> > qp_trie_lookup_next_node can be done under spin_lock.
> > Iterating all map elements is a slow operation anyway.
> OK. Taking the subtree spinlock is simpler but the scalability will be bad. Not
> sure whether or not the solution for lockless lookup will work for get_next_key.
> Will check.
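
To make the seq-number idea above concrete, the read side could look roughly
like the sketch below. It is an illustration only, not the patch: reuse_seq is
a hypothetical field (added to the branch layout sketched earlier) that the
free/reuse path would increment with matching write-side ordering, and
calc_br_bitmap()/calc_twig_index() are the helpers from the quoted lookup code.

/* Illustration only: walk one branch node and detect reuse via a
 * hypothetical reuse_seq that is bumped every time the object is recycled.
 * On a mismatch the caller re-reads the child pointer from the parent (or
 * restarts from the root) instead of trusting what was just read.
 */
static void *qp_trie_walk_branch(struct qp_trie_branch *br,
				 const unsigned char *data,
				 unsigned int data_len, bool *retry)
{
	unsigned int seq, bitmap, iip;
	void *child = NULL;

	*retry = false;

	/* Pairs with the increment + barrier done when the node is reused. */
	seq = smp_load_acquire(&br->reuse_seq);

	bitmap = calc_br_bitmap(br->index, data, data_len);
	if (bitmap & br->bitmap) {
		iip = calc_twig_index(br->bitmap, bitmap);
		child = rcu_dereference_check(br->nodes[iip],
					      rcu_read_lock_bh_held());
	}

	/* Ensure the reads above complete before re-checking the seq. */
	smp_rmb();
	if (seq != READ_ONCE(br->reuse_seq))
		*retry = true;  /* branch was recycled under us */

	return child;
}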
What kind of scalability are you concerned about?
get_next is done by user space only. Plenty of overhead already.

> >
> >>> Instead of call_rcu in qp_trie_branch_free (which will work only for
> >>> regular progs and have high overhead as demonstrated by the mem_alloc patches)
> >>> the qp-trie freeing logic can scrub that element, so it's ready to be
> >>> reused as another struct qp_trie_branch.
> >>> I guess I'm missing how rcu protects these internal data structures of qp-trie.
> >>> The rcu_read_lock of a regular bpf prog helps to stay lock-less during lookup?
> >>> Is that it?
> >> Yes. The update is made atomic by copying the parent branch node to a new branch
> >> node and replacing the pointer to the parent branch node by the new branch node,
> >> so the lookup procedure finds either the old branch node or the new branch node.
> >>> So to make qp-trie work in sleepable progs the algo would need to
> >>> be changed to do both call_rcu and call_rcu_tasks_trace everywhere
> >>> to protect these inner structs?
> >>> call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
> >>> around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
> >>> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> >>> since the rcu_tasks_trace gp is observed only when freeing into the global
> >>> mem pool.
> >>> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> >>> What is the worst that can happen? qp_trie_lookup_elem will go into the wrong
> >>> path, but won't crash, right? Can we do the hlist_nulls trick to address that?
> >>> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> >>> Many kernel data structures know how to deal with such object reuse.
> >>> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> >>> construct the logic in a way that obj reuse is not problematic.
> >> As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
> >> don't know how to do it for get_next_key because the iteration result needs to
> >> be ordered and can not skip elements that existed before the iteration began.
> > imo it's fine to spin_lock in get_next_key.
> > We should measure the lock overhead in lookup. It might be acceptable too.
> Will check that.
> >
> >> If the immediate reuse were removed from bpf_mem_alloc, besides the possibly
> >> decreased performance, is there any reason we can not do that?
> > What do you mean?
> > Always do call_rcu + call_rcu_tasks_trace for every bpf_mem_free?
> Yes. Does doing call_rcu() + call_rcu_tasks_trace() in a batch help, just like
> free_bulk does?
> > As I said above:
> > " call_rcu_tasks_trace can take a long time. So qp_trie_branch-s may linger
> > around. So quick update/delete (in sleepable with call_rcu_tasks_trace)
> > may very well exhaust memory.
> > "
> > As an exercise try samples/bpf/map_perf_test on the non-prealloc hashmap
> > before the mem_alloc conversion. Just regular call_rcu consumes 100% of all
> > cpus. With call_rcu_tasks_trace it's worse. It cannot sustain such a flood.
> > .
> Will check the result of map_perf_test. But it seems bpf_mem_alloc may still
> exhaust memory if __free_rcu_tasks_trace() can not be called in time. Will take
> a closer look at that.

In theory, yes. The batching makes a big difference.
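
For completeness, the per-object variant discussed above, i.e. call_rcu()
chained into call_rcu_tasks_trace() for every freed node, would look roughly
like the sketch below. The node header and function names are hypothetical;
it only illustrates why paying two back-to-back grace periods per object is
so much heavier than the batched free_bulk() path in bpf_mem_alloc.

#include <linux/container_of.h>
#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

/* Hypothetical header embedded in every qp-trie node. */
struct qp_trie_node_hdr {
	struct rcu_head rcu;
	/* node payload follows */
};

static void node_free_after_tasks_trace_gp(struct rcu_head *rcu)
{
	/* Both a regular RCU grace period and an RCU tasks trace grace
	 * period have passed, so neither non-sleepable nor sleepable progs
	 * can still be walking this node.
	 */
	kfree(container_of(rcu, struct qp_trie_node_hdr, rcu));
}

static void node_free_after_rcu_gp(struct rcu_head *rcu)
{
	/* Regular RCU readers are done; now wait for rcu_read_lock_trace()
	 * readers as well. This second grace period is the long one, and
	 * paying it per object is what lets freed nodes pile up under a
	 * quick update/delete load.
	 */
	call_rcu_tasks_trace(rcu, node_free_after_tasks_trace_gp);
}

static void qp_trie_node_free(struct qp_trie_node_hdr *node)
{
	call_rcu(&node->rcu, node_free_after_rcu_gp);
}

bpf_mem_alloc avoids this per-object cost by batching: freed objects go back
to a per-cpu cache immediately (hence the reuse being discussed), and the
tasks-trace grace period is only observed when a batch is returned to the
global pool.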