Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 9/26/2022 9:25 AM, Alexei Starovoitov wrote:
> On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
SNIP
>> For atomic ops and kmalloc overhead, I think I can reuse the idea from
>> patchset "bpf: BPF specific memory allocator". I have given bpf_mem_alloc
>> a simple try and encounter some problems. One problem is that
>> immediate reuse of freed object in bpf memory allocator. Because qp-trie
>> uses bpf memory allocator to allocate and free qp_trie_branch, if
>> qp_trie_branch is reused immediately, the lookup procedure may oops due
>> to the incorrect content in qp_trie_branch. And another problem is the
>> size limitation in bpf_mem_alloc() is 4096. It may be a little small for
>> the total size of key size and value size, but maybe I can use two
>> separated bpf_mem_alloc for key and value.
> 4096 limit for key+value size would be an acceptable trade-off.
> With kptrs the user will be able to extend value to much bigger sizes
> while doing <= 4096 allocation at a time. Larger allocations are failing
> in production more often than not. Any algorithm relying on successful
>  >= 4096 allocation is likely to fail. kvmalloc is a fallback that
> the kernel is using, but we're not there yet in bpf land.
> The benefits of bpf_mem_alloc in qp-trie would be huge though.
> qp-trie would work in all contexts including sleepable progs.
> As presented the use cases for qp-trie are quite limited.
> If I understand correctly the concern for not using bpf_mem_alloc
> is that qp_trie_branch can be reused. Can you provide an exact scenario
> that will casue issuses?
The usage of branch node during lookup is as follows:
(1) check the index field of branch node which records the position of nibble in
which the keys of child nodes are different
(2) calculate the index of child node by using the nibble value of lookup key in
index position
(3) get the pointer of child node by dereferencing the variable-length pointer
array in branch node

Because both branch node and leaf node have variable length, I used one
bpf_mem_alloc for these two node types, so if a leaf node is reused as a branch
node, the pointer got in step 3 may be invalid.

If using separated bpf_mem_alloc for branch node and leaf node, it may still be
problematic because the updates to a reused branch node are not atomic and
branch nodes with different child node will reuse the same object due to size
alignment in allocator, so the lookup procedure below may get an uninitialized
pointer in the pointer array:

lookup procedure                                update procedure


// three child nodes, 48-bytes
branch node x
                                                              //  four child
nodes, 56-bytes
                                                              reuse branch node x
                                                              x->bitmap = 0xf
// got an uninitialized pointer
x->nodes[3]
                                                              Initialize
x->nodes[0~3]

The problem may can be solved by zeroing the unused or whole part of allocated
object. Maybe adding a paired smp_wmb() and smp_rmb() to ensure the update of
node array happens before the update of bitmap is also OK and the cost will be
much cheaper in x86 host.

Beside lookup procedure, get_next_key() from syscall also lookups trie
locklessly. If the branch node is reused, the order of returned keys may be
broken. There is also a parent pointer in branch node and it is used for reverse
lookup during get_next_key, the reuse may lead to unexpected skip in iteration.
> Instead of call_rcu in qp_trie_branch_free (which will work only for
> regular progs and have high overhead as demonstrated by mem_alloc patches)
> the qp-trie freeing logic can scrub that element, so it's ready to be
> reused as another struct qp_trie_branch.
> I guess I'm missing how rcu protects this internal data structures of qp-trie.
> The rcu_read_lock of regular bpf prog helps to stay lock-less during lookup?
> Is that it?
Yes. The update is made atomic by copying the parent branch node to a new branch
node and replacing the pointer to the parent branch node by the new branch node,
so the lookup procedure either find the old branch node or the new branch node.
> So to make qp-trie work in sleepable progs the algo would need to
> be changed to do both call_rcu and call_rcu_task_trace everywhere
> to protect these inner structs?
> call_rcu_task_trace can take long time. So qp_trie_branch-s may linger
> around. So quick update/delete (in sleepable with call_rcu_task_trace)
> may very well exhaust memory. With bpf_mem_alloc we don't have this issue
> since rcu_task_trace gp is observed only when freeing into global mem pool.
> Say qp-trie just uses bpf_mem_alloc for qp_trie_branch.
> What is the worst that can happen? qp_trie_lookup_elem will go into wrong
> path, but won't crash, right? Can we do hlist_nulls trick to address that?
> In other words bpf_mem_alloc reuse behavior is pretty much SLAB_TYPESAFE_BY_RCU.
> Many kernel data structures know how to deal with such object reuse.
> We can have a private bpf_mem_alloc here for qp_trie_branch-s only and
> construct a logic in a way that obj reuse is not problematic.
As said above, qp_trie_lookup_elem may be OK with SLAB_TYPESAFE_BY_RCU. But I
don't know how to do it for get_next_key because the iteration result needs to
be ordered and can not skip existed elements before the iterations begins.
If removing immediate reuse from bpf_mem_alloc, beside the may-decreased
performance, is there any reason we can not do that ?
>
> Another alternative would be to add explicit rcu_read_lock in qp_trie_lookup_elem
> to protect qp_trie_branch during lookup while using bpf_mem_alloc
> for both qp_trie_branch and leaf nodes, but that's not a great solution either.
> It will allow qp-trie to be usable in sleepable, but use of call_rcu
> in update/delete will prevent qp-trie to be usable in tracing progs.
>
> Let's try to brainstorm how to make qp_trie_branch work like SLAB_TYPESAFE_BY_RCU.
>
> Other than this issue the patches look great. This new map would be awesome addition.
Thanks for that.




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux