On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> Hi,
>
> The initial motivation for the qp-trie map is to reduce memory usage for
> string keys, especially those with large differences in length, as
> discussed in [0]. And as a big-endian, lexicographically ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now that the basic functionality of qp-trie is ready, I am posting it to
> get more feedback or suggestions about qp-trie, especially feedback on
> the following questions:
>
> (1) Use cases for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be a speed-up of lookup operations due to the lower tree
> depth of qp-trie, and the performance of update may also increase.
> But are there any other use cases for qp-trie? Especially cases which
> need both ordering and memory efficiency, or cases in which qp-trie will
> have high fan-out and its lookup performance will be much better than a
> hash table, as shown below:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length range: [1, 255])
> htab lookup    (1 thread)     4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
> htab lookup    (2 threads)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
> htab lookup    (4 threads)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
> htab lookup    (8 threads)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
> htab lookup    (16 threads)  79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
>
> qp-trie lookup (1 thread)    10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
> qp-trie lookup (2 threads)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
> qp-trie lookup (4 threads)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
> qp-trie lookup (8 threads)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
> qp-trie lookup (16 threads) 164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
>
> * non-zero drops
> are due to duplicated keys among the generated keys.
>
> (2) Improve update/delete performance for qp-trie
> The current top-5 overheads in update/delete operations are:
>
>   21.23%  bench  [kernel.vmlinux]  [k] qp_trie_update_elem
>   13.98%  bench  [kernel.vmlinux]  [k] qp_trie_delete_elem
>    7.96%  bench  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>    5.16%  bench  [kernel.vmlinux]  [k] memcpy_erms
>    5.00%  bench  [kernel.vmlinux]  [k] __kmalloc_node
>
> The top-2 overheads are due to memory accesses and the atomic ops on
> max_entries. I tried memory prefetching, but it didn't work out; maybe
> I did it wrong. For the subtree spinlock overhead, I also tried
> hierarchical locking with a hand-over-hand lock scheme, but it didn't
> scale well [1]. I will try increasing the number of subtrees from 256
> to 1024, 4096 or more and check whether it makes any difference.
>
> For the atomic ops and kmalloc overhead, I think I can reuse the idea
> from the patchset "bpf: BPF specific memory allocator". I have given
> bpf_mem_alloc a simple try and encountered some problems. One problem
> is the immediate reuse of freed objects in the BPF memory allocator.
> Because qp-trie uses the BPF memory allocator to allocate and free
> qp_trie_branch objects, if a qp_trie_branch is reused immediately, the
> lookup procedure may oops due to incorrect content in the
> qp_trie_branch. Another problem is that the size limit in
> bpf_mem_alloc() is 4096. It may be a little small for the total of key
> size and value size, but maybe I can use two separate bpf_mem_allocs
> for key and value.

The 4096 limit for key+value size would be an acceptable trade-off. With
kptrs the user will be able to extend the value to much bigger sizes
while doing <= 4096 allocations at a time. Larger allocations fail in
production more often than not. Any algorithm relying on allocations of
4096 bytes or more succeeding is likely to fail. kvmalloc is a fallback
that the kernel uses, but we're not there yet in bpf land.
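The two-allocator idea could look roughly like the sketch below. This is
a purely illustrative userspace sketch, not code from the patchset: the
names `qp_trie_leaf`, `pool_alloc`, and `leaf_new` are made up here, and
`malloc` with a 4096-byte cap stands in for two separate
bpf_mem_alloc-style pools, so that neither the key nor the value
allocation ever exceeds the per-allocation limit:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-allocation cap, mirroring the bpf_mem_alloc() limit */
#define ALLOC_CAP 4096

/* A leaf that keeps key and value as two separate allocations instead
 * of one combined key+value buffer, so each piece stays under the cap.
 */
struct qp_trie_leaf {
	uint32_t key_len;
	uint32_t value_len;
	void *key;	/* from the key pool,   <= ALLOC_CAP bytes */
	void *value;	/* from the value pool, <= ALLOC_CAP bytes */
};

/* Stand-in for an allocation from a capped bpf_mem_alloc-style pool */
static void *pool_alloc(size_t size)
{
	return size <= ALLOC_CAP ? malloc(size) : NULL;
}

static struct qp_trie_leaf *leaf_new(const void *key, uint32_t key_len,
				     const void *value, uint32_t value_len)
{
	struct qp_trie_leaf *leaf = malloc(sizeof(*leaf));

	if (!leaf)
		return NULL;
	leaf->key = pool_alloc(key_len);
	leaf->value = pool_alloc(value_len);
	if (!leaf->key || !leaf->value) {
		free(leaf->key);
		free(leaf->value);
		free(leaf);
		return NULL;
	}
	memcpy(leaf->key, key, key_len);
	memcpy(leaf->value, value, value_len);
	leaf->key_len = key_len;
	leaf->value_len = value_len;
	return leaf;
}
```

With a single combined allocation, a 3000-byte key plus a 3000-byte
value would already blow past the cap; split across two pools, each
piece fits comfortably.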
The benefits of bpf_mem_alloc in qp-trie would be huge though. qp-trie
would work in all contexts, including sleepable progs. As presented, the
use cases for qp-trie are quite limited.

If I understand correctly, the concern with using bpf_mem_alloc is that
a qp_trie_branch can be reused. Can you provide an exact scenario that
will cause issues?

Instead of call_rcu in qp_trie_branch_free (which will work only for
regular progs and has high overhead, as demonstrated by the mem_alloc
patches) the qp-trie freeing logic can scrub that element, so it's ready
to be reused as another struct qp_trie_branch.

I guess I'm missing how rcu protects these internal data structures of
qp-trie. The rcu_read_lock of a regular bpf prog helps to stay lock-less
during lookup? Is that it?

So to make qp-trie work in sleepable progs the algorithm would need to
be changed to do both call_rcu and call_rcu_tasks_trace everywhere to
protect these inner structs?

call_rcu_tasks_trace can take a long time, so qp_trie_branch-s may
linger around, and quick update/delete (in sleepable progs, with
call_rcu_tasks_trace) may very well exhaust memory. With bpf_mem_alloc
we don't have this issue, since the rcu_tasks_trace grace period is
observed only when freeing into the global mem pool.

Say qp-trie just uses bpf_mem_alloc for qp_trie_branch. What is the
worst that can happen? qp_trie_lookup_elem will go down a wrong path,
but won't crash, right? Can we do the hlist_nulls trick to address that?

In other words, the bpf_mem_alloc reuse behavior is pretty much
SLAB_TYPESAFE_BY_RCU. Many kernel data structures know how to deal with
such object reuse. We can have a private bpf_mem_alloc here for
qp_trie_branch-s only and construct the logic in a way that object reuse
is not problematic.

Another alternative would be to add an explicit rcu_read_lock in
qp_trie_lookup_elem to protect qp_trie_branch during lookup while using
bpf_mem_alloc for both qp_trie_branch and leaf nodes, but that's not a
great solution either.
It would allow qp-trie to be usable in sleepable progs, but the use of
call_rcu in update/delete would prevent qp-trie from being usable in
tracing progs. Let's try to brainstorm how to make qp_trie_branch work
like SLAB_TYPESAFE_BY_RCU.

Other than this issue the patches look great. This new map would be an
awesome addition.
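As one possible starting point for that brainstorm: the common way of
tolerating SLAB_TYPESAFE_BY_RCU-style reuse (in the same spirit as the
hlist_nulls trick) is to tag each object with a generation counter that
is bumped on free, and have the lock-less reader re-validate the
generation after reading. The userspace sketch below is purely
illustrative; `node`, `node_read`, and `node_free_to_pool` are invented
names, not code from the patchset:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of tolerating immediate object reuse: the node
 * carries a generation counter that is bumped every time the node is
 * freed back to the pool.  A lock-less reader snapshots the generation
 * before reading the node's fields and re-checks it afterwards; a
 * mismatch means the node was recycled under the reader, which must
 * then retry the traversal from the root.
 */
struct node {
	_Atomic uint64_t gen;	/* bumped on every free/reuse */
	uint32_t bitmap;	/* payload the reader consumes */
};

static void node_free_to_pool(struct node *n)
{
	/* scrub the element here, then bump the generation so any
	 * concurrent reader detects the reuse
	 */
	atomic_fetch_add_explicit(&n->gen, 1, memory_order_release);
}

/* Returns 0 and copies out the payload on success, -1 if the node was
 * recycled mid-read and the caller must retry.  (A kernel version
 * would use READ_ONCE() for the payload load.)
 */
static int node_read(struct node *n, uint32_t *bitmap)
{
	uint64_t g = atomic_load_explicit(&n->gen, memory_order_acquire);

	*bitmap = n->bitmap;
	if (atomic_load_explicit(&n->gen, memory_order_acquire) != g)
		return -1;	/* reused under us: retry from the root */
	return 0;
}
```

A reader that snapshotted the generation before a concurrent free sees
the mismatch and retries instead of trusting the recycled contents,
which is exactly the property qp_trie_lookup_elem would need on top of a
private bpf_mem_alloc pool.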