On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> A quick benchmark shows the performance is bad when using a subtree lock for lookup:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length range: [1, 255])
> * no lock
> qp-trie lookup (1 thread)     10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (2 thread)     20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (4 thread)     41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (8 thread)     82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (16 thread)   162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)
>
> * subtree lock
> qp-trie lookup (1 thread)      8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (2 thread)     15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (4 thread)     27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (8 thread)     42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
> qp-trie lookup (16 thread)    50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)

That's indeed significant, but I interpret it differently.
Since single-thread performance is close while 16 threads take a 3x hit,
the locking mechanism itself is inefficient. It also means that
update/delete performance doesn't scale either.

> Strings in /proc/kallsyms (key size=83, max entries=170958)
> * no lock
> qp-trie lookup (1 thread)      4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
>
> * subtree lock
> qp-trie lookup (1 thread)      4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)

Here a single thread with spin_lock is _faster_ than without.
So the problem is not the cost of spin_lock but its contention.

All the complexity of a lockless lookup has to be weighed in this context.
Update/delete don't scale anyway, so lockless-lookup complexity is justified
only for workloads with lots of concurrent lookups and few updates/deletes.

When the bpf hash map was added, the majority of tracing use cases had
# of lookups == # of updates == # of deletes.
For qp-trie we obviously cannot predict the workload, but we should not
pivot strongly toward a lockless lookup without data.
A lockless lookup should have reasonable complexity. Otherwise let's do
spin_lock and work on a different locking scheme for all operations
(lookup/update/delete); a rough sketch of both lookup schemes is at the
end of this mail.

> I cannot reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
> local environment, could you share the setup for it?
>
> The following is the output of perf report (--no-children) for "./map_perf_test
> 4 72 10240 100000" on an x86-64 host with 72 cpus:
>
>   26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>   21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem

Looks like the performance is being lost on atomic_inc/dec.
Try a partial revert of mem_alloc. In particular, make sure that commit
0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
is reverted so that call_rcu is back in place, but keep the percpu counter
optimization. Also please use 'map_perf_test 4'. I doubt 1000 vs 10240
will make a difference, but still.
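
In case it helps to clarify what "call_rcu is in place" means after such a
partial revert: the shape is per-element deferred freeing, roughly as below.
This is only a sketch with made-up names (my_elem, my_elem_free), not the
real kernel/bpf/hashtab.c code:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_elem {
	struct rcu_head rcu;
	/* ... key/value payload ... */
};

static void my_elem_free_rcu(struct rcu_head *head)
{
	struct my_elem *e = container_of(head, struct my_elem, rcu);

	kfree(e);
}

static void my_elem_free(struct my_elem *e)
{
	/* One call_rcu() per deleted element. With a high update/delete
	 * rate these callbacks are the overhead that the optimization
	 * removed; the revert brings them back so the profiles can be
	 * compared.
	 */
	call_rcu(&e->rcu, my_elem_free_rcu);
}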
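
And going back to the lookup numbers at the top, here is a rough sketch of
the two lookup schemes being compared, just to make the trade-off concrete.
The types and the qp_trie_walk() helper are made up for illustration; this
is not the actual qp-trie patch:

#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

/* Hypothetical types/helpers, for illustration only. */
struct qp_trie_node;
struct qp_trie_subtree {
	spinlock_t lock;
	struct qp_trie_node __rcu *root;
};
void *qp_trie_walk(struct qp_trie_node *node, const void *key, u32 key_len);

/* Subtree lock: simple and correct, but every lookup that lands in the
 * same subtree serializes on st->lock. That is where the drop from
 * ~160M/s to ~50M/s at 16 threads comes from.
 */
static void *lookup_with_subtree_lock(struct qp_trie_subtree *st,
				      const void *key, u32 key_len)
{
	unsigned long flags;
	void *val;

	spin_lock_irqsave(&st->lock, flags);
	val = qp_trie_walk(rcu_dereference_protected(st->root,
						     lockdep_is_held(&st->lock)),
			   key, key_len);
	spin_unlock_irqrestore(&st->lock, flags);
	return val;
}

/* Lockless lookup: readers only take rcu_read_lock(); writers still
 * serialize on st->lock, publish nodes with rcu_assign_pointer() and
 * free them via call_rcu(). Readers scale, but all the ordering
 * complexity moves into the update/delete path.
 */
static void *lookup_lockless(struct qp_trie_subtree *st,
			     const void *key, u32 key_len)
{
	void *val;

	rcu_read_lock();
	val = qp_trie_walk(rcu_dereference(st->root), key, key_len);
	rcu_read_unlock();
	return val;
}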