Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> A quick benchmark show the performance is bad when using subtree lock for lookup:
>>
>> Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
>> * no lock
>> qp-trie lookup (1  thread)   10.250 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (2  thread)   20.466 ± 0.009M/s (drops 0.010 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (4  thread)   41.211 ± 0.010M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (8  thread)   82.933 ± 0.409M/s (drops 0.031 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (16 thread)  162.615 ± 0.842M/s (drops 0.070 ± 0.000M/s mem 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup (1  thread)    8.990 ± 0.506M/s (drops 0.006 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (2  thread)   15.908 ± 0.141M/s (drops 0.004 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (4  thread)   27.551 ± 0.025M/s (drops 0.019 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (8  thread)   42.040 ± 0.241M/s (drops 0.018 ± 0.000M/s mem 0.000 MiB)
>> qp-trie lookup (16 thread)   50.884 ± 0.171M/s (drops 0.012 ± 0.000M/s mem 0.000 MiB)
> That's indeed significant.
> But I interpret it differently.
> Since single thread perf is close enough while 16 thread
> suffers 3x it means the lock mechanism is inefficient.
> It means update/delete performance equally doesn't scale.
>
>> Strings in /proc/kallsyms (key size=83, max entries=170958)
>> * no lock
>> qp-trie lookup (1 thread)    4.096 ± 0.234M/s (drops 0.249 ± 0.014M/s mem 0.000 MiB)
>>
>> * subtree lock
>> qp-trie lookup (1 thread)    4.454 ± 0.108M/s (drops 0.271 ± 0.007M/s mem 0.000 MiB)
> Here a single thread with spin_lock is _faster_ than without.

That is a little strange, because there is no extra overhead in the lockless lookup at
the moment. It may be due to CPU frequency boost or something similar. I will check it
again.

> So it's not about the cost of spin_lock, but its contention.
> So all the complexity to do lockless lookup
> needs to be considered in this context.
> Looks like update/delete don't scale anyway.

Yes.

> So lock-less lookup complexity is justified only
> for the case with a lot of concurrent lookups and
> little update/delete.
> When bpf hash map was added the majority of tracing use cases
> had # of lookups == # of updates == # of deletes.
> For qp-trie we obviously cannot predict,
> but should not pivot strongly into lock-less lookup without data.
> The lock-less lookup should have reasonable complexity.
> Otherwise let's do spin_lock and work on a different locking scheme
> for all operations (lookup/update/delete).

For lpm-trie, doesn't the use case in Cilium [0] have many lockless lookups and few
updates/deletions? I think that use case is applicable to qp-trie as well. Before
steering towards spin_lock, let me implement a demo of the lockless lookup first to see
how complex it is. As for the spin_lock approach, I had already tried hand-over-hand
locking for update/delete (the lookup was still lockless), but it did not scale well
either.
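To make it concrete, hand-over-hand (lock coupling) descent looks roughly like the
simplified sketch below. This is only a generic illustration with a hypothetical,
oversimplified node layout (256-way fan-out, one pthread spinlock per node), not the
actual qp-trie patch; the real trie uses nibble-indexed, bitmap-compressed branches:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical, oversimplified node: one lock per node, 256-way fan-out. */
struct demo_node {
	pthread_spinlock_t lock;
	int is_leaf;
	unsigned int index;		/* which key byte this branch tests */
	struct demo_node *child[256];
};

/* Descend with lock coupling: take the child's lock before dropping the
 * parent's, so writers working in other subtrees are not blocked.  The node
 * is returned still locked; the caller (update/delete) modifies it and then
 * unlocks.
 */
static struct demo_node *descend_locked(struct demo_node *root,
					const unsigned char *key, size_t len)
{
	struct demo_node *cur = root, *next;

	pthread_spin_lock(&cur->lock);
	while (!cur->is_leaf) {
		unsigned char byte = cur->index < len ? key[cur->index] : 0;

		next = cur->child[byte];
		if (!next)
			break;
		/* hand-over-hand step */
		pthread_spin_lock(&next->lock);
		pthread_spin_unlock(&cur->lock);
		cur = next;
	}
	return cur;
}

Note that even with coupling, every writer still has to take the root's lock for the
first step, which is probably part of why it does not scale.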
>
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>     26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>>     21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.

Will do. The suggestion is reasonable. But when max_entries=1000, use_percpu_counter
will be false.
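For reference, my understanding of the use_percpu_counter heuristic is roughly the
shape below. This is paraphrased from memory rather than quoted from
kernel/bpf/hashtab.c, so the exact constant and which cpu count it uses may differ:

#include <stdbool.h>

/* Assumed default batch size of the percpu counter. */
#define PERCPU_COUNTER_BATCH	32

/* The per-cpu element counter is only worthwhile when the map is large
 * compared to nr_cpus * batch; otherwise a plain atomic count is used.
 */
static bool should_use_percpu_counter(unsigned int max_entries,
				      unsigned int nr_online_cpus)
{
	return max_entries / 2 > nr_online_cpus * PERCPU_COUNTER_BATCH;
}

With a check of that shape, the threshold on a 72-cpu host would be about
2 * 72 * 32 = 4608 entries, so max_entries=10240 crosses it while max_entries=1000
does not.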