Re: [PATCH bpf-next v2 00/13] Add support for qp-trie with dynptr key

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 7 Oct 2022 18:59:08 -0700

On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> Looks like the perf is lost on atomic_inc/dec.
> >>> Try a partial revert of mem_alloc.
> >>> In particular to make sure
> >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> >>> is reverted and call_rcu is in place,
> >>> but percpu counter optimization is still there.
> >>> Also please use 'map_perf_test 4'.
> >>> I doubt 1000 vs 10240 will make a difference, but still.
> >>>
> >> I have tried the following two setups:
> >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> >> # Samples: 1M of event 'cycles:ppp'
> >> # Event count (approx.): 1041345723234
> >> #
> >> # Overhead  Command          Shared Object                                Symbol
> >> # ........  ...............  ...........................................
> >> ...............................................
> >> #
> >>     10.36%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_get_memcg.isra.0
> > That is per-cpu counter and it's consuming 10% ?!
> > Something is really odd in your setup.
> > A lot of debug configs?
> Sorry for the late reply. Just back to work from a long vacation.
>
> My local .config is derived from Fedora distribution. It indeed has some DEBUG
> related configs. Will turn these configs off to check it again :)
> >>      9.82%  map_perf_test    [kernel.vmlinux]                             [k]
> >> bpf_map_kmalloc_node
> >>      4.24%  map_perf_test    [kernel.vmlinux]                             [k]
> >> check_preemption_disabled
> > clearly debug build.
> > Please use production build.
> check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> Fedora distribution.
> >>      2.86%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_update_elem
> >>      2.80%  map_perf_test    [kernel.vmlinux]                             [k]
> >> __kmalloc_node
> >>      2.72%  map_perf_test    [kernel.vmlinux]                             [k]
> >> htab_map_delete_elem
> >>      2.30%  map_perf_test    [kernel.vmlinux]                             [k]
> >> memcg_slab_post_alloc_hook
> >>      2.21%  map_perf_test    [kernel.vmlinux]                             [k]
> >> entry_SYSCALL_64
> >>      2.17%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_exit_to_user_mode
> >>      2.12%  map_perf_test    [kernel.vmlinux]                             [k] jhash
> >>      2.11%  map_perf_test    [kernel.vmlinux]                             [k]
> >> syscall_return_via_sysret
> >>      2.05%  map_perf_test    [kernel.vmlinux]                             [k]
> >> alloc_htab_elem
> >>      1.94%  map_perf_test    [kernel.vmlinux]                             [k]
> >> _raw_spin_lock_irqsave
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_add
> >>      1.92%  map_perf_test    [kernel.vmlinux]                             [k]
> >> preempt_count_sub
> >>      1.87%  map_perf_test    [kernel.vmlinux]                             [k]
> >> call_rcu
> SNIP
> >> Maybe add a not-immediate-reuse flag support to bpf_mem_alloc is reason. What do
> >> you think ?
> > We've discussed it twice already. It's not an option due to OOM
> > and performance considerations.
> > call_rcu doesn't scale to millions a second.
> Understand. I was just trying to understand the exact performance overhead of
> call_rcu(). If the overhead of map operations are much greater than the overhead
> of call_rcu(), I think calling call_rcu() one millions a second will be not a
> problem and  it also makes the implementation of qp-trie being much simpler. The
> OOM problem is indeed a problem, although it is also possible for the current
> implementation, so I will try to implement the lookup procedure which handles
> the reuse problem.

call_rcu is not just that particular function.
It's all the work rcu subsystem needs to do to observe gp
and execute that callback. Just see how many kthreads it will
start when overloaded like this.