On Fri, Oct 07, 2022 at 06:59:08PM -0700, Alexei Starovoitov wrote:
> On Fri, Oct 7, 2022 at 6:56 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> > > On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >> Hi,
> > >>
> > >> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > >>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>> Looks like the perf is lost on atomic_inc/dec.
> > >>> Try a partial revert of mem_alloc.
> > >>> In particular to make sure
> > >>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > >>> is reverted and call_rcu is in place,
> > >>> but percpu counter optimization is still there.
> > >>> Also please use 'map_perf_test 4'.
> > >>> I doubt 1000 vs 10240 will make a difference, but still.
> > >>>
> > >> I have tried the following two setups:
> > >> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
> > >> # Samples: 1M of event 'cycles:ppp'
> > >> # Event count (approx.): 1041345723234
> > >> #
> > >> # Overhead  Command        Shared Object     Symbol
> > >> # ........  .............  ................  ..............................
> > >> #
> > >>     10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0
> > > That is per-cpu counter and it's consuming 10% ?!
> > > Something is really odd in your setup.
> > > A lot of debug configs?
> > Sorry for the late reply. Just back to work from a long vacation.
> >
> > My local .config is derived from Fedora distribution. It indeed has some DEBUG
> > related configs. Will turn these configs off to check it again :)
> > >>      9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
> > >>      4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
> > > clearly debug build.
> > > Please use production build.
> > check_preemption_disabled is due to CONFIG_DEBUG_PREEMPT. And it is enabled on
> > Fedora distribution.
> > >>      2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> > >>      2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
> > >>      2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
> > >>      2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
> > >>      2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
> > >>      2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
> > >>      2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
> > >>      2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
> > >>      2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
> > >>      1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> > >>      1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
> > >>      1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
> > >>      1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
> > SNIP
> > >> Maybe adding a not-immediate-reuse flag support to bpf_mem_alloc is reasonable. What do
> > >> you think?
> > > We've discussed it twice already. It's not an option due to OOM
> > > and performance considerations.
> > > call_rcu doesn't scale to millions a second.
> > Understand. I was just trying to understand the exact performance overhead of
> > call_rcu().
> > If the overhead of map operations is much greater than the overhead
> > of call_rcu(), I think calling call_rcu() one million times a second will not be a
> > problem, and it also makes the implementation of qp-trie much simpler. The
> > OOM problem is indeed a problem, although it is also possible for the current
> > implementation, so I will try to implement the lookup procedure which handles
> > the reuse problem.
>
> call_rcu is not just that particular function.
> It's all the work the RCU subsystem needs to do to observe a grace period
> and execute that callback. Just see how many kthreads it will
> start when overloaded like this.

The kthreads to watch include rcu_preempt, rcu_sched, ksoftirqd*, rcuc*,
and rcuo*. There is also the back-of-interrupt softirq context, which
requires some care to measure accurately.

The possibility of SLAB_TYPESAFE_BY_RCU has been discussed. I take it that
the per-element locking overhead needed for exact iterations was a problem?
If so, what exactly are the consistency rules for iteration? Presumably
something stronger than "if the element existed throughout, it is included
in the iteration; if it did not exist throughout, it is not included;
otherwise it might or might not be included", given that you get that much
for free. Either way, could you please tell me the exact iteration rules?
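
In case it helps, here is a rough sketch of the revalidation step that a
SLAB_TYPESAFE_BY_RCU lookup normally needs, which is where the per-element
locking comes in. The struct layout and the __lookup_bucket() helper below
are made up purely for illustration; the point is only the pattern of
rcu_read_lock(), per-element lock, then identity re-check before trusting
the element:

	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>
	#include <linux/types.h>

	struct elem {
		spinlock_t lock;	/* initialized by the cache ctor, not at alloc time */
		u32 key;
		bool live;		/* cleared under ->lock before kmem_cache_free() */
		/* ... value ... */
	};

	/* Hypothetical RCU-protected hash-bucket walk, details omitted. */
	static struct elem *__lookup_bucket(u32 key);

	/*
	 * With a cache created with SLAB_TYPESAFE_BY_RCU, an object seen under
	 * rcu_read_lock() may already have been freed and reused as a *new*
	 * object of the same type, so the reader must lock it and re-check
	 * its identity before using it.
	 */
	static struct elem *elem_lookup(u32 key)
	{
		struct elem *e;

		rcu_read_lock();
		e = __lookup_bucket(key);
		if (e) {
			spin_lock(&e->lock);
			if (!e->live || e->key != key) {
				/* Reused out from under us: treat as a lookup miss. */
				spin_unlock(&e->lock);
				e = NULL;
			}
		}
		rcu_read_unlock();
		return e;	/* if non-NULL, returned with ->lock held */
	}

If the iteration rules require more than the "existed throughout" guarantee
quoted above, each element visited would need that same lock-and-recheck
step, which is presumably the overhead you ran into.

							Thanx, Paul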