Hi,

On 9/29/2022 11:22 AM, Alexei Starovoitov wrote:
> On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
>>> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>
>>> Looks like the perf is lost on atomic_inc/dec.
>>> Try a partial revert of mem_alloc.
>>> In particular to make sure
>>> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
>>> is reverted and call_rcu is in place,
>>> but percpu counter optimization is still there.
>>> Also please use 'map_perf_test 4'.
>>> I doubt 1000 vs 10240 will make a difference, but still.
>>>
>> I have tried the following two setups:
>> (1) Don't use bpf_mem_alloc in hash-map and use per-cpu counter in hash-map
>>
>> # Samples: 1M of event 'cycles:ppp'
>> # Event count (approx.): 1041345723234
>> #
>> # Overhead  Command        Shared Object     Symbol
>> # ........  .............  ................  ..............................
>> #
>>    10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0
> That is per-cpu counter and it's consuming 10% ?!
> Something is really odd in your setup.
> A lot of debug configs?

Sorry for the late reply. I am just back at work from a long vacation.

My local .config is derived from the Fedora distribution config, and it
indeed has some DEBUG-related options enabled. I will turn these configs
off and check again :)

>>     9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
>>     4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
> clearly debug build.
> Please use production build.

check_preemption_disabled() comes from CONFIG_DEBUG_PREEMPT, which is
enabled in the Fedora distribution config.

>>     2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
>>     2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
>>     2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
>>     2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
>>     2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
>>     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
>>     2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
>>     2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
>>     2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>>     1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
>>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
>>     1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu

SNIP

>> Maybe adding a not-immediate-reuse flag to bpf_mem_alloc is reasonable.
>> What do you think?
> We've discussed it twice already. It's not an option due to OOM
> and performance considerations.
> call_rcu doesn't scale to millions a second.

Understood. I was just trying to understand the exact performance overhead
of call_rcu(). If the overhead of map operations is much greater than the
overhead of call_rcu(), then calling call_rcu() a million times a second
would not be a problem, and it would also make the implementation of
qp-trie much simpler. OOM is indeed a problem, although it can also occur
with the current implementation, so I will try to implement a lookup
procedure that handles the reuse problem.

Regards,
Tao
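
P.S. To make some of the points above more concrete, a few sketches follow.
First, why check_preemption_disabled() shows up in the profile at all: with
CONFIG_DEBUG_PREEMPT=y, every __this_cpu_*() operation verifies on each call
that preemption is really disabled. A minimal illustration (the counter below
is made up for the example, it is not the actual htab code):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(long, example_counter);

static void example_bump(void)
{
	preempt_disable();
	/* With CONFIG_DEBUG_PREEMPT=y this expands to
	 * __this_cpu_preempt_check() -> check_preemption_disabled()
	 * on every call; in a production build it is a plain add.
	 */
	__this_cpu_inc(example_counter);
	preempt_enable();
}

So with a Fedora-derived debug config, the per-cpu counter path likely pays
this check on every map update/delete, which would account for a good part
of the 4.24% above.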
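Second, the call_rcu()-based freeing that the partial revert restores, i.e.
deferring any reuse of an element until a grace period has passed. This is
only a sketch of the pattern with invented names, not the actual htab code:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct example_elem {
	struct rcu_head rcu;
	char data[64];
};

static void example_elem_free_rcu(struct rcu_head *head)
{
	struct example_elem *e = container_of(head, struct example_elem, rcu);

	kfree(e);
}

static void example_elem_free(struct example_elem *e)
{
	/* Readers under rcu_read_lock() may still hold a pointer to 'e',
	 * so defer the free (and hence any reuse of the memory) until a
	 * grace period has elapsed. Safe for readers, but each free costs
	 * a call_rcu() invocation, which is the part that does not scale
	 * to millions of frees per second.
	 */
	call_rcu(&e->rcu, example_elem_free_rcu);
}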
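Finally, one possible shape of a lookup that tolerates immediate reuse: a
seqcount-style re-check of a per-element generation number after copying
the value out. This is purely illustrative with made-up names; it is not
what the qp-trie patches do, and the real scheme may look quite different:

#include <linux/seqlock.h>
#include <linux/types.h>

struct example_node {
	seqcount_t gen;		/* bumped around every reuse of the node */
	u32 key;
	u64 value;
};

static bool example_lookup(struct example_node *n, u32 key, u64 *value)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&n->gen);
		if (n->key != key)
			return false;
		*value = n->value;
		/* If the node was reused while we were reading it, the
		 * copy above may be inconsistent; retry the read section.
		 */
	} while (read_seqcount_retry(&n->gen, seq));

	return true;
}

/* The reuse path would wrap its re-initialization of the node in
 * write_seqcount_begin(&n->gen) / write_seqcount_end(&n->gen).
 */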