Hi,

On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> SNIP
>> I can not reproduce the phenomenon that call_rcu consumes 100% of all cpus in my
>> local environment, could you share the setup for it ?
>>
>> The following is the output of perf report (--no-children) for "./map_perf_test
>> 4 72 10240 100000" on a x86-64 host with 72-cpus:
>>
>>   26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>>   21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> Looks like the perf is lost on atomic_inc/dec.
> Try a partial revert of mem_alloc.
> In particular to make sure
> commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> is reverted and call_rcu is in place,
> but percpu counter optimization is still there.
> Also please use 'map_perf_test 4'.
> I doubt 1000 vs 10240 will make a difference, but still.
I have tried the following two setups:

(1) Don't use bpf_mem_alloc in the hash map and use a per-cpu counter in the hash map

# Samples: 1M of event 'cycles:ppp'
# Event count (approx.): 1041345723234
#
# Overhead  Command        Shared Object     Symbol
# ........  .............  ................  ..................................
#
    10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0
     9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
     4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
     2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
     2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
     2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
     2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
     2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
     2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
     1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
     1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu

(2) Use bpf_mem_alloc and a per-cpu counter in the hash map, but without the batched call_rcu optimization, by reverting the following commits:

9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
96da3f7d489d bpf: Remove tracing program restriction on map types
ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.
     5.17%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
     4.53%  map_perf_test  [kernel.vmlinux]  [k] __get_obj_cgroup_from_memcg
     2.97%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
     2.74%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
     2.62%  map_perf_test  [kernel.vmlinux]  [k] kmem_cache_alloc_node
     2.57%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
     2.34%  map_perf_test  [kernel.vmlinux]  [k] jhash
     2.30%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
     2.25%  map_perf_test  [kernel.vmlinux]  [k] obj_cgroup_charge
     2.23%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
     2.17%  map_perf_test  [kernel.vmlinux]  [k] memcpy_erms
     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     2.16%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.14%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
     2.13%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
     2.12%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
     2.00%  map_perf_test  [kernel.vmlinux]  [k] percpu_counter_add_batch
     1.99%  map_perf_test  [kernel.vmlinux]  [k] alloc_bulk
     1.97%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
     1.52%  map_perf_test  [kernel.vmlinux]  [k] mod_objcg_state
     1.36%  map_perf_test  [kernel.vmlinux]  [k] allocate_slab

In both setups the overhead of call_rcu is about 2%, and it is not the biggest overhead, so maybe adding support for a not-immediate-reuse flag to bpf_mem_alloc would be reasonable. What do you think?
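
To make the idea a bit more concrete, here is a rough, untested sketch of how such a flag could behave. The flag name BPF_MA_REUSE_AFTER_RCU_GP, the free_by_rcu list and the reuse_rcu() callback are invented for illustration; the real bpf_mem_cache free path is more involved:

/* Illustrative only: with a hypothetical BPF_MA_REUSE_AFTER_RCU_GP
 * flag set, a freed element is stashed on a per-cpu llist and only
 * becomes allocatable again after an RCU grace period, instead of
 * being immediately reusable.
 */
void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr)
{
	struct bpf_mem_cache *c = this_cpu_ptr(ma->cache);
	struct llist_node *llnode = ptr - LLIST_NODE_SZ;

	if (!(c->flags & BPF_MA_REUSE_AFTER_RCU_GP)) {
		/* current behaviour: element is reusable right away */
		unit_free(c, ptr);
		return;
	}

	/* stash the element instead of putting it on the free list */
	llist_add(llnode, &c->free_by_rcu);

	/* batch grace periods: at most one call_rcu() in flight;
	 * reuse_rcu() would splice free_by_rcu back into the free
	 * list once a grace period has elapsed
	 */
	if (!atomic_xchg(&c->call_rcu_in_progress, 1))
		call_rcu(&c->rcu, reuse_rcu);
}

The cost would be a longer reuse window and a bigger memory footprint, but call_rcu() itself would still be amortized over a whole batch of freed elements.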