On Wed, Sep 28, 2022 at 1:46 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/28/2022 9:08 AM, Alexei Starovoitov wrote:
> > On Tue, Sep 27, 2022 at 7:08 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >
> SNIP
> >> I cannot reproduce the phenomenon where call_rcu consumes 100% of all CPUs in my
> >> local environment; could you share the setup for it?
> >>
> >> The following is the output of perf report (--no-children) for "./map_perf_test
> >> 4 72 10240 100000" on an x86-64 host with 72 CPUs:
> >>
> >>    26.63%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
> >>    21.57%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
> > Looks like the perf cycles are lost on atomic_inc/dec.
> > Try a partial revert of mem_alloc.
> > In particular, make sure
> > commit 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
> > is reverted and call_rcu is in place,
> > but the percpu counter optimization is still there.
> > Also please use 'map_perf_test 4'.
> > I doubt 1000 vs 10240 will make a difference, but still.
> >
> I have tried the following two setups:
>
> (1) Don't use bpf_mem_alloc in the hash map and use a per-cpu counter in the
> hash map.
>
> # Samples: 1M of event 'cycles:ppp'
> # Event count (approx.): 1041345723234
> #
> # Overhead  Command        Shared Object     Symbol
> # ........  .............  ................  ..............................
> #
>    10.36%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_get_memcg.isra.0

That is the per-cpu counter and it's consuming 10%?!
Something is really odd in your setup.
A lot of debug configs?

>     9.82%  map_perf_test  [kernel.vmlinux]  [k] bpf_map_kmalloc_node
>     4.24%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled

Clearly a debug build.
Please use a production build.

>     2.86%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
>     2.80%  map_perf_test  [kernel.vmlinux]  [k] __kmalloc_node
>     2.72%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
>     2.30%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
>     2.21%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
>     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
>     2.12%  map_perf_test  [kernel.vmlinux]  [k] jhash
>     2.11%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
>     2.05%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>     1.94%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
>     1.92%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
>     1.87%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
>
> (2) Use bpf_mem_alloc and the per-cpu counter in the hash map, but without the
> batched call_rcu optimization, by reverting the following commits:
>
> 9f2c6e96c65e bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
> bfc03c15bebf bpf: Remove usage of kmem_cache from bpf_mem_cache.
> 02cc5aa29e8c bpf: Remove prealloc-only restriction for sleepable bpf programs.
> dccb4a9013a6 bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
> 96da3f7d489d bpf: Remove tracing program restriction on map types
> ee4ed53c5eb6 bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
> 4ab67149f3c6 bpf: Add percpu allocation support to bpf_mem_alloc.
> 8d5a8011b35d bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
> 7c266178aa51 bpf: Adjust low/high watermarks in bpf_mem_cache
> 0fd7c5d43339 bpf: Optimize call_rcu in non-preallocated hash map.
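For reference, the batched call_rcu scheme that the commits above implement (and that setup (2) reverts) works roughly as follows. This is a simplified sketch of the idea, not the verbatim kernel/bpf/memalloc.c code; the struct and helper names below are illustrative:

#include <linux/atomic.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct mem_cache_sketch {
	struct llist_head free_by_rcu;     /* freed, accumulating into a batch */
	struct llist_head waiting_for_gp;  /* batch in flight under RCU */
	atomic_t call_rcu_in_progress;
	struct rcu_head rcu;
};

static void __free_rcu(struct rcu_head *head)
{
	struct mem_cache_sketch *c =
		container_of(head, struct mem_cache_sketch, rcu);
	struct llist_node *n, *tmp;

	/* a full grace period has elapsed for the whole batch */
	llist_for_each_safe(n, tmp, llist_del_all(&c->waiting_for_gp))
		kfree(n);	/* assumes the llist_node sits at offset 0 of the element */
	atomic_set(&c->call_rcu_in_progress, 0);
}

static void do_call_rcu(struct mem_cache_sketch *c)
{
	struct llist_node *n, *tmp;

	if (atomic_xchg(&c->call_rcu_in_progress, 1))
		return;	/* a batch is already waiting for a grace period */

	/* move the accumulated elements out of the way of new frees */
	llist_for_each_safe(n, tmp, llist_del_all(&c->free_by_rcu))
		llist_add(n, &c->waiting_for_gp);
	call_rcu(&c->rcu, __free_rcu);
}

static void unit_free(struct mem_cache_sketch *c, struct llist_node *elem)
{
	llist_add(elem, &c->free_by_rcu);
	do_call_rcu(c);	/* at most one call_rcu in flight per cache */
}

The point of the pattern: the call_rcu rate is bounded by the grace-period rate rather than by the element-free rate, so freeing millions of elements per second issues only a handful of call_rcu invocations.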
>
>     5.17%  map_perf_test  [kernel.vmlinux]  [k] check_preemption_disabled
>     4.53%  map_perf_test  [kernel.vmlinux]  [k] __get_obj_cgroup_from_memcg
>     2.97%  map_perf_test  [kernel.vmlinux]  [k] htab_map_update_elem
>     2.74%  map_perf_test  [kernel.vmlinux]  [k] htab_map_delete_elem
>     2.62%  map_perf_test  [kernel.vmlinux]  [k] kmem_cache_alloc_node
>     2.57%  map_perf_test  [kernel.vmlinux]  [k] memcg_slab_post_alloc_hook
>     2.34%  map_perf_test  [kernel.vmlinux]  [k] jhash
>     2.30%  map_perf_test  [kernel.vmlinux]  [k] entry_SYSCALL_64
>     2.25%  map_perf_test  [kernel.vmlinux]  [k] obj_cgroup_charge
>     2.23%  map_perf_test  [kernel.vmlinux]  [k] alloc_htab_elem
>     2.17%  map_perf_test  [kernel.vmlinux]  [k] memcpy_erms
>     2.17%  map_perf_test  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
>     2.16%  map_perf_test  [kernel.vmlinux]  [k] syscall_return_via_sysret
>     2.14%  map_perf_test  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>     2.13%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_add
>     2.12%  map_perf_test  [kernel.vmlinux]  [k] preempt_count_sub
>     2.00%  map_perf_test  [kernel.vmlinux]  [k] percpu_counter_add_batch
>     1.99%  map_perf_test  [kernel.vmlinux]  [k] alloc_bulk
>     1.97%  map_perf_test  [kernel.vmlinux]  [k] call_rcu
>     1.52%  map_perf_test  [kernel.vmlinux]  [k] mod_objcg_state
>     1.36%  map_perf_test  [kernel.vmlinux]  [k] allocate_slab
>
> In both of these setups, the overhead of call_rcu is about 2% and it is not
> the biggest overhead.
>
> Maybe adding not-immediate-reuse flag support to bpf_mem_alloc is reasonable.
> What do you think?

We've discussed it twice already.
It's not an option due to OOM and performance considerations.
call_rcu doesn't scale to millions of invocations per second.
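For background on the "percpu counter optimization" visible in the profiles above (percpu_counter_add_batch in the second one): the map's element count is kept in a struct percpu_counter, so the hot path touches only a per-cpu variable and the shared count is updated once per batch. A minimal sketch with illustrative names; the real code in kernel/bpf/hashtab.c differs in details such as the exact max_entries comparison:

#include <linux/errno.h>
#include <linux/percpu_counter.h>
#include <linux/types.h>

#define ELEM_BATCH 64			/* illustrative batch size */

struct htab_sketch {
	struct percpu_counter pcount;	/* element count, per-cpu batched */
	u32 max_entries;
};

static int htab_sketch_charge(struct htab_sketch *h)
{
	/*
	 * Approximate check: per-cpu deltas smaller than ELEM_BATCH may
	 * not be folded into the shared count yet, so the map can
	 * overshoot by up to nr_cpus * ELEM_BATCH elements.  That
	 * imprecision is the price of not bouncing one atomic cache
	 * line across all 72 CPUs on every update.
	 */
	if (percpu_counter_read_positive(&h->pcount) >= h->max_entries)
		return -E2BIG;

	percpu_counter_add_batch(&h->pcount, 1, ELEM_BATCH);
	return 0;
}

static void htab_sketch_uncharge(struct htab_sketch *h)
{
	percpu_counter_add_batch(&h->pcount, -1, ELEM_BATCH);
}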