From: Hou Tao <houtao1@xxxxxxxxxx>

Hi,

The implementation of v4 is mainly based on suggestions from Alexei [0]. There are still pending problems in the current implementation, as shown by the benchmark results in patch #3, but a long time has passed since the posting of v3, so v4 is posted here for further discussion and more suggestions.

The first problem is the huge memory usage compared with the bpf memory allocator which does immediate reuse:

htab-mem-benchmark (reuse-after-RCU-GP):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1159.18   | 0.99                | 0.99             |
| overwrite          | 11.00     | 2288                | 4109             |
| batch_add_batch_del| 8.86      | 1558                | 2763             |
| add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |

htab-mem-benchmark (immediate-reuse):
| name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| --                 | --        | --                  | --               |
| no_op              | 1160.66   | 0.99                | 1.00             |
| overwrite          | 28.52     | 2.46                | 2.73             |
| batch_add_batch_del| 11.50     | 2.69                | 2.95             |
| add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |

It seems the direct reason is the slow RCU grace period. During the benchmark, the elapsed time before the reuse_rcu() callback is invoked is about 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma spin-lock and the irq-work running in the context of the freeing process increase the overhead of the bpf program: the running time of getpgid() increases, context switching is slowed down, and the RCU grace period grows [1], but I am still digging into it.
Another problem is the performance degradation compared with immediate reuse, and the output of perf report shows that the per-bpf-ma spin-lock is a top-one hotspot:

map_perf_test (reuse-after-RCU-GP)
0:hash_map_perf kmalloc 194677 events per sec

map_perf_test (immediate reuse)
2:hash_map_perf kmalloc 384527 events per sec

The purpose of introducing the per-bpf-ma reusable list is to handle the case in which allocation and free are done on different CPUs (e.g., add_del_on_diff_cpu), while a per-cpu reuse list would be enough for the overwrite and batch_add_batch_del cases. So maybe we could implement a hybrid of a global reusable list and per-cpu reusable lists, and switch between these two kinds of list according to the history of allocation and free frequencies.

As usual, suggestions and comments are always welcome.

[0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
[1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@xxxxxxxxxxxxxxx

Change Log:
v4:
 * no kworker (Alexei)
 * Use a global reusable list in the bpf memory allocator (Alexei)
 * Remove the BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-RCU-GP by default in the bpf memory allocator (Alexei)
 * Add benchmark results from map_perf_test (Alexei)

v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@xxxxxxxxxxxxxxx/
 * Add the BPF_MA_FREE_AFTER_RCU_GP bpf memory allocator
 * Update the htab memory benchmark
 * Move the benchmark patch to the last patch
 * Remove the array and the useless bpf_map_lookup_elem(&array, ...) in bpf programs
 * Add synchronization between the addition CPU and the deletion CPU for the add_del_on_diff_cpu case to prevent unnecessary loops
 * Add the benchmark result for "extra call_rcu + bpf ma"

v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@xxxxxxxxxxxxxxx/
 * Add a benchmark for the bpf memory allocator to compare between different flavors of the bpf memory allocator
 * Implement BPF_MA_REUSE_AFTER_RCU_GP for the bpf memory allocator
v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/

Hou Tao (3):
  bpf: Factor out a common helper free_all()
  selftests/bpf: Add benchmark for bpf memory allocator
  bpf: Only reuse after one RCU GP in bpf memory allocator

 include/linux/bpf_mem_alloc.h                 |   4 +
 kernel/bpf/memalloc.c                         | 385 ++++++++++++------
 tools/testing/selftests/bpf/Makefile          |   3 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
 .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
 .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
 7 files changed, 809 insertions(+), 116 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
 create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c

-- 
2.29.2