Hi,

On 6/6/2023 11:53 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> Hi,
>
> The implementation of v4 is mainly based on suggestions from Alexei [0].
> There are still pending problems in the current implementation, as shown
> in the benchmark results in patch #3, but since a long time has passed
> since the posting of v3, v4 is posted here for further discussion and
> more suggestions.
>
> The first problem is the huge memory usage compared with the bpf memory
> allocator which does immediate reuse:
>
> htab-mem-benchmark (reuse-after-RCU-GP):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1159.18 | 0.99 | 0.99 |
> | overwrite | 11.00 | 2288 | 4109 |
> | batch_add_batch_del| 8.86 | 1558 | 2763 |
> | add_del_on_diff_cpu| 4.74 | 11.39 | 14.77 |
>
> htab-mem-benchmark (immediate-reuse):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1160.66 | 0.99 | 1.00 |
> | overwrite | 28.52 | 2.46 | 2.73 |
> | batch_add_batch_del| 11.50 | 2.69 | 2.95 |
> | add_del_on_diff_cpu| 3.75 | 15.85 | 24.24 |
>
> It seems the direct reason is the slow RCU grace period. During the
> benchmark, the elapsed time before the reuse_rcu() callback is invoked
> is about 100ms or even more (e.g., 2 seconds). I suspect that the global
> per-bpf-ma spin-lock and the irq-work running in the context of the
> freeing process increase the overhead of the bpf program: the running
> time of getpgid() increases, context switches slow down, and the RCU
> grace period grows [1], but I am still digging into it.

For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
(namely bpf_mem_shared_cache) and using a per-cpu reusable list instead
(as v3 did) decreases the memory usage of htab-mem-benchmark a lot:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1165.38 | 0.97 | 1.00 |
| overwrite | 17.25 | 626.41 | 781.82 |
| batch_add_batch_del| 11.51 | 398.56 | 500.29 |
| add_del_on_diff_cpu| 4.21 | 31.06 | 48.84 |

But the memory usage is still large compared with v3, and the elapsed
time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
there are still two differences:

1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
   accelerate the reuse of freed objects.
2) v3 uses a kworker instead of irq_work for the free procedure.

For 1), after using kmalloc() in irq_work to allocate multiple inflight
RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
but not enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
multiple reuse_rcu() callbacks):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1247.00 | 0.97 | 1.00 |
| overwrite | 16.56 | 490.18 | 557.17 |
| batch_add_batch_del| 11.31 | 276.32 | 360.89 |
| add_del_on_diff_cpu| 4.00 | 24.76 | 42.58 |

So it seems the large memory usage is due to the irq_work (reuse_bulk)
used for the free procedure. However, after increasing the threshold for
invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
there is no big difference in the memory usage or in the delay of the
RCU callbacks. Perhaps the reason is that although the number of
reuse_bulk irq_work calls is reduced, the number of alloc_bulk()
irq_work calls increases because there are no reusable objects.
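To make the scheme being measured above concrete, below is a minimal,
hypothetical sketch of a per-cpu reusable list combined with multiple
kmalloc()-ed inflight RCU callbacks. It is not the actual patch: the
struct layout, the free_by_rcu/reuse_ready lists and the GFP flags are
my assumptions; only reuse_rcu(), reuse_bulk and c->high_watermark are
names taken from the discussion above, and the irq_work plumbing is
omitted:

#include <linux/container_of.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Simplified stand-in for the real struct bpf_mem_cache. */
struct bpf_mem_cache {
	struct llist_head free_by_rcu;	/* freed objects waiting for a GP */
	struct llist_head reuse_ready;	/* GP elapsed, ready for reuse */
	int high_watermark;
};

/* One kmalloc()-ed carrier per batch, so that several RCU callbacks
 * can be inflight at the same time (difference 1) above).
 */
struct reuse_batch {
	struct llist_node *objs;
	struct bpf_mem_cache *cache;
	struct rcu_head rcu;
};

/* RCU callback: one grace period has elapsed, hand the objects back
 * to the per-cpu reusable list so unit_alloc() can pick them up.
 */
static void reuse_rcu(struct rcu_head *rcu)
{
	struct reuse_batch *batch = container_of(rcu, struct reuse_batch, rcu);
	struct llist_node *pos, *next;

	for (pos = batch->objs; pos; pos = next) {
		next = pos->next;
		llist_add(pos, &batch->cache->reuse_ready);
	}
	kfree(batch);
}

/* Body of the reuse_bulk irq_work, run once the number of freed
 * objects crosses the watermark (e.g. c->high_watermark).
 */
static void reuse_bulk(struct bpf_mem_cache *c)
{
	struct reuse_batch *batch;

	batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
	if (!batch)
		return;	/* fallback path omitted in this sketch */
	batch->objs = llist_del_all(&c->free_by_rcu);
	batch->cache = c;
	call_rcu(&batch->rcu, reuse_rcu);
}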
>
> Another problem is the performance degradation compared with immediate
> reuse, and the output of perf report shows that the per-bpf-ma
> spin-lock is a top-one hotspot:
>
> map_perf_test (reuse-after-RCU-GP)
> 0:hash_map_perf kmalloc 194677 events per sec
>
> map_perf_test (immediate reuse)
> 2:hash_map_perf kmalloc 384527 events per sec
>
> The purpose of introducing the per-bpf-ma reusable list is to handle
> the case in which allocation and free are done on different CPUs
> (e.g., add_del_on_diff_cpu), while a per-cpu reusable list is enough
> for the overwrite & batch_add_batch_del cases. So maybe we could
> implement a hybrid of a global reusable list and per-cpu reusable
> lists, and switch between these two kinds of list according to the
> history of allocation and free frequency.
>
> As usual, suggestions and comments are always welcome.
>
> [0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
> [1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@xxxxxxxxxxxxxxx
>
> Change Log:
> v4:
>  * no kworker (Alexei)
>  * Use a global reusable list in bpf memory allocator (Alexei)
>  * Remove the BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
>    by default in bpf memory allocator (Alexei)
>  * Add benchmark results from map_perf_test (Alexei)
>
> v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@xxxxxxxxxxxxxxx/
>  * Add the BPF_MA_FREE_AFTER_RCU_GP flavor to the bpf memory allocator
>  * Update htab memory benchmark
>  * Move the benchmark patch to the last patch
>  * Remove the array and the useless bpf_map_lookup_elem(&array, ...) in
>    bpf programs
>  * Add synchronization between the addition CPU and the deletion CPU
>    for the add_del_on_diff_cpu case to prevent unnecessary loops
>  * Add the benchmark result for "extra call_rcu + bpf ma"
>
> v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@xxxxxxxxxxxxxxx/
>  * Add a benchmark for the bpf memory allocator to compare between
>    different flavors of the bpf memory allocator.
>  * Implement BPF_MA_REUSE_AFTER_RCU_GP for the bpf memory allocator.
>
> v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/
>
> Hou Tao (3):
>   bpf: Factor out a common helper free_all()
>   selftests/bpf: Add benchmark for bpf memory allocator
>   bpf: Only reuse after one RCU GP in bpf memory allocator
>
>  include/linux/bpf_mem_alloc.h                 |   4 +
>  kernel/bpf/memalloc.c                         | 385 ++++++++++++------
>  tools/testing/selftests/bpf/Makefile          |   3 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
>  .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
>  .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
>  7 files changed, 809 insertions(+), 116 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>
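One bottom note on the hybrid reusable-list idea quoted above: a rough,
purely illustrative sketch of what the switching heuristic could look
like follows. Everything in it (the struct layout, the counters, the
factor 2) is made up for illustration; only bpf_mem_shared_cache is a
name from this discussion, used here as an opaque type:

#include <linux/atomic.h>
#include <linux/llist.h>

struct bpf_mem_shared_cache;	/* the global, spin-lock protected list */

/* Hypothetical per-cpu cache that can fall back to the per-bpf-ma
 * shared reusable list when frees keep landing on a different CPU
 * than the allocations (the add_del_on_diff_cpu pattern).
 */
struct hybrid_cache {
	struct llist_head pcpu_reuse;		/* lock-free, same-CPU reuse */
	struct bpf_mem_shared_cache *shared;	/* shared fallback */
	atomic_t local_free_cnt;		/* frees on the allocating CPU */
	atomic_t remote_free_cnt;		/* frees on another CPU */
};

/* Heuristic: route freed objects to the shared list only when
 * cross-CPU frees dominate the recent history, so overwrite &
 * batch_add_batch_del never touch the global spin-lock; the factor 2
 * is an arbitrary placeholder.
 */
static bool use_shared_list(struct hybrid_cache *c)
{
	return atomic_read(&c->remote_free_cnt) >
	       2 * atomic_read(&c->local_free_cnt);
}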