Hi,

On 6/6/2023 11:53 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> Hi,
>
> The implementation of v4 is mainly based on suggestions from Alexei [0].
> There are still pending problems in the current implementation, as shown
> in the benchmark results in patch #3, but since a long time has passed
> since the posting of v3, v4 is posted here for further discussion and
> more suggestions.
>
> The first problem is the huge memory usage compared with the bpf memory
> allocator which does immediate reuse:
>
> htab-mem-benchmark (reuse-after-RCU-GP):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1159.18 | 0.99 | 0.99 |
> | overwrite | 11.00 | 2288 | 4109 |
> | batch_add_batch_del| 8.86 | 1558 | 2763 |
> | add_del_on_diff_cpu| 4.74 | 11.39 | 14.77 |
>
> htab-mem-benchmark (immediate-reuse):
> | name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> | -- | -- | -- | -- |
> | no_op | 1160.66 | 0.99 | 1.00 |
> | overwrite | 28.52 | 2.46 | 2.73 |
> | batch_add_batch_del| 11.50 | 2.69 | 2.95 |
> | add_del_on_diff_cpu| 3.75 | 15.85 | 24.24 |
>
> It seems the direct reason is the slow RCU grace period. During the
> benchmark, the elapsed time before the reuse_rcu() callback is invoked
> is about 100ms or even more (e.g., 2 seconds). I suspect that the global
> per-bpf-ma spin-lock and the irq-work running in the context of the
> freeing process increase the overhead of the bpf program: the running
> time of getpgid() increases, context switches slow down, and the RCU
> grace period grows [1], but I am still digging into it.

For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
(namely bpf_mem_shared_cache) and using a per-cpu reusable list instead
(as v3 did) decreases the memory usage of htab-mem-benchmark a lot:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1165.38 | 0.97 | 1.00 |
| overwrite | 17.25 | 626.41 | 781.82 |
| batch_add_batch_del| 11.51 | 398.56 | 500.29 |
| add_del_on_diff_cpu| 4.21 | 31.06 | 48.84 |

But the memory usage is still large compared with v3, and the elapsed
time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
there are still two differences:

1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
   accelerate the reuse of freed objects.
2) v3 uses a kworker instead of irq_work for the free procedure.

For 1), after using kmalloc() in irq_work to allocate multiple inflight
RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
but not enough:

htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
multiple reuse_rcu() callbacks):
| name | loop (k/s)| average memory (MiB)| peak memory (MiB)|
| -- | -- | -- | -- |
| no_op | 1247.00 | 0.97 | 1.00 |
| overwrite | 16.56 | 490.18 | 557.17 |
| batch_add_batch_del| 11.31 | 276.32 | 360.89 |
| add_del_on_diff_cpu| 4.00 | 24.76 | 42.58 |

So it seems the large memory usage is due to the irq_work (reuse_bulk)
used for the free procedure. However, after increasing the threshold for
invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
there is no big difference in the memory usage or in the delay of the
RCU callbacks. Perhaps the reason is that although the number of
reuse_bulk irq_work calls is reduced, the number of alloc_bulk()
irq_work calls increases because there are no reusable objects.
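To make the scheme being measured above concrete, below is a minimal,
hypothetical sketch of a per-cpu reusable list combined with multiple
kmalloc()-ed inflight RCU callbacks. It is not the actual patch: the
struct layout, the free_by_rcu/reuse_ready lists and the GFP flags are
my assumptions; only reuse_rcu(), reuse_bulk and c->high_watermark are
names taken from the discussion above, and the irq_work plumbing is
omitted:

#include <linux/container_of.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Simplified stand-in for the real struct bpf_mem_cache. */
struct bpf_mem_cache {
	struct llist_head free_by_rcu;	/* freed objects waiting for a GP */
	struct llist_head reuse_ready;	/* GP elapsed, ready for reuse */
	int high_watermark;
};

/* One kmalloc()-ed carrier per batch, so that several RCU callbacks
 * can be inflight at the same time (difference 1) above).
 */
struct reuse_batch {
	struct llist_node *objs;
	struct bpf_mem_cache *cache;
	struct rcu_head rcu;
};

/* RCU callback: one grace period has elapsed, hand the objects back
 * to the per-cpu reusable list so unit_alloc() can pick them up.
 */
static void reuse_rcu(struct rcu_head *rcu)
{
	struct reuse_batch *batch = container_of(rcu, struct reuse_batch, rcu);
	struct llist_node *pos, *next;

	for (pos = batch->objs; pos; pos = next) {
		next = pos->next;
		llist_add(pos, &batch->cache->reuse_ready);
	}
	kfree(batch);
}

/* Body of the reuse_bulk irq_work, run once the number of freed
 * objects crosses the watermark (e.g. c->high_watermark).
 */
static void reuse_bulk(struct bpf_mem_cache *c)
{
	struct reuse_batch *batch;

	batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
	if (!batch)
		return;	/* fallback path omitted in this sketch */
	batch->objs = llist_del_all(&c->free_by_rcu);
	batch->cache = c;
	call_rcu(&batch->rcu, reuse_rcu);
}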
>
> Another problem is the performance degradation compared with immediate
> reuse, and the output of perf report shows that the per-bpf-ma
> spin-lock is a top-one hotspot:
>
> map_perf_test (reuse-after-RCU-GP)
> 0:hash_map_perf kmalloc 194677 events per sec
>
> map_perf_test (immediate reuse)
> 2:hash_map_perf kmalloc 384527 events per sec
>
> The purpose of introducing the per-bpf-ma reusable list is to handle
> the case in which allocation and free are done on different CPUs
> (e.g., add_del_on_diff_cpu), while a per-cpu reusable list is enough
> for the overwrite & batch_add_batch_del cases. So maybe we could
> implement a hybrid of a global reusable list and per-cpu reusable
> lists, and switch between these two kinds of list according to the
> history of allocation and free frequency.
>
> As usual, suggestions and comments are always welcome.
>
> [0]: https://lore.kernel.org/bpf/20230503184841.6mmvdusr3rxiabmu@MacBook-Pro-6.local
> [1]: https://lore.kernel.org/bpf/1b64fc4e-d92e-de2f-4895-2e0c36427425@xxxxxxxxxxxxxxx
>
> Change Log:
> v4:
>  * no kworker (Alexei)
>  * Use a global reusable list in bpf memory allocator (Alexei)
>  * Remove the BPF_MA_FREE_AFTER_RCU_GP flag and do reuse-after-rcu-gp
>    by default in bpf memory allocator (Alexei)
>  * Add benchmark results from map_perf_test (Alexei)
>
> v3: https://lore.kernel.org/bpf/20230429101215.111262-1-houtao@xxxxxxxxxxxxxxx/
>  * Add the BPF_MA_FREE_AFTER_RCU_GP flavor to the bpf memory allocator
>  * Update htab memory benchmark
>  * Move the benchmark patch to the last patch
>  * Remove the array and the useless bpf_map_lookup_elem(&array, ...) in
>    bpf programs
>  * Add synchronization between the addition CPU and the deletion CPU
>    for the add_del_on_diff_cpu case to prevent unnecessary loops
>  * Add the benchmark result for "extra call_rcu + bpf ma"
>
> v2: https://lore.kernel.org/bpf/20230408141846.1878768-1-houtao@xxxxxxxxxxxxxxx/
>  * Add a benchmark for the bpf memory allocator to compare between
>    different flavors of the bpf memory allocator.
>  * Implement BPF_MA_REUSE_AFTER_RCU_GP for the bpf memory allocator.
>
> v1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/
>
> Hou Tao (3):
>   bpf: Factor out a common helper free_all()
>   selftests/bpf: Add benchmark for bpf memory allocator
>   bpf: Only reuse after one RCU GP in bpf memory allocator
>
>  include/linux/bpf_mem_alloc.h                 |   4 +
>  kernel/bpf/memalloc.c                         | 385 ++++++++++++------
>  tools/testing/selftests/bpf/Makefile          |   3 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_htab_mem.c     | 352 ++++++++++++++++
>  .../bpf/benchs/run_bench_htab_mem.sh          |  42 ++
>  .../selftests/bpf/progs/htab_mem_bench.c      | 135 ++++++
>  7 files changed, 809 insertions(+), 116 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_htab_mem.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_htab_mem.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/htab_mem_bench.c
>
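One bottom note on the hybrid reusable-list idea quoted above: a rough,
purely illustrative sketch of what the switching heuristic could look
like follows. Everything in it (the struct layout, the counters, the
factor 2) is made up for illustration; only bpf_mem_shared_cache is a
name from this discussion, used here as an opaque type:

#include <linux/atomic.h>
#include <linux/llist.h>

struct bpf_mem_shared_cache;	/* the global, spin-lock protected list */

/* Hypothetical per-cpu cache that can fall back to the per-bpf-ma
 * shared reusable list when frees keep landing on a different CPU
 * than the allocations (the add_del_on_diff_cpu pattern).
 */
struct hybrid_cache {
	struct llist_head pcpu_reuse;		/* lock-free, same-CPU reuse */
	struct bpf_mem_shared_cache *shared;	/* shared fallback */
	atomic_t local_free_cnt;		/* frees on the allocating CPU */
	atomic_t remote_free_cnt;		/* frees on another CPU */
};

/* Heuristic: route freed objects to the shared list only when
 * cross-CPU frees dominate the recent history, so overwrite &
 * batch_add_batch_del never touch the global spin-lock; the factor 2
 * is an arbitrary placeholder.
 */
static bool use_shared_list(struct hybrid_cache *c)
{
	return atomic_read(&c->remote_free_cnt) >
	       2 * atomic_read(&c->local_free_cnt);
}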