Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 6 Jun 2023 18:39:23 -0700

On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> > On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/6/2023 11:53 AM, Hou Tao wrote:
> >>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>
> >>> Hi,
> >>>
> >>> The implementation of v4 is mainly based on suggestions from Alexi [0].
> >>> There are still pending problems for the current implementation as shown
> >>> in the benchmark result in patch #3, but there was a long time from the
> >>> posting of v3, so posting v4 here for further disscussions and more
> >>> suggestions.
> >>>
> >>> The first problem is the huge memory usage compared with bpf memory
> >>> allocator which does immediate reuse:
> >>>
> >>> htab-mem-benchmark (reuse-after-RCU-GP):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1159.18   | 0.99                | 0.99             |
> >>> | overwrite          | 11.00     | 2288                | 4109             |
> >>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
> >>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
> >>>
> >>> htab-mem-benchmark (immediate-reuse):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1160.66   | 0.99                | 1.00             |
> >>> | overwrite          | 28.52     | 2.46                | 2.73             |
> >>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> >>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
> >>>
> >>> It seems the direct reason is the slow RCU grace period. During
> >>> benchmark, the elapsed time when reuse_rcu() callback is called is about
> >>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
> >>> spin-lock and the irq-work running in the contex of freeing process will
> >>> increase the running overhead of bpf program, the running time of
> >>> getpgid() is increased, the contex switch is slowed down and the RCU
> >>> grace period increases [1], but I am still diggin into it.
> >> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
> >> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
> >> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1165.38   | 0.97                | 1.00             |
> >> | overwrite          | 17.25     | 626.41              | 781.82           |
> >> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
> >> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
> >>
> >> But the memory usage is still large compared with v3 and the elapsed
> >> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
> >> are still two differences:
> >> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
> >> accelerate the reuse of freed objects.
> >> 2) v3 uses kworker instead of irq_work for free procedure.
> >>
> >> For 1), after using kmalloc() in irq_work to allocate multiple inflight
> >> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
> >> but is not enough:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1247.00   | 0.97                | 1.00             |
> >> | overwrite          | 16.56     | 490.18              | 557.17           |
> >> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
> >> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
> >>
> >> So it seems the large memory usage is due to irq_work (reuse_bulk) used
> >> for free procedure. However after increasing the threshold for invoking
> >> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), but there is no
> >> big difference in the memory usage and the delayed time for RCU
> >> callbacks. Perhaps the reason is that although the number of  reuse_bulk
> >> irq_work calls is reduced but the time of alloc_bulk() irq_work calls is
> >> increased because there are no reusable objects.
> > The large memory usage is because the benchmark in patch 2 is abusing it.
> > It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> > and 16k/8 loops for --producers=8.
> > That's 2k memory allocations that have to wait for RCU GP.
> > Of course that's a ton of memory.
> I don't agree that. Because in v3, the benchmark is the same, but both
> the performance and the memory usage are better than v4. Even compared
> with  "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
> much smaller as shown below. If the large memory usage is due to the
> abuse in benchmark, how do you explain the memory usage in v3 ?

There could have been implementation bugs or whatever else.
The main point is the bench test is not realistic and should not be
used to make design decisions.

> The reason I added tail for each list is that there could be thousands
> even ten thousands elements in these lists and there is no need to spend
> CPU time to traversal these list one by one. It maybe a premature
> optimization. So let me remove tails from these list first and I will
> try to add these tails back later and check whether or not there is any
> performance improvement.

There will be thousands of elements only because the bench test is wrong.
It's doing something no real prog would do.

> I have a different view for the benchmark. Firstly htab is not the only
> user of bpf memory allocator, secondly we can't predict the exact
> behavior of bpf programs, so I think to stress bpf memory allocator for
> various kinds of use case is good for its broad usage.

It is not a stress test. It's an abuse.
A stress test would be something that can happen in practice.
Doing thousands map_updates in a forever loop is not something
useful code would do.
For example call_rcu_tasks_trace is not design to be called millions
times a second. It's an anti-pattern and rcu core won't be optimized to do so.
rcu, srcu, rcu_task_trace have different usage patterns.
The programmer has to correctly pick one depending on the use case.
Same with bpf htab. If somebody has a real need to do thousands
updates under rcu lock they should be using preallocated map and deal
with immediate reuse.