Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator

Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexei [0].
>>> There are still pending problems in the current implementation, as shown
>>> in the benchmark results in patch #3, but a long time has passed since
>>> v3 was posted, so I am posting v4 here for further discussion and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with the bpf
>>> memory allocator that does immediate reuse:
>>>
>>> htab-mem-benchmark (reuse-after-RCU-GP):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1159.18   | 0.99                | 0.99             |
>>> | overwrite          | 11.00     | 2288                | 4109             |
>>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
>>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
>>>
>>> htab-mem-benchmark (immediate-reuse):
>>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>>> | --                 | --        | --                  | --               |
>>> | no_op              | 1160.66   | 0.99                | 1.00             |
>>> | overwrite          | 28.52     | 2.46                | 2.73             |
>>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
>>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
>>>
>>> It seems the direct reason is the slow RCU grace period. During the
>>> benchmark, the elapsed time before the reuse_rcu() callback is invoked
>>> is about 100ms or even more (e.g., 2 seconds). I suspect that the
>>> global per-bpf-ma spin-lock and the irq_work running in the context of
>>> the freeing process increase the running overhead of the bpf program:
>>> the running time of getpgid() increases, context switches slow down,
>>> and the RCU grace period lengthens [1], but I am still digging into it.
>> For the reuse-after-RCU-GP flavor, by removing the per-bpf-ma reusable
>> list (namely bpf_mem_shared_cache) and using a per-cpu reusable list
>> (as v3 did) instead, the memory usage of htab-mem-benchmark decreases
>> a lot:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1165.38   | 0.97                | 1.00             |
>> | overwrite          | 17.25     | 626.41              | 781.82           |
>> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
>> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
>>
>> But the memory usage is still large compared with v3, and the elapsed
>> time before the reuse_rcu() callback runs is about 90~200ms. Compared
>> with v3, there are still two differences:
>> 1) v3 uses kmalloc() to allocate multiple in-flight RCU callbacks to
>> accelerate the reuse of freed objects.
>> 2) v3 uses a kworker instead of an irq_work for the free procedure.
>>
>> For 1), after using kmalloc() in the irq_work to allocate multiple
>> in-flight RCU callbacks (namely reuse_rcu()), the memory usage
>> decreases a bit, but not enough:
>>
>> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
>> | --                 | --        | --                  | --               |
>> | no_op              | 1247.00   | 0.97                | 1.00             |
>> | overwrite          | 16.56     | 490.18              | 557.17           |
>> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
>> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
>>
>> So it seems the large memory usage is due to the irq_work (reuse_bulk)
>> used for the free procedure. However, after increasing the threshold
>> for invoking the reuse_bulk irq_work (e.g., using 10 *
>> c->high_watermark), there is no big difference in the memory usage or
>> the delay of the RCU callbacks. Perhaps the reason is that although
>> the number of reuse_bulk irq_work invocations is reduced, the number
>> of alloc_bulk() irq_work invocations increases because there are no
>> reusable objects.
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
I don't agree with that. In v3 the benchmark is exactly the same, yet
both the performance and the memory usage are better than in v4. Even
compared with the "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu
reusable list + multiple reuse_rcu() callbacks)" numbers above, the
memory usage in v3 is still much smaller, as shown below. If the large
memory usage were due to abuse in the benchmark, how do you explain
the memory usage in v3?

htab-mem-benchmark (reuse-after-RCU-GP, v3):
| name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
| --                  | --         | --                   | --                |
| no_op               | 1199.16    | 0.97                 | 0.99              |
| overwrite           | 16.37      | 24.01                | 31.76             |
| batch_add_batch_del | 9.61       | 16.71                | 19.95             |
| add_del_on_diff_cpu | 3.62       | 22.93                | 37.02             |
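
To make the scheme under discussion concrete, below is a minimal sketch
of the reuse-after-RCU-GP free path: freed objects are parked on a
per-cpu list and moved to a reusable list only after call_rcu() reports
a grace period. All names here (pcpu_reuse, free_by_rcu, reuse_ready,
unit_free) are illustrative assumptions, not code from any of the
patches, and locking/atomics are omitted for brevity:

#include <linux/llist.h>
#include <linux/rcupdate.h>

struct pcpu_reuse {
        struct llist_head free_by_rcu;  /* freed, waiting for a GP */
        struct llist_head reuse_ready;  /* GP elapsed, safe to reuse */
        struct llist_node *batch;       /* snapshot handed to call_rcu() */
        struct rcu_head rcu;
        bool rcu_in_progress;
};

static void reuse_rcu(struct rcu_head *rcu)
{
        struct pcpu_reuse *c = container_of(rcu, struct pcpu_reuse, rcu);
        struct llist_node *pos, *t;

        /* A grace period has elapsed: every element in the batch can
         * now be handed back for immediate reuse.
         */
        llist_for_each_safe(pos, t, c->batch)
                llist_add(pos, &c->reuse_ready);
        c->rcu_in_progress = false;
}

static void unit_free(struct pcpu_reuse *c, struct llist_node *obj)
{
        llist_add(obj, &c->free_by_rcu);
        if (c->rcu_in_progress)
                return;
        /* Only one callback in flight in this sketch; v3 allowed
         * multiple in-flight callbacks, which is one of the two
         * differences discussed above.
         */
        c->batch = llist_del_all(&c->free_by_rcu);
        c->rcu_in_progress = true;
        call_rcu(&c->rcu, reuse_rcu);
}

The v3/v4 difference debated above is essentially how many such
callbacks may be in flight at once and whether the lists are per-cpu
or shared per-bpf-ma.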

>
> As far as the implementation in patch 3 goes, please respin it asap and
> remove the *_tail optimization. It makes the code hard to read and
> doesn't buy us anything.
The reason I added a tail for each list is that there could be
thousands or even tens of thousands of elements in these lists, and
there is no need to spend CPU time traversing them one by one. It may
be a premature optimization, so let me remove the tails from these
lists first; I will try to add them back later and check whether there
is any performance improvement (see the sketch after this paragraph
for the two approaches).
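
For reference, the two ways of moving a batch from one llist to another
look roughly like this (a sketch with made-up helper names, move_by_walk
and move_by_tail):

#include <linux/llist.h>

/* The suggested approach: walk the batch with llist_for_each_safe()
 * and re-add each element. O(n), and the order of the elements is
 * reversed, which doesn't matter for a free list.
 */
static void move_by_walk(struct llist_head *dst, struct llist_node *first)
{
        struct llist_node *pos, *t;

        llist_for_each_safe(pos, t, first)
                llist_add(pos, dst);
}

/* The *_tail variant: with a cached tail, llist_add_batch() splices
 * the whole chain in O(1), at the cost of maintaining the tail
 * pointer on every add.
 */
static void move_by_tail(struct llist_head *dst, struct llist_node *first,
                         struct llist_node *tail)
{
        llist_add_batch(first, tail, dst);
}
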
> Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse; the output of perf report shows that the per-bpf-ma spin-lock is
>>> a top hotspot:
> That's not what I see.
> Hot spin_lock is in generic htab code, not in bpf_ma.
> I still believe per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>         for (i = 0; i < 10; i++) {
>                 bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is not realistic.
> 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
I have a different view on the benchmark. Firstly, htab is not the
only user of the bpf memory allocator; secondly, we can't predict the
exact behavior of bpf programs, so I think stressing the bpf memory
allocator with various kinds of use cases is good for its broad usage.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slowdown with map_perf_test 4 8.
> I see the same perf with/without patch 3.
I will double check my local setup and test results.
>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.
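
For concreteness, my understanding of the bounded update pattern being
requested for the patch 2 benchmark is roughly the following (the map
definition, section name, program name, and sizes are illustrative
assumptions, not the actual benchmark code):

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u32);
        __type(value, __u64);
} hash_map SEC(".maps");

SEC("tp/syscalls/sys_enter_getpgid")
int bench_overwrite(void *ctx)
{
        __u32 key = bpf_get_smp_processor_id();
        __u64 init_val = 1;
        int i;

        /* At most 10 updates per program invocation, mirroring
         * map_perf_test, instead of one bpf_loop() over thousands
         * of elements.
         */
        for (i = 0; i < 10; i++) {
                __u32 k = key + i;

                bpf_map_update_elem(&hash_map, &k, &init_val, BPF_ANY);
        }
        return 0;
}

char _license[] SEC("license") = "GPL";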