On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> Hi,
>
> On 6/6/2023 11:53 AM, Hou Tao wrote:
> > From: Hou Tao <houtao1@xxxxxxxxxx>
> >
> > Hi,
> >
> > The implementation of v4 is mainly based on suggestions from Alexei [0].
> > There are still pending problems in the current implementation, as shown
> > in the benchmark results in patch #3, but a long time has passed since
> > the posting of v3, so v4 is posted here for further discussion and more
> > suggestions.
> >
> > The first problem is the huge memory usage compared with the bpf memory
> > allocator which does immediate reuse:
> >
> > htab-mem-benchmark (reuse-after-RCU-GP):
> > | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> > | --                  | --         | --                   | --                |
> > | no_op               | 1159.18    | 0.99                 | 0.99              |
> > | overwrite           | 11.00      | 2288                 | 4109              |
> > | batch_add_batch_del | 8.86       | 1558                 | 2763              |
> > | add_del_on_diff_cpu | 4.74       | 11.39                | 14.77             |
> >
> > htab-mem-benchmark (immediate-reuse):
> > | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> > | --                  | --         | --                   | --                |
> > | no_op               | 1160.66    | 0.99                 | 1.00              |
> > | overwrite           | 28.52      | 2.46                 | 2.73              |
> > | batch_add_batch_del | 11.50      | 2.69                 | 2.95              |
> > | add_del_on_diff_cpu | 3.75       | 15.85                | 24.24             |
> >
> > It seems the direct reason is the slow RCU grace period. During the
> > benchmark, the elapsed time before the reuse_rcu() callback is invoked
> > is about 100ms or even more (e.g., 2 seconds). I suspect the global
> > per-bpf-ma spin-lock and the irq-work running in the context of the
> > freeing process increase the overhead of the bpf program: the running
> > time of getpgid() increases, context switching slows down, and the RCU
> > grace period grows longer [1], but I am still digging into it.
> For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
> (namely bpf_mem_shared_cache) and using a per-cpu reusable list (as v3
> did) instead decreases the memory usage of htab-mem-benchmark a lot:
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1165.38    | 0.97                 | 1.00              |
> | overwrite           | 17.25      | 626.41               | 781.82            |
> | batch_add_batch_del | 11.51      | 398.56               | 500.29            |
> | add_del_on_diff_cpu | 4.21      | 31.06                | 48.84             |
>
> But the memory usage is still large compared with v3, and the elapsed
> time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
> there are still two differences:
> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>    accelerate the reuse of freed objects.
> 2) v3 uses a kworker instead of irq_work for the free procedure.
>
> For 1), after using kmalloc() in the irq_work to allocate multiple
> inflight RCU callbacks (namely reuse_rcu()), the memory usage decreases
> a bit, but not enough:
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1247.00    | 0.97                 | 1.00              |
> | overwrite           | 16.56      | 490.18               | 557.17            |
> | batch_add_batch_del | 11.31      | 276.32               | 360.89            |
> | add_del_on_diff_cpu | 4.00       | 24.76                | 42.58             |
>
> So it seems the large memory usage is due to the irq_work (reuse_bulk)
> used for the free procedure. However, after increasing the threshold for
> invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
> there is no big difference in the memory usage or in the delay of the
> RCU callbacks. Perhaps the reason is that although the number of
> reuse_bulk irq_work invocations is reduced, the number of alloc_bulk()
> irq_work invocations increases because there are no reusable objects.

The large memory usage is because the benchmark in patch 2 is abusing it.
It's doing one bpf_loop() over 16k elements (in the case of 1 producer)
and 16k/8 loops for --producers=8. That's 2k memory allocations that have
to wait for an RCU GP. Of course that's a ton of memory.

As for the implementation in patch 3, please respin it asap and remove
the *_tail optimization. It makes the code hard to read and doesn't buy
us anything. Other than that the algorithm looks fine.
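For anyone reading along without the patches in front of them, the
reuse-after-RCU-GP flow being discussed is roughly the following. This is
only an illustrative sketch with made-up names (pcpu_reuse_cache,
free_by_rcu, reuse_ready, move names and field layout), not the code from
the series:

        #include <linux/kernel.h>
        #include <linux/llist.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct pcpu_reuse_cache {
                struct llist_head free_by_rcu;  /* freed, still waiting for an RCU GP */
                struct llist_head reuse_ready;  /* GP elapsed, safe to hand out again */
                int free_cnt;
                int high_watermark;
        };

        struct reuse_batch {
                struct rcu_head rcu;
                struct llist_node *objs;
                struct pcpu_reuse_cache *cache;
        };

        static void reuse_rcu(struct rcu_head *rcu)
        {
                struct reuse_batch *batch = container_of(rcu, struct reuse_batch, rcu);
                struct llist_node *obj, *next;

                /* A grace period has passed since the objects were freed, so
                 * no RCU reader can still hold a pointer to them: make them
                 * available to the allocation path again.
                 */
                llist_for_each_safe(obj, next, batch->objs)
                        llist_add(obj, &batch->cache->reuse_ready);
                kfree(batch);
        }

        /* Invoked from irq_work once free_cnt crosses the watermark. */
        static void reuse_bulk(struct pcpu_reuse_cache *c)
        {
                struct reuse_batch *batch;

                /* kmalloc()'ing a batch allows multiple RCU callbacks to be
                 * in flight at once instead of a single embedded rcu_head.
                 */
                batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
                if (!batch)
                        return;
                batch->objs = llist_del_all(&c->free_by_rcu);
                batch->cache = c;
                call_rcu(&batch->rcu, reuse_rcu);
        }

The point of the kmalloc()'ed batches is that several grace periods can be
in flight at once, so freed objects do not pile up behind a single
rcu_head while waiting for reuse.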
> > Another problem is the performance degradation compared with immediate
> > reuse, and the output from perf report shows that the per-bpf-ma
> > spin-lock is a top-one hotspot:

That's not what I see. The hot spin_lock is in the generic htab code, not
in bpf_ma. I still believe the per-bpf-ma spin-lock is fine. The bench in
patch 2 is measuring something that no real bpf prog cares about.

See what map_perf_test is doing:

        for (i = 0; i < 10; i++) {
                bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
        }

Even 10 map updates for the same map in a single bpf prog invocation is
not realistic. 16k/8 is beyond any normal scenario. There is no reason to
optimize bpf_ma for the case of htab abuse.

> > map_perf_test (reuse-after-RCU-GP)
> > 0:hash_map_perf kmalloc 194677 events per sec
> >
> > map_perf_test (immediate reuse)
> > 2:hash_map_perf kmalloc 384527 events per sec

For some reason I cannot reproduce the slowdown with map_perf_test 4 8.
I see the same perf with and without patch 3.

I've applied patch 1.

Please respin with patch 2 doing no more than 10 map_updates under rcu
lock, and remove the *_tail optimization from patch 3. Just do
llist_for_each_safe() when you move elements from one list to another.
And let's brainstorm further. Please do not delay.
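To spell out that last suggestion, moving everything from one llist to
another without keeping a tail pointer can be as simple as the sketch
below. The function name and list roles are made up here purely to
illustrate the llist_for_each_safe() pattern:

        /* Illustrative only: move all nodes from @src to @dst. Walking the
         * list is O(n) and does not preserve order, which is fine for a
         * free list, and it avoids the tail-pointer bookkeeping that made
         * the code hard to read.
         */
        static void move_free_list(struct llist_head *src, struct llist_head *dst)
        {
                struct llist_node *node, *next, *first;

                first = llist_del_all(src);
                llist_for_each_safe(node, next, first)
                        llist_add(node, dst);
        }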