Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexei [0].
>>> There are still pending problems in the current implementation, as shown
>>> by the benchmark results in patch #3, but a long time has passed since
>>> the posting of v3, so v4 is posted here for further discussion and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with the bpf memory
>>> allocator which does immediate reuse:
SNIP
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
>
> As far as the implementation in patch 3, please respin it asap and remove
> the *_tail optimization. It makes the code hard to read and doesn't buy
> us anything. Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse, and the output from perf report shows that the per-bpf-ma
>>> spin-lock is a top-one hotspot:
> That's not what I see.
> The hot spin_lock is in the generic htab code, not in the ma.
> I still believe the per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>
>	for (i = 0; i < 10; i++) {
>		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is
> not realistic. 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.

As noted in the commit message, the command line used for the test is
"./map_perf_test 4 8 16384", because the default max_entries is only 1000.
With the default max_entries, use_percpu_counter will be false whenever
the number of CPUs is greater than 15.

I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
test. For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
there is an obvious performance degradation.

Before reuse-after-rcu-gp:

[root@hello bpf]# ./map_perf_test 4 8
5:hash_map_perf kmalloc 358498 events per sec
4:hash_map_perf kmalloc 306351 events per sec
1:hash_map_perf kmalloc 299175 events per sec
3:hash_map_perf kmalloc 297689 events per sec
6:hash_map_perf kmalloc 299839 events per sec
2:hash_map_perf kmalloc 292286 events per sec
7:hash_map_perf kmalloc 278138 events per sec
0:hash_map_perf kmalloc 265031 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
2:hash_map_perf kmalloc 359201 events per sec
6:hash_map_perf kmalloc 302475 events per sec
3:hash_map_perf kmalloc 298583 events per sec
7:hash_map_perf kmalloc 301594 events per sec
0:hash_map_perf kmalloc 295210 events per sec
4:hash_map_perf kmalloc 230496 events per sec
5:hash_map_perf kmalloc 163530 events per sec
1:hash_map_perf kmalloc 153520 events per sec

After reuse-after-rcu-gp:

[root@hello bpf]# ./map_perf_test 4 8
1:hash_map_perf kmalloc 203252 events per sec
2:hash_map_perf kmalloc 181777 events per sec
6:hash_map_perf kmalloc 183467 events per sec
4:hash_map_perf kmalloc 182590 events per sec
0:hash_map_perf kmalloc 180840 events per sec
3:hash_map_perf kmalloc 179875 events per sec
5:hash_map_perf kmalloc 152250 events per sec
7:hash_map_perf kmalloc 137500 events per sec
[root@hello bpf]# ./map_perf_test 4 8 16384
4:hash_map_perf kmalloc 203983 events per sec
5:hash_map_perf kmalloc 197902 events per sec
2:hash_map_perf kmalloc 184693 events per sec
3:hash_map_perf kmalloc 185341 events per sec
1:hash_map_perf kmalloc 183064 events per sec
7:hash_map_perf kmalloc 181148 events per sec
0:hash_map_perf kmalloc 178520 events per sec
6:hash_map_perf kmalloc 179340 events per sec

I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
performance for "./map_perf_test 4 8" is similar, but there is an
obvious performance difference for "./map_perf_test 4 8 16384".

Before reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
1:hash_map_perf kmalloc 41711 events per sec
3:hash_map_perf kmalloc 41352 events per sec
4:hash_map_perf kmalloc 41352 events per sec
0:hash_map_perf kmalloc 41008 events per sec
7:hash_map_perf kmalloc 41086 events per sec
5:hash_map_perf kmalloc 41038 events per sec
2:hash_map_perf kmalloc 40971 events per sec
6:hash_map_perf kmalloc 41008 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 388088 events per sec
7:hash_map_perf kmalloc 391843 events per sec
6:hash_map_perf kmalloc 387901 events per sec
3:hash_map_perf kmalloc 356096 events per sec
4:hash_map_perf kmalloc 356836 events per sec
2:hash_map_perf kmalloc 349728 events per sec
0:hash_map_perf kmalloc 345792 events per sec
5:hash_map_perf kmalloc 346742 events per sec

After reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
4:hash_map_perf kmalloc 42667 events per sec
1:hash_map_perf kmalloc 42206 events per sec
5:hash_map_perf kmalloc 42264 events per sec
6:hash_map_perf kmalloc 42196 events per sec
7:hash_map_perf kmalloc 42142 events per sec
2:hash_map_perf kmalloc 42028 events per sec
0:hash_map_perf kmalloc 41933 events per sec
3:hash_map_perf kmalloc 41986 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
5:hash_map_perf kmalloc 655628 events per sec
7:hash_map_perf kmalloc 650691 events per sec
2:hash_map_perf kmalloc 624384 events per sec
1:hash_map_perf kmalloc 615011 events per sec
3:hash_map_perf kmalloc 618335 events per sec
4:hash_map_perf kmalloc 624072 events per sec
6:hash_map_perf kmalloc 628559 events per sec
0:hash_map_perf kmalloc 585384 events per sec

So could you please double check your setup and rerun map_perf_test? If
there is no performance degradation, could you please share your setup
and your kernel config file?

>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove the *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.