Re: [PATCH v4 bpf-next 12/14] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Sat, 8 Jul 2023 15:00:25 +0800



Hi,

On 7/7/2023 12:05 PM, Hou Tao wrote:
> Hi,
>
> On 7/7/2023 10:10 AM, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 6:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>>>>
>>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>>>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>>>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>>>> task trace grace period to be freed into slab.
>>>>
>>>> The life cycle of objects:
>>>> alloc: dequeue free_llist
>>>> free: enqueue free_llist
>>>> free_rcu: enqueue free_by_rcu -> waiting_for_gp
>>>> free_llist above high watermark -> free_by_rcu_ttrace
>>>> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
>>>> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>>>>
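
To make the life cycle above easier to follow, here is a rough user-space
sketch of it as a state machine. It is only a toy model under stated
assumptions: the state names mirror the lists named in the commit message,
while the event names (EV_*), obj_next() and lifecycle_selfcheck() are
hypothetical and do not exist in the kernel. It does not model per-cpu
lists, batching thresholds or locking.

```c
#include <assert.h>

/* Where an object currently sits; names follow the commit message. */
enum obj_state {
	ST_IN_USE,			/* handed out by alloc */
	ST_FREE_LLIST,			/* per-cpu list, immediately reusable */
	ST_FREE_BY_RCU,			/* queued by free_rcu */
	ST_WAITING_FOR_GP,		/* waiting for an RCU grace period */
	ST_FREE_BY_RCU_TTRACE,		/* queued for RCU tasks trace */
	ST_WAITING_FOR_GP_TTRACE,	/* waiting for an RCU tasks trace GP */
	ST_SLAB,			/* freed back into slab */
	ST_INVALID			/* transition not in the life cycle */
};

/* Hypothetical event names, invented for this sketch only. */
enum obj_event {
	EV_ALLOC,		/* alloc: dequeue free_llist */
	EV_FREE,		/* free: enqueue free_llist */
	EV_FREE_RCU,		/* free_rcu: enqueue free_by_rcu */
	EV_MOVE_TO_WAITING,	/* queued list handed to its waiting list */
	EV_HIGH_WATERMARK,	/* free_llist above high watermark */
	EV_RCU_GP,		/* RCU grace period elapsed */
	EV_RCU_TTRACE_GP	/* RCU tasks trace grace period elapsed */
};

/* Return the next state, or ST_INVALID if the life cycle has no such edge. */
enum obj_state obj_next(enum obj_state s, enum obj_event e)
{
	switch (s) {
	case ST_FREE_LLIST:
		if (e == EV_ALLOC)
			return ST_IN_USE;
		if (e == EV_HIGH_WATERMARK)
			return ST_FREE_BY_RCU_TTRACE;
		break;
	case ST_IN_USE:
		if (e == EV_FREE)
			return ST_FREE_LLIST;
		if (e == EV_FREE_RCU)
			return ST_FREE_BY_RCU;
		break;
	case ST_FREE_BY_RCU:
		if (e == EV_MOVE_TO_WAITING)
			return ST_WAITING_FOR_GP;
		break;
	case ST_WAITING_FOR_GP:
		if (e == EV_RCU_GP)
			return ST_FREE_BY_RCU_TTRACE;
		break;
	case ST_FREE_BY_RCU_TTRACE:
		if (e == EV_MOVE_TO_WAITING)
			return ST_WAITING_FOR_GP_TTRACE;
		break;
	case ST_WAITING_FOR_GP_TTRACE:
		if (e == EV_RCU_TTRACE_GP)
			return ST_SLAB;
		break;
	default:
		break;
	}
	return ST_INVALID;
}

/* Walk the free_rcu path: the object passes both grace periods in order. */
void lifecycle_selfcheck(void)
{
	enum obj_state s = ST_IN_USE;

	s = obj_next(s, EV_FREE_RCU);
	s = obj_next(s, EV_MOVE_TO_WAITING);
	s = obj_next(s, EV_RCU_GP);
	s = obj_next(s, EV_MOVE_TO_WAITING);
	s = obj_next(s, EV_RCU_TTRACE_GP);
	assert(s == ST_SLAB);
}
```

In this model the only edge back to ST_IN_USE starts at ST_FREE_LLIST, so an
object released with the _rcu() flavor is not handed out again before it
reaches slab, whereas a plain free makes it reusable right away.
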
>>>> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>>> Acked-by: Hou Tao <houtao1@xxxxxxxxxx>
>> Thank you very much for code reviews and feedback.
> You are welcome. I also learned a lot from this great patch set.
>
>> btw I still believe that ABA is a non issue and prefer to keep the code as-is,
>> but for the sake of experiment I've converted it to spin_lock
>> (see attached patch which I think uglifies the code)
>> and performance across bench htab-mem and map_perf_test
>> seems to be about the same.
>> Which was a bit surprising to me.
>> Could you please benchmark it on your system?
> Will do that later. It seems if there is no cross-CPU allocation and
> free, the only possible contention is between __free_rcu() on CPU x and
> alloc_bulk()/free_bulk() on a different CPU.
>
For my local VM setup, the spin-lock also doesn't make much difference
under either htab-mem or map_perf_test, as shown below.

without spin-lock

normal bpf ma
=============
overwrite            per-prod-op: 54.16 ± 0.79k/s, avg mem: 159.99 ± 40.80MiB, peak mem: 251.41MiB
batch_add_batch_del  per-prod-op: 83.87 ± 0.86k/s, avg mem: 70.52 ± 22.73MiB, peak mem: 121.31MiB
add_del_on_diff_cpu  per-prod-op: 25.98 ± 0.13k/s, avg mem: 17.88 ± 1.84MiB, peak mem: 22.86MiB

./map_perf_test 4 8 16384
0:hash_map_perf kmalloc 361532 events per sec
2:hash_map_perf kmalloc 352594 events per sec
6:hash_map_perf kmalloc 356007 events per sec
5:hash_map_perf kmalloc 354184 events per sec
3:hash_map_perf kmalloc 348720 events per sec
1:hash_map_perf kmalloc 346332 events per sec
7:hash_map_perf kmalloc 352126 events per sec
4:hash_map_perf kmalloc 339459 events per sec

with spin-lock

normal bpf ma
=============
overwrite            per-prod-op: 54.72 ± 0.96k/s, avg mem: 133.99 ± 34.04MiB, peak mem: 221.60MiB
batch_add_batch_del  per-prod-op: 82.90 ± 1.86k/s, avg mem: 55.91 ± 11.05MiB, peak mem: 103.82MiB
add_del_on_diff_cpu  per-prod-op: 26.75 ± 0.10k/s, avg mem: 18.55 ± 1.24MiB, peak mem: 23.11MiB

./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 361750 events per sec
2:hash_map_perf kmalloc 360976 events per sec
6:hash_map_perf kmalloc 361745 events per sec
0:hash_map_perf kmalloc 350349 events per sec
7:hash_map_perf kmalloc 359125 events per sec
3:hash_map_perf kmalloc 352683 events per sec
5:hash_map_perf kmalloc 350897 events per sec
4:hash_map_perf kmalloc 331215 events per sec
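
To compare the two map_perf_test runs as single numbers, the per-CPU lines
can be averaged. The snippet below is a small user-space sketch, not part of
the patch set: the arrays are copied from the output above, and the helper
names (mean_events, relative_change_pct) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* hash_map_perf kmalloc events per sec, one entry per line of output. */
const long without_lock[8] = {
	361532, 352594, 356007, 354184, 348720, 346332, 352126, 339459
};
const long with_lock[8] = {
	361750, 360976, 361745, 350349, 359125, 352683, 350897, 331215
};

/* Arithmetic mean of n samples; n must be non-zero. */
double mean_events(const long *v, size_t n)
{
	long sum = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return (double)sum / (double)n;
}

/* Change of 'other' relative to 'base', in percent. */
double relative_change_pct(double base, double other)
{
	return (other - base) / base * 100.0;
}
```

With these inputs the means are 351369.25 events per sec without the
spin-lock and 353592.5 with it, a difference of roughly 0.6%. This is a
single run on a VM, so that gap is probably within the run-to-run noise.
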