Hi,

On 7/7/2023 12:05 PM, Hou Tao wrote:
> Hi,
>
> On 7/7/2023 10:10 AM, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 6:45 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>>>>
>>>> Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
>>>> Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into
>>>> per-cpu free list the _rcu() flavor waits for RCU grace period and then moves
>>>> objects into free_by_rcu_ttrace list where they are waiting for RCU
>>>> task trace grace period to be freed into slab.
>>>>
>>>> The life cycle of objects:
>>>> alloc: dequeue free_llist
>>>> free: enqeueu free_llist
>>>> free_rcu: enqueue free_by_rcu -> waiting_for_gp
>>>> free_llist above high watermark -> free_by_rcu_ttrace
>>>> after RCU GP waiting_for_gp -> free_by_rcu_ttrace
>>>> free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
>>> Acked-by: Hou Tao <houtao1@xxxxxxxxxx>
>> Thank you very much for code reviews and feedback.
> You are welcome. I also learn a lot from this great patch set.
>
>> btw I still believe that ABA is a non issue and prefer to keep the code as-is,
>> but for the sake of experiment I've converted it to spin_lock
>> (see attached patch which I think uglifies the code)
>> and performance across bench htab-mem and map_perf_test
>> seems to be about the same.
>> Which was a bit surprising to me.
>> Could you please benchmark it on your system?
> Will do that later. It seems if there is no cross-CPU allocation and
> free, the only possible contention is between __free_rcu() on CPU x and
> alloc_bulk()/free_bulk() on a different CPU.
>
On my local VM setup, the spin-lock also doesn't make much difference
under either htab-mem or map_perf_test, as shown below.

without spin-lock

normal bpf ma
=============
overwrite            per-prod-op: 54.16 ± 0.79k/s, avg mem: 159.99 ± 40.80MiB, peak mem: 251.41MiB
batch_add_batch_del  per-prod-op: 83.87 ± 0.86k/s, avg mem: 70.52 ± 22.73MiB, peak mem: 121.31MiB
add_del_on_diff_cpu  per-prod-op: 25.98 ± 0.13k/s, avg mem: 17.88 ± 1.84MiB, peak mem: 22.86MiB

./map_perf_test 4 8 16384
0:hash_map_perf kmalloc 361532 events per sec
2:hash_map_perf kmalloc 352594 events per sec
6:hash_map_perf kmalloc 356007 events per sec
5:hash_map_perf kmalloc 354184 events per sec
3:hash_map_perf kmalloc 348720 events per sec
1:hash_map_perf kmalloc 346332 events per sec
7:hash_map_perf kmalloc 352126 events per sec
4:hash_map_perf kmalloc 339459 events per sec

with spin-lock

normal bpf ma
=============
overwrite            per-prod-op: 54.72 ± 0.96k/s, avg mem: 133.99 ± 34.04MiB, peak mem: 221.60MiB
batch_add_batch_del  per-prod-op: 82.90 ± 1.86k/s, avg mem: 55.91 ± 11.05MiB, peak mem: 103.82MiB
add_del_on_diff_cpu  per-prod-op: 26.75 ± 0.10k/s, avg mem: 18.55 ± 1.24MiB, peak mem: 23.11MiB

./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 361750 events per sec
2:hash_map_perf kmalloc 360976 events per sec
6:hash_map_perf kmalloc 361745 events per sec
0:hash_map_perf kmalloc 350349 events per sec
7:hash_map_perf kmalloc 359125 events per sec
3:hash_map_perf kmalloc 352683 events per sec
5:hash_map_perf kmalloc 350897 events per sec
4:hash_map_perf kmalloc 331215 events per sec
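
To compare the two map_perf_test runs without eyeballing eight lines each, a small script along these lines could be used. It is only a sketch: the numbers are copied from the output above, the names WITHOUT_LOCK, WITH_LOCK and relative_change() are made up for illustration, and a single run per configuration on a VM is not enough to draw firm conclusions.

```python
from statistics import mean

# map_perf_test 4 8 16384 results in events per sec, copied from the runs above.
# One value per CPU; the list order follows the order of the printed lines.
WITHOUT_LOCK = [361532, 352594, 356007, 354184, 348720, 346332, 352126, 339459]
WITH_LOCK = [361750, 360976, 361745, 350349, 359125, 352683, 350897, 331215]


def relative_change(before, after):
    """Return the percent change of mean(after) relative to mean(before)."""
    base = mean(before)
    return (mean(after) - base) / base * 100.0


def spread(samples):
    """Return (max - min) as a percentage of the mean, a rough noise estimate."""
    return (max(samples) - min(samples)) / mean(samples) * 100.0


if __name__ == "__main__":
    print(f"without spin-lock: mean {mean(WITHOUT_LOCK):.0f} events/s, "
          f"per-CPU spread {spread(WITHOUT_LOCK):.1f}%")
    print(f"with spin-lock:    mean {mean(WITH_LOCK):.0f} events/s, "
          f"per-CPU spread {spread(WITH_LOCK):.1f}%")
    print(f"relative change:   {relative_change(WITHOUT_LOCK, WITH_LOCK):+.2f}%")
```

If the relative change between the means is much smaller than the per-CPU spread within one run, that would be consistent with the spin-lock making no measurable difference for this workload.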