Hi,

On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 6/6/2023 11:53 AM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>
>>> Hi,
>>>
>>> The implementation of v4 is mainly based on suggestions from Alexei [0].
>>> There are still pending problems in the current implementation, as shown
>>> by the benchmark results in patch #3, but a long time has passed since
>>> the posting of v3, so v4 is posted here for further discussion and more
>>> suggestions.
>>>
>>> The first problem is the huge memory usage compared with the bpf memory
>>> allocator which does immediate reuse:
SNIP
> The large memory usage is because the benchmark in patch 2 is abusing it.
> It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> and 16k/8 loops for --producers=8.
> That's 2k memory allocations that have to wait for RCU GP.
> Of course that's a ton of memory.
>
> As far as the implementation in patch 3, please respin it asap and remove
> the *_tail optimization. It makes the code hard to read and doesn't buy
> us anything. Other than that the algorithm looks fine.
>
>>> Another problem is the performance degradation compared with immediate
>>> reuse, and the output from perf report shows that the per-bpf-ma
>>> spin-lock is a top-one hotspot:
> That's not what I see.
> The hot spin_lock is in the generic htab code, not in the ma.
> I still believe the per-bpf-ma spin-lock is fine.
> The bench in patch 2 is measuring something that no real bpf prog cares about.
>
> See how map_perf_test is doing:
>
>	for (i = 0; i < 10; i++) {
>		bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
>
> Even 10 map updates for the same map in a single bpf prog invocation is
> not realistic. 16k/8 is beyond any normal scenario.
> There is no reason to optimize bpf_ma for the case of htab abuse.
>
>>> map_perf_test (reuse-after-RCU-GP)
>>> 0:hash_map_perf kmalloc 194677 events per sec
>>>
>>> map_perf_test (immediate reuse)
>>> 2:hash_map_perf kmalloc 384527 events per sec
> For some reason I cannot reproduce the slow down with map_perf_test 4 8.
> I see the same perf with/without patch 3.

As noted in the commit message, the command line used for the test is
"./map_perf_test 4 8 16384", because the default max_entries is only 1000.
With the default max_entries, use_percpu_counter will be false whenever
the number of CPUs is greater than 15.

I have double checked my local VM setup (8 CPUs + 16GB) and rerun the
test. For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
there is an obvious performance degradation.

Before reuse-after-rcu-gp:

[root@hello bpf]# ./map_perf_test 4 8
5:hash_map_perf kmalloc 358498 events per sec
4:hash_map_perf kmalloc 306351 events per sec
1:hash_map_perf kmalloc 299175 events per sec
3:hash_map_perf kmalloc 297689 events per sec
6:hash_map_perf kmalloc 299839 events per sec
2:hash_map_perf kmalloc 292286 events per sec
7:hash_map_perf kmalloc 278138 events per sec
0:hash_map_perf kmalloc 265031 events per sec

[root@hello bpf]# ./map_perf_test 4 8 16384
2:hash_map_perf kmalloc 359201 events per sec
6:hash_map_perf kmalloc 302475 events per sec
3:hash_map_perf kmalloc 298583 events per sec
7:hash_map_perf kmalloc 301594 events per sec
0:hash_map_perf kmalloc 295210 events per sec
4:hash_map_perf kmalloc 230496 events per sec
5:hash_map_perf kmalloc 163530 events per sec
1:hash_map_perf kmalloc 153520 events per sec

After reuse-after-rcu-gp:

[root@hello bpf]# ./map_perf_test 4 8
1:hash_map_perf kmalloc 203252 events per sec
2:hash_map_perf kmalloc 181777 events per sec
6:hash_map_perf kmalloc 183467 events per sec
4:hash_map_perf kmalloc 182590 events per sec
0:hash_map_perf kmalloc 180840 events per sec
3:hash_map_perf kmalloc 179875 events per sec
5:hash_map_perf kmalloc 152250 events per sec
7:hash_map_perf kmalloc 137500 events per sec
[root@hello bpf]# ./map_perf_test 4 8 16384
4:hash_map_perf kmalloc 203983 events per sec
5:hash_map_perf kmalloc 197902 events per sec
2:hash_map_perf kmalloc 184693 events per sec
3:hash_map_perf kmalloc 185341 events per sec
1:hash_map_perf kmalloc 183064 events per sec
7:hash_map_perf kmalloc 181148 events per sec
0:hash_map_perf kmalloc 178520 events per sec
6:hash_map_perf kmalloc 179340 events per sec

I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
performance for "./map_perf_test 4 8" is similar, but there is an
obvious performance difference for "./map_perf_test 4 8 16384".

Before reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
1:hash_map_perf kmalloc 41711 events per sec
3:hash_map_perf kmalloc 41352 events per sec
4:hash_map_perf kmalloc 41352 events per sec
0:hash_map_perf kmalloc 41008 events per sec
7:hash_map_perf kmalloc 41086 events per sec
5:hash_map_perf kmalloc 41038 events per sec
2:hash_map_perf kmalloc 40971 events per sec
6:hash_map_perf kmalloc 41008 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
1:hash_map_perf kmalloc 388088 events per sec
7:hash_map_perf kmalloc 391843 events per sec
6:hash_map_perf kmalloc 387901 events per sec
3:hash_map_perf kmalloc 356096 events per sec
4:hash_map_perf kmalloc 356836 events per sec
2:hash_map_perf kmalloc 349728 events per sec
0:hash_map_perf kmalloc 345792 events per sec
5:hash_map_perf kmalloc 346742 events per sec

After reuse-after-rcu-gp:

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8
4:hash_map_perf kmalloc 42667 events per sec
1:hash_map_perf kmalloc 42206 events per sec
5:hash_map_perf kmalloc 42264 events per sec
6:hash_map_perf kmalloc 42196 events per sec
7:hash_map_perf kmalloc 42142 events per sec
2:hash_map_perf kmalloc 42028 events per sec
0:hash_map_perf kmalloc 41933 events per sec
3:hash_map_perf kmalloc 41986 events per sec

[houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
5:hash_map_perf kmalloc 655628 events per sec
7:hash_map_perf kmalloc 650691 events per sec
2:hash_map_perf kmalloc 624384 events per sec
1:hash_map_perf kmalloc 615011 events per sec
3:hash_map_perf kmalloc 618335 events per sec
4:hash_map_perf kmalloc 624072 events per sec
6:hash_map_perf kmalloc 628559 events per sec
0:hash_map_perf kmalloc 585384 events per sec

So could you please double check your setup and rerun map_perf_test? If
there is no performance degradation, could you please share your setup
and your kernel config file?

>
> I've applied patch 1.
> Please respin with patch 2 doing no more than 10 map_updates under rcu lock
> and remove the *_tail optimization from patch 3.
> Just do llist_for_each_safe() when you move elements from one list to another.
> And let's brainstorm further.
> Please do not delay.