On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> Hi,
>
> On 6/6/2023 11:53 AM, Hou Tao wrote:
> > From: Hou Tao <houtao1@xxxxxxxxxx>
> >
> > Hi,
> >
> > The implementation of v4 is mainly based on suggestions from Alexei [0].
> > There are still pending problems in the current implementation, as shown
> > in the benchmark results in patch #3, but a long time has passed since
> > the posting of v3, so v4 is posted here for further discussion and more
> > suggestions.
> >
> > The first problem is the huge memory usage compared with the bpf memory
> > allocator which does immediate reuse:
> >
> > htab-mem-benchmark (reuse-after-RCU-GP):
> > | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> > | --                  | --         | --                   | --                |
> > | no_op               | 1159.18    | 0.99                 | 0.99              |
> > | overwrite           | 11.00      | 2288                 | 4109              |
> > | batch_add_batch_del | 8.86       | 1558                 | 2763              |
> > | add_del_on_diff_cpu | 4.74       | 11.39                | 14.77             |
> >
> > htab-mem-benchmark (immediate-reuse):
> > | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> > | --                  | --         | --                   | --                |
> > | no_op               | 1160.66    | 0.99                 | 1.00              |
> > | overwrite           | 28.52      | 2.46                 | 2.73              |
> > | batch_add_batch_del | 11.50      | 2.69                 | 2.95              |
> > | add_del_on_diff_cpu | 3.75       | 15.85                | 24.24             |
> >
> > It seems the direct reason is the slow RCU grace period. During the
> > benchmark, the elapsed time before the reuse_rcu() callback is invoked
> > is about 100ms or even more (e.g., 2 seconds). I suspect the global
> > per-bpf-ma spin-lock and the irq-work running in the context of the
> > freeing process increase the overhead of the bpf program: the running
> > time of getpgid() increases, context switching slows down, and the RCU
> > grace period grows longer [1], but I am still digging into it.
> For the reuse-after-RCU-GP flavor, removing the per-bpf-ma reusable list
> (namely bpf_mem_shared_cache) and using a per-cpu reusable list (as v3
> did) instead decreases the memory usage of htab-mem-benchmark a lot:
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1165.38    | 0.97                 | 1.00              |
> | overwrite           | 17.25      | 626.41               | 781.82            |
> | batch_add_batch_del | 11.51      | 398.56               | 500.29            |
> | add_del_on_diff_cpu | 4.21      | 31.06                | 48.84             |
>
> But the memory usage is still large compared with v3, and the elapsed
> time of the reuse_rcu() callback is about 90~200ms. Compared with v3,
> there are still two differences:
> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
>    accelerate the reuse of freed objects.
> 2) v3 uses a kworker instead of irq_work for the free procedure.
>
> For 1), after using kmalloc() in the irq_work to allocate multiple
> inflight RCU callbacks (namely reuse_rcu()), the memory usage decreases
> a bit, but not enough:
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | no_op               | 1247.00    | 0.97                 | 1.00              |
> | overwrite           | 16.56      | 490.18               | 557.17            |
> | batch_add_batch_del | 11.31      | 276.32               | 360.89            |
> | add_del_on_diff_cpu | 4.00       | 24.76                | 42.58             |
>
> So it seems the large memory usage is due to the irq_work (reuse_bulk)
> used for the free procedure. However, after increasing the threshold for
> invoking the reuse_bulk irq_work (e.g., using 10 * c->high_watermark),
> there is no big difference in the memory usage or in the delay of the
> RCU callbacks. Perhaps the reason is that although the number of
> reuse_bulk irq_work invocations is reduced, the number of alloc_bulk()
> irq_work invocations increases because there are no reusable objects.

The large memory usage is because the benchmark in patch 2 is abusing it.
It's doing one bpf_loop() over 16k elements (in the case of 1 producer)
and 16k/8 loops for --producers=8. That's 2k memory allocations that have
to wait for an RCU GP. Of course that's a ton of memory.

As for the implementation in patch 3, please respin it asap and remove
the *_tail optimization. It makes the code hard to read and doesn't buy
us anything. Other than that the algorithm looks fine.
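For anyone reading along without the patches in front of them, the
reuse-after-RCU-GP flow being discussed is roughly the following. This is
only an illustrative sketch with made-up names (pcpu_reuse_cache,
free_by_rcu, reuse_ready, move names and field layout), not the code from
the series:

        #include <linux/kernel.h>
        #include <linux/llist.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct pcpu_reuse_cache {
                struct llist_head free_by_rcu;  /* freed, still waiting for an RCU GP */
                struct llist_head reuse_ready;  /* GP elapsed, safe to hand out again */
                int free_cnt;
                int high_watermark;
        };

        struct reuse_batch {
                struct rcu_head rcu;
                struct llist_node *objs;
                struct pcpu_reuse_cache *cache;
        };

        static void reuse_rcu(struct rcu_head *rcu)
        {
                struct reuse_batch *batch = container_of(rcu, struct reuse_batch, rcu);
                struct llist_node *obj, *next;

                /* A grace period has passed since the objects were freed, so
                 * no RCU reader can still hold a pointer to them: make them
                 * available to the allocation path again.
                 */
                llist_for_each_safe(obj, next, batch->objs)
                        llist_add(obj, &batch->cache->reuse_ready);
                kfree(batch);
        }

        /* Invoked from irq_work once free_cnt crosses the watermark. */
        static void reuse_bulk(struct pcpu_reuse_cache *c)
        {
                struct reuse_batch *batch;

                /* kmalloc()'ing a batch allows multiple RCU callbacks to be
                 * in flight at once instead of a single embedded rcu_head.
                 */
                batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
                if (!batch)
                        return;
                batch->objs = llist_del_all(&c->free_by_rcu);
                batch->cache = c;
                call_rcu(&batch->rcu, reuse_rcu);
        }

The point of the kmalloc()'ed batches is that several grace periods can be
in flight at once, so freed objects do not pile up behind a single
rcu_head while waiting for reuse.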
> > Another problem is the performance degradation compared with immediate
> > reuse, and the output from perf report shows that the per-bpf-ma
> > spin-lock is a top-one hotspot:

That's not what I see. The hot spin_lock is in the generic htab code, not
in bpf_ma. I still believe the per-bpf-ma spin-lock is fine. The bench in
patch 2 is measuring something that no real bpf prog cares about.

See what map_perf_test is doing:

        for (i = 0; i < 10; i++) {
                bpf_map_update_elem(&hash_map_alloc, &key, &init_val, BPF_ANY);
        }

Even 10 map updates for the same map in a single bpf prog invocation is
not realistic. 16k/8 is beyond any normal scenario. There is no reason to
optimize bpf_ma for the case of htab abuse.

> > map_perf_test (reuse-after-RCU-GP)
> > 0:hash_map_perf kmalloc 194677 events per sec
> >
> > map_perf_test (immediate reuse)
> > 2:hash_map_perf kmalloc 384527 events per sec

For some reason I cannot reproduce the slowdown with map_perf_test 4 8.
I see the same perf with and without patch 3.

I've applied patch 1.

Please respin with patch 2 doing no more than 10 map_updates under rcu
lock, and remove the *_tail optimization from patch 3. Just do
llist_for_each_safe() when you move elements from one list to another.
And let's brainstorm further. Please do not delay.
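To spell out that last suggestion, moving everything from one llist to
another without keeping a tail pointer can be as simple as the sketch
below. The function name and list roles are made up here purely to
illustrate the llist_for_each_safe() pattern:

        /* Illustrative only: move all nodes from @src to @dst. Walking the
         * list is O(n) and does not preserve order, which is fine for a
         * free list, and it avoids the tail-pointer bookkeeping that made
         * the code hard to read.
         */
        static void move_free_list(struct llist_head *src, struct llist_head *dst)
        {
                struct llist_node *node, *next, *first;

                first = llist_del_all(src);
                llist_for_each_safe(node, next, first)
                        llist_add(node, dst);
        }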