On Tue, Jun 6, 2023 at 6:19 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 6/7/2023 5:04 AM, Alexei Starovoitov wrote:
> > On Tue, Jun 06, 2023 at 08:30:58PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/6/2023 11:53 AM, Hou Tao wrote:
> >>> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>>
> >>> Hi,
> >>>
> >>> The implementation of v4 is mainly based on suggestions from Alexei [0].
> >>> There are still pending problems for the current implementation as shown
> >>> in the benchmark result in patch #3, but there was a long time from the
> >>> posting of v3, so posting v4 here for further discussions and more
> >>> suggestions.
> >>>
> >>> The first problem is the huge memory usage compared with bpf memory
> >>> allocator which does immediate reuse:
> >>>
> >>> htab-mem-benchmark (reuse-after-RCU-GP):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1159.18   | 0.99                | 0.99             |
> >>> | overwrite          | 11.00     | 2288                | 4109             |
> >>> | batch_add_batch_del| 8.86      | 1558                | 2763             |
> >>> | add_del_on_diff_cpu| 4.74      | 11.39               | 14.77            |
> >>>
> >>> htab-mem-benchmark (immediate-reuse):
> >>> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >>> | --                 | --        | --                  | --               |
> >>> | no_op              | 1160.66   | 0.99                | 1.00             |
> >>> | overwrite          | 28.52     | 2.46                | 2.73             |
> >>> | batch_add_batch_del| 11.50     | 2.69                | 2.95             |
> >>> | add_del_on_diff_cpu| 3.75      | 15.85               | 24.24            |
> >>>
> >>> It seems the direct reason is the slow RCU grace period. During
> >>> benchmark, the elapsed time when reuse_rcu() callback is called is about
> >>> 100ms or even more (e.g., 2 seconds). I suspect the global per-bpf-ma
> >>> spin-lock and the irq-work running in the context of freeing process will
> >>> increase the running overhead of bpf program, the running time of
> >>> getpgid() is increased, the context switch is slowed down and the RCU
> >>> grace period increases [1], but I am still digging into it.
> >> For reuse-after-RCU-GP flavor, by removing per-bpf-ma reusable list
> >> (namely bpf_mem_shared_cache) and using per-cpu reusable list (like v3
> >> did) instead, the memory usage of htab-mem-benchmark will decrease a lot:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1165.38   | 0.97                | 1.00             |
> >> | overwrite          | 17.25     | 626.41              | 781.82           |
> >> | batch_add_batch_del| 11.51     | 398.56              | 500.29           |
> >> | add_del_on_diff_cpu| 4.21      | 31.06               | 48.84            |
> >>
> >> But the memory usage is still large compared with v3 and the elapsed
> >> time of reuse_rcu() callback is about 90~200ms. Compared with v3, there
> >> are still two differences:
> >> 1) v3 uses kmalloc() to allocate multiple inflight RCU callbacks to
> >> accelerate the reuse of freed objects.
> >> 2) v3 uses kworker instead of irq_work for free procedure.
> >>
> >> For 1), after using kmalloc() in irq_work to allocate multiple inflight
> >> RCU callbacks (namely reuse_rcu()), the memory usage decreases a bit,
> >> but is not enough:
> >>
> >> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks):
> >> | name               | loop (k/s)| average memory (MiB)| peak memory (MiB)|
> >> | --                 | --        | --                  | --               |
> >> | no_op              | 1247.00   | 0.97                | 1.00             |
> >> | overwrite          | 16.56     | 490.18              | 557.17           |
> >> | batch_add_batch_del| 11.31     | 276.32              | 360.89           |
> >> | add_del_on_diff_cpu| 4.00      | 24.76               | 42.58            |
> >>
> >> So it seems the large memory usage is due to irq_work (reuse_bulk) used
> >> for free procedure. However, after increasing the threshold for invoking
> >> irq_work reuse_bulk (e.g., use 10 * c->high_watermark), there is no
> >> big difference in the memory usage and the delayed time for RCU
> >> callbacks.
> >> Perhaps the reason is that although the number of reuse_bulk
> >> irq_work calls is reduced, the time of alloc_bulk() irq_work calls is
> >> increased because there are no reusable objects.
> > The large memory usage is because the benchmark in patch 2 is abusing it.
> > It's doing one bpf_loop() over 16k elements (in case of 1 producer)
> > and 16k/8 loops for --producers=8.
> > That's 2k memory allocations that have to wait for RCU GP.
> > Of course that's a ton of memory.
> I don't agree with that. Because in v3, the benchmark is the same, but both
> the performance and the memory usage are better than v4. Even compared
> with "htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list +
> multiple reuse_rcu() callbacks)" above, the memory usage in v3 is still
> much smaller as shown below. If the large memory usage is due to the
> abuse in benchmark, how do you explain the memory usage in v3?

There could have been implementation bugs or whatever else.
The main point is the bench test is not realistic and should not be
used to make design decisions.

> The reason I added tail for each list is that there could be thousands
> or even tens of thousands of elements in these lists and there is no need
> to spend CPU time to traverse these lists one by one. It may be a premature
> optimization. So let me remove tails from these lists first and I will
> try to add these tails back later and check whether or not there is any
> performance improvement.

There will be thousands of elements only because the bench test is wrong.
It's doing something no real prog would do.

> I have a different view for the benchmark. Firstly htab is not the only
> user of bpf memory allocator, secondly we can't predict the exact
> behavior of bpf programs, so I think to stress bpf memory allocator for
> various kinds of use cases is good for its broad usage.

It is not a stress test. It's an abuse.
A stress test would be something that can happen in practice.
Doing thousands of map_updates in a forever loop is not something useful
code would do.
For example, call_rcu_tasks_trace is not designed to be called millions
of times a second.
It's an anti-pattern and rcu core won't be optimized to do so.
rcu, srcu, rcu_tasks_trace have different usage patterns.
The programmer has to correctly pick one depending on the use case.
Same with bpf htab. If somebody has a real need to do thousands of updates
under rcu lock they should be using preallocated map and deal with
immediate reuse.
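
To make the memory argument concrete, here is a rough userspace model of
overwriting one element in a tight loop. It is only a sketch and not kernel
code: peak_objects(), updates and gp_len are names made up for this
illustration, and the grace period is assumed to last a fixed number of
overwrites. Per-cpu caches, irq_work and watermarks are not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT kernel code. One map element is overwritten
 * `updates` times; each overwrite allocates a new object and frees the
 * old one. A freed object may be handed out again only `gp_len`
 * overwrites after it was freed (gp_len == 0 models immediate reuse).
 * Returns how many objects had to be allocated in total, which is also
 * the peak, since this model never returns memory to the system.
 */
size_t peak_objects(size_t updates, size_t gp_len)
{
	size_t total = 1;    /* the element initially in the map */
	size_t waiting = 0;  /* freed, grace period not yet over */
	size_t reusable = 0; /* freed and safe to reuse */

	for (size_t i = 0; i < updates; i++) {
		/* The object freed gp_len overwrites ago becomes reusable. */
		if (gp_len > 0 && i >= gp_len) {
			waiting--;
			reusable++;
		}

		/* Allocate the new value: reuse if possible, else grow. */
		if (reusable > 0)
			reusable--;
		else
			total++;

		/* Free the old value. */
		if (gp_len == 0)
			reusable++;
		else
			waiting++;

		/* Exactly one object is live in the map at any time. */
		assert(waiting + reusable + 1 == total);
	}
	return total;
}
```

In this model, 2048 overwrites (the 16k/8 case) need 2 objects with
immediate reuse, i.e. peak_objects(2048, 0) == 2, but 2049 objects when the
grace period outlasts the whole loop, e.g. peak_objects(2048, 100000) == 2049.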