Hi,

On 6/2/2023 1:36 AM, Alexei Starovoitov wrote:
> On Wed, May 3, 2023 at 7:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>> Construct the patch series as:
>>> - prep patches
>>> - benchmark
>>> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
>>> with numbers from bench(s) before and after this patch.
>> Thanks again for the suggestion. Will do in v4.
>
> It's been a month. Any update?
>
> Should we take over this work if you're busy?

Sorry for the delay. I should have posted some progress updates on the
patch set earlier. The patch set is simpler compared with v3, and I
implemented v4 about two weeks ago. The problem is that v4 doesn't work
as expected: its memory usage is huge compared with v3. The following is
the output from the htab-mem benchmark:

overwrite:
Summary: loop 11.07 ± 1.25k/s, memory usage 995.08 ± 680.87MiB, peak memory usage 2183.38MiB
batch_add_batch_del:
Summary: loop 11.48 ± 1.24k/s, memory usage 1393.36 ± 780.41MiB, peak memory usage 2836.68MiB
add_del_on_diff_cpu:
Summary: loop 6.07 ± 0.69k/s, memory usage 14.44 ± 2.34MiB, peak memory usage 20.30MiB

The direct reason for the huge memory usage is a slower RCU grace
period: the grace period used for reuse is much longer compared with v3,
about 100ms or more (e.g., 2.6s). I am still trying to find out the root
cause of the slow RCU grace period. My first guess was that the running
time of the bpf program attached to getpgid() is longer, so the context
switches in the bench are slowed down. The histogram of getpgid()
latency in v4 indeed shows a lot of abnormal tail latencies compared
with v3, as shown below.

v3 getpgid() latency during overwrite benchmark:

@hist_ms:
[0]               193451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                  767 |                                                    |
[2, 4)                75 |                                                    |
[4, 8)                 1 |                                                    |

v4 getpgid() latency during overwrite benchmark:

@hist_ms:
[0]                86270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                31252 |@@@@@@@@@@@@@@@@@@                                  |
[2, 4)                 1 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             3 |                                                    |
[256, 512)             2 |                                                    |
[512, 1K)              1 |                                                    |
[1K, 2K)               2 |                                                    |
[2K, 4K)               1 |                                                    |

I suspect the newly-added global spin-lock in the memory allocator and
the irq-work running in the context of the free procedure may lead to
the abnormal tail latencies, and I am trying to demonstrate that by
using fine-grained locks and a kworker (just temporarily). On the other
hand, considering that the number of abnormal tail latencies is tiny
compared with the total number of getpgid() syscalls, I think there may
still be other causes for the slow RCU GP.

Since the progress of v4 is delayed, how about I post v4 as soon as
possible for discussion (maybe I did something wrong), and in the
meantime continue to investigate the slow RCU grace period problem
(I will try to get some help from the RCU community)?

Regards,
Tao
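
P.S. In case it helps with reproducing, a getpgid() latency histogram
like the ones above can be collected with a bpftrace script along the
following lines. This is a minimal sketch, not necessarily the exact
script I used: it assumes the x86-64 kprobe name __x64_sys_getpgid (the
syscalls:sys_enter_getpgid / sys_exit_getpgid tracepoints would work as
well):

// Record the syscall entry time per thread.
kprobe:__x64_sys_getpgid
{
	@start[tid] = nsecs;
}

// On return, log the latency in milliseconds into a power-of-2 histogram.
kretprobe:__x64_sys_getpgid
/@start[tid]/
{
	@hist_ms = hist((nsecs - @start[tid]) / 1000000);
	delete(@start[tid]);
}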
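
To check the RCU grace period length directly, something along these
lines might work (again only a sketch: it assumes the
rcu:rcu_grace_period tracepoint with its "start"/"end" gpevent values,
which needs CONFIG_RCU_TRACE and may differ across kernel versions):

// Time the interval between the "start" and "end" events of each GP.
tracepoint:rcu:rcu_grace_period
{
	if (str(args->gpevent) == "start") {
		@gp_start = nsecs;
	}
	if (str(args->gpevent) == "end" && @gp_start) {
		@gp_ms = hist((nsecs - @gp_start) / 1000000);
		@gp_start = 0;
	}
}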