On Thu, Jun 1, 2023 at 7:40 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 6/2/2023 1:36 AM, Alexei Starovoitov wrote:
> > On Wed, May 3, 2023 at 7:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>> Construct the patch series as:
> >>> - prep patches
> >>> - benchmark
> >>> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
> >>>   with numbers from bench(s) before and after this patch.
> >> Thanks again for the suggestion. Will do in v4.
> >
> > It's been a month. Any update?
> >
> > Should we take over this work if you're busy?
> Sorry for the delay. I should have posted some progress information
> about the patch set earlier. The patch set is simpler than v3, and I
> implemented v4 about two weeks ago. The problem is that v4 doesn't work
> as expected: its memory usage is huge compared with v3. The following is
> the output from the htab-mem benchmark:
>
> overwrite:
> Summary: loop 11.07 ± 1.25k/s, memory usage 995.08 ± 680.87MiB,
>          peak memory usage 2183.38MiB
> batch_add_batch_del:
> Summary: loop 11.48 ± 1.24k/s, memory usage 1393.36 ± 780.41MiB,
>          peak memory usage 2836.68MiB
> add_del_on_diff_cpu:
> Summary: loop 6.07 ± 0.69k/s, memory usage 14.44 ± 2.34MiB,
>          peak memory usage 20.30MiB
>
> The direct reason for the huge memory usage is a slower RCU grace
> period. The RCU grace period used for reuse is much longer than in v3:
> it is about 100ms or more (e.g., 2.6s). I am still trying to find the
> root cause of the slow RCU grace period. The first guess is that the
> bpf program attached to getpgid() runs for longer, so the context
> switch in the bench is slowed down. The histogram of getpgid() latency
> in v4 indeed shows a lot of abnormal tail latencies compared with v3,
> as shown below.
>
> v3 getpid() latency during overwrite benchmark:
> @hist_ms:
> [0]           193451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]              767 |                                                    |
> [2, 4)            75 |                                                    |
> [4, 8)             1 |                                                    |
>
> v4 getpid() latency during overwrite benchmark:
> @hist_ms:
> [0]            86270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]            31252 |@@@@@@@@@@@@@@@@@@                                  |
> [2, 4)             1 |                                                    |
> [4, 8)             0 |                                                    |
> [8, 16)            0 |                                                    |
> [16, 32)           0 |                                                    |
> [32, 64)           0 |                                                    |
> [64, 128)          0 |                                                    |
> [128, 256)         3 |                                                    |
> [256, 512)         2 |                                                    |
> [512, 1K)          1 |                                                    |
> [1K, 2K)           2 |                                                    |
> [2K, 4K)           1 |                                                    |
>
> I think the newly-added global spin-lock in the memory allocator and
> the irq-work running in the context of the free procedure may lead to
> the abnormal tail latency, and I am trying to demonstrate that by using
> fine-grained locks and a kworker (just temporarily). On the other hand,
> the number of abnormal tail latencies is much smaller than the total
> number of getpgid() syscalls, so I think there may still be other
> causes for the slow RCU GP.
>
> Because progress on v4 is delayed, how about I post v4 as soon as
> possible for discussion (maybe I did it wrong) and at the same time
> continue to investigate the slow RCU grace period problem (I will try
> to get some help from the RCU community)?

Yes. Please send v4.
Let's investigate the huge memory consumption together.
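
For reference, the share of slow calls in the two quoted histograms can be tallied with a short script. This is only a sketch: the bucket counts are copied from the histograms above, each bucket is keyed by its lower bound in milliseconds, and the names `V3`, `V4`, and `share_at_least` are illustrative and are not part of the benchmark or the patch set.

```python
# Bucket counts copied from the quoted histograms; keys are the lower
# bound of each bucket in milliseconds (the "K" suffix is taken as 1024).
V3 = {0: 193451, 1: 767, 2: 75, 4: 1}
V4 = {0: 86270, 1: 31252, 2: 1, 4: 0, 8: 0, 16: 0, 32: 0, 64: 0,
      128: 3, 256: 2, 512: 1, 1024: 2, 2048: 1}


def share_at_least(hist, threshold_ms):
    """Return (slow, total, fraction) for buckets starting at or above threshold_ms."""
    total = sum(hist.values())
    slow = sum(count for lower, count in hist.items() if lower >= threshold_ms)
    return slow, total, slow / total


if __name__ == "__main__":
    for name, hist in (("v3", V3), ("v4", V4)):
        for threshold in (1, 128):
            slow, total, frac = share_at_least(hist, threshold)
            print(f"{name}: {slow}/{total} calls >= {threshold} ms ({frac:.4%})")
```

On the quoted data, 9 of 117532 v4 calls fall in buckets of 128 ms or more, while calls of at least 1 ms go from 843 of 194294 in v3 to 31262 of 117532 in v4. The extreme tail is tiny in count, but the 1 ms bucket also grows substantially.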