Re: [RFC bpf-next v3 3/6] bpf: Introduce BPF_MA_REUSE_AFTER_RCU_GP

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 2 Jun 2023 09:25:34 -0700

On Thu, Jun 1, 2023 at 7:40 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 6/2/2023 1:36 AM, Alexei Starovoitov wrote:
> > On Wed, May 3, 2023 at 7:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>> Construct the patch series as:
> >>> - prep patches
> >>> - benchmark
> >>> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
> >>>   with numbers from bench(s) before and after this patch.
> >> Thanks again for the suggestion. Will do in v4.
> >
> > It's been a month. Any update?
> >
> > Should we take over this work if you're busy?
> Sorry for the delay. I should post some progress information about the
> patch set early. The patch set is simpler compared with v3, I had
> implemented v4 about two weeks ago. The problem is v4 don't work as
> expected: its memory usage is huge compared with v3. The following is
> the output from htab-mem benchmark:
>
> overwrite:
> Summary: loop   11.07 ±    1.25k/s, memory usage  995.08 ±  680.87MiB,
> peak memory usage 2183.38MiB
> batch_add_batch_del:
> Summary: loop   11.48 ±    1.24k/s, memory usage 1393.36 ±  780.41MiB,
> peak memory usage 2836.68MiB
> add_del_on_diff_cpu:
> Summary: loop    6.07 ±    0.69k/s, memory usage   14.44 ±    2.34MiB,
> peak memory usage   20.30MiB
>
> The direct reason for the huge memory usage is slower RCU grace period.
> The RCU grace period used for reuse is much longer compared with v3 and
> it is about 100ms or more (e.g, 2.6s). I am still trying to find out the
> root cause of the slow RCU grace period. The first guest is the running
> time of bpf program attached to getpgid() is longer, so the context
> switch in bench is slowed down. The hist-diagram of getpgid() latency in
> v4 indeed manifests a lot of abnormal tail latencies compared with v3 as
> shown below.
>
> v3 getpid() latency during overwrite benchmark:
> @hist_ms:
> [0]               193451
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]                  767
> |                                                    |
> [2, 4)                75
> |                                                    |
> [4, 8)                 1
> |                                                    |
>
> v4 getpid() latency during overwrite benchmark:
> @hist_ms:
> [0]                86270
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1]                31252
> |@@@@@@@@@@@@@@@@@@                                  |
> [2, 4)                 1
> |                                                    |
> [4, 8)                 0
> |                                                    |
> [8, 16)                0
> |                                                    |
> [16, 32)               0
> |                                                    |
> [32, 64)               0
> |                                                    |
> [64, 128)              0
> |                                                    |
> [128, 256)             3
> |                                                    |
> [256, 512)             2
> |                                                    |
> [512, 1K)              1
> |                                                    |
> [1K, 2K)               2
> |                                                    |
> [2K, 4K)               1
> |                                                    |
>
> I think the newly-added global spin-lock in memory allocator and
> irq-work running under the context of free procedure may lead to
> abnormal tail latency and I am trying to demonstrate that by using
> fine-grain locks and kworker (just temporarily). But on the other side,
> considering the number of abnormal tail latency is much smaller compared
> with the total number of getpgid() syscall, so I think maybe there is
> still other causes for the slow RCU GP.
>
> Because the progress of v4 is delayed, so how about I post v4 as soon as
> possible for discussion (maybe I did it wrong) and at the same time I
> continue to investigate the slow RCU grace period problem (I will try to
> get some help from RCU community) ?

Yes. Please send v4. Let's investigate huge memory consumption together.