Hi,

On 6/2/2023 1:36 AM, Alexei Starovoitov wrote:
> On Wed, May 3, 2023 at 7:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>> Construct the patch series as:
>>> - prep patches
>>> - benchmark
>>> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
>>> with numbers from bench(s) before and after this patch.
>> Thanks again for the suggestion. Will do in v4.
>
> It's been a month. Any update?
>
> Should we take over this work if you're busy?

Sorry for the delay. I should have posted some progress updates on the
patch set earlier. The patch set is simpler compared with v3, and I
implemented v4 about two weeks ago. The problem is that v4 doesn't work
as expected: its memory usage is huge compared with v3. The following is
the output from the htab-mem benchmark:

overwrite:
Summary: loop 11.07 ± 1.25k/s, memory usage 995.08 ± 680.87MiB, peak memory usage 2183.38MiB
batch_add_batch_del:
Summary: loop 11.48 ± 1.24k/s, memory usage 1393.36 ± 780.41MiB, peak memory usage 2836.68MiB
add_del_on_diff_cpu:
Summary: loop 6.07 ± 0.69k/s, memory usage 14.44 ± 2.34MiB, peak memory usage 20.30MiB

The direct reason for the huge memory usage is a slower RCU grace
period: the grace period used for reuse is much longer compared with v3,
about 100ms or more (e.g., 2.6s). I am still trying to find out the root
cause of the slow RCU grace period. My first guess was that the running
time of the bpf program attached to getpgid() is longer, so the context
switches in the bench are slowed down. The histogram of getpgid()
latency in v4 indeed shows a lot of abnormal tail latencies compared
with v3, as shown below.

v3 getpgid() latency during overwrite benchmark:

@hist_ms:
[0]               193451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                  767 |                                                    |
[2, 4)                75 |                                                    |
[4, 8)                 1 |                                                    |

v4 getpgid() latency during overwrite benchmark:

@hist_ms:
[0]                86270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                31252 |@@@@@@@@@@@@@@@@@@                                  |
[2, 4)                 1 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             3 |                                                    |
[256, 512)             2 |                                                    |
[512, 1K)              1 |                                                    |
[1K, 2K)               2 |                                                    |
[2K, 4K)               1 |                                                    |

I suspect the newly-added global spin-lock in the memory allocator and
the irq-work running in the context of the free procedure may lead to
the abnormal tail latencies, and I am trying to demonstrate that by
using fine-grained locks and a kworker (just temporarily). On the other
hand, considering that the number of abnormal tail latencies is tiny
compared with the total number of getpgid() syscalls, I think there may
still be other causes for the slow RCU GP.

Since the progress of v4 is delayed, how about I post v4 as soon as
possible for discussion (maybe I did something wrong), and in the
meantime continue to investigate the slow RCU grace period problem
(I will try to get some help from the RCU community)?

Regards,
Tao
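
P.S. In case it helps with reproducing, a getpgid() latency histogram
like the ones above can be collected with a bpftrace script along the
following lines. This is a minimal sketch, not necessarily the exact
script I used: it assumes the x86-64 kprobe name __x64_sys_getpgid (the
syscalls:sys_enter_getpgid / sys_exit_getpgid tracepoints would work as
well):

// Record the syscall entry time per thread.
kprobe:__x64_sys_getpgid
{
	@start[tid] = nsecs;
}

// On return, log the latency in milliseconds into a power-of-2 histogram.
kretprobe:__x64_sys_getpgid
/@start[tid]/
{
	@hist_ms = hist((nsecs - @start[tid]) / 1000000);
	delete(@start[tid]);
}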
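
To check the RCU grace period length directly, something along these
lines might work (again only a sketch: it assumes the
rcu:rcu_grace_period tracepoint with its "start"/"end" gpevent values,
which needs CONFIG_RCU_TRACE and may differ across kernel versions):

// Time the interval between the "start" and "end" events of each GP.
tracepoint:rcu:rcu_grace_period
{
	if (str(args->gpevent) == "start") {
		@gp_start = nsecs;
	}
	if (str(args->gpevent) == "end" && @gp_start) {
		@gp_ms = hist((nsecs - @gp_start) / 1000000);
		@gp_start = 0;
	}
}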