Re: [PATCH bpf-next] selftests/bpf: reduce more flakyness in sockmap_listen

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Fri, 3 Sep 2021 16:44:16 -0700

On Wed, Sep 1, 2021 at 8:35 PM sunyucong@xxxxxxxxx <sunyucong@xxxxxxxxx> wrote:
>
> On Wed, Sep 1, 2021 at 9:33 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 31, 2021 at 12:33 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > Like I mentioned before, I suspect there is some delay in one of
> > > the queues on the way or there is a worker wakeup latency.
> > > I will try adding some tracepoints to see if I can capture it.
> > >
> >
> > I tried to revert this patch locally to reproduce the EAGAIN
> > failure, but even after repeating the sockmap_listen test hundreds
> > of times, I didn't see any failure here.
> >
> > If you are still interested in this issue, I'd suggest you adding some
> > tracepoints to see what happens to kworker or the packet queues.
> >
> > It does not look like a sockmap bug, otherwise I would be able to
> > reproduce it here.
> >
>
> Cong, the issue is not that read() sometimes returns EAGAIN.
>
> It is that when using select on the redirected socket,  it will hang forever.

Hmm? We don't use any select(), do we? Before your patch, I used
a for loop. With your patch, it is a loop with usleep().

Actually I just reproduced this EAGAIN issue here. I ran `git revert`
but it didn't actually revert your patch for some reason, so I had to
remove those usleep() calls manually, and then I finally reproduced it.

I used strace -ttt to measure the time spent on 100 read() calls; it is
about 0.2ms in total. However, runqslower shows that the kworker wakeup
latency can be 10+ms (the last column is the latency in microseconds):

19:29:16 kworker/2:0      19836           14071
19:29:18 kworker/1:0      19836           14369
19:29:20 ksoftirqd/2      19794           12731
19:29:20 kworker/2:0      23              11059
19:29:21 kworker/1:0      19836           11020

So clearly retrying read() 100 times falls far short of covering the
worst-case delay. And the wakeup latency is only part of the packet
latency, so in the worst case a packet can be delayed by more than
10ms, which is roughly 5000 read() calls (at about 2us per call).
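
To make the arithmetic explicit, here is a small sketch using the
numbers measured above (100 read() calls in about 0.2ms, wakeup latency
of 10+ms). The helper names are made up for illustration and are not
part of the selftests:

```c
#include <stdint.h>

/* Average cost of one call in nanoseconds, from a measured total. */
static uint64_t per_call_ns(uint64_t total_ns, uint64_t calls)
{
	return calls ? total_ns / calls : 0;
}

/* Number of back-to-back calls needed to span a given delay (rounded up). */
static uint64_t calls_to_cover(uint64_t delay_ns, uint64_t call_ns)
{
	return call_ns ? (delay_ns + call_ns - 1) / call_ns : 0;
}
```

With 200000ns for 100 calls, per_call_ns() gives 2000ns, and
calls_to_cover(10000000, 2000) gives 5000, so a 100-iteration loop
covers only about 2% of a 10ms delay.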

Anyway, this is not a bug in sockmap; the problem is that the
sockmap_listen tests do not use blocking mode.
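
As a rough sketch of what I mean by blocking mode: instead of retrying
a non-blocking read() a fixed number of times, a test could block with
a bounded timeout. The helper below is hypothetical (the name and the
one-second granularity are my assumptions, and it is not code from the
selftests), and it assumes the socket's receive path honours
SO_RCVTIMEO:

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: blocking recv() that gives up after timeout_sec. */
static ssize_t recv_blocking_timeout(int fd, void *buf, size_t len,
				     int timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;
	/* Clear O_NONBLOCK so recv() sleeps instead of returning EAGAIN. */
	if ((fl & O_NONBLOCK) && fcntl(fd, F_SETFL, fl & ~O_NONBLOCK) < 0)
		return -1;
	/* Bound the wait so a lost packet fails the test, not hangs it. */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	/* On timeout this returns -1 with errno EAGAIN/EWOULDBLOCK. */
	return recv(fd, buf, len, 0);
}
```

This way the wait is bounded by wall-clock time rather than by a retry
count, so it does not depend on how fast read() happens to return.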

Thanks.