Re: Warnings in test_maps selftests in test_sockmap parts

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 10 Mar 2021 13:41:14 -0800

On Wed, Dec 23, 2020 at 12:13 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Dec 22, 2020 at 11:32 PM John Fastabend
> <john.fastabend@xxxxxxxxx> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Tue, Dec 22, 2020 at 11:44 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > Hi John and Martin,
> > > >
> > > > I've noticed that all our CIs (libbpf and kernel-patches) started to
> > > > emit kernel warning when running test_maps test. I've narrowed it down
> > > > to test_sockmap() part of it. If I disable it, the warning goes away.
> > > > The warning looks like this:
> > > >
> > > > Failed sockmap unexpected timeout

So this is still happening, quite often. It would be good to take care
of it, it makes test_maps quite unstable and just reduces the
signal-to-noise ratio of kernel-patches CI :(

> > > > [   23.316720] ------------[ cut here ]------------
> > > > [   23.317615] WARNING: CPU: 0 PID: 93 at net/core/stream.c:208
> > > > sk_stream_kill_queues+0x111/0x120
> > > > [   23.319194] Modules linked in:
> > > > [   23.319698] CPU: 0 PID: 93 Comm: test_maps Not tainted
> > > > 5.10.0-rc7-02263-g0e12c0271887-dirty #51
> > > > [   23.321297] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > BIOS 1.9.3-1.el7.centos 04/01/2014
> > > > [   23.322756] RIP: 0010:sk_stream_kill_queues+0x111/0x120
> > > > [   23.323668] Code: 00 85 c0 75 1f 85 f6 75 21 5b 5d c3 48 89 df e8
> > > > d5 fd fe ff 8b 83 68 02 00 00 8b b3 20 02 00 00 85 c0 74 e1 0f 0b 85
> > > > f6 74 df <0f> 0b 5b 5d c3 0f 0b eb ac 66 0f 1f 44 00 00 0f 1f 44 00 00
> > > > 41 57
> > > > [   23.326251] RSP: 0018:ffffc900001d7d80 EFLAGS: 00010202
> > > > [   23.326863] RAX: 0000000000000000 RBX: ffff88813a44a280 RCX: 0000000000000000
> > > > [   23.327597] RDX: 0000000000000001 RSI: 0000000000000aec RDI: ffff88813a44a3d0
> > > > [   23.328375] RBP: ffff88813a44a3d0 R08: 0000000000000000 R09: 0000000000000001
> > > > [   23.329300] R10: 0000000000000050 R11: 0000000000000000 R12: ffff888100362200
> > > > [   23.330281] R13: ffff88802564b8e0 R14: ffff8881006175f0 R15: 0000000000000000
> > > > [   23.331106] FS:  00007f56532ad740(0000) GS:ffff88813bc00000(0000)
> > > > knlGS:0000000000000000
> > > > [   23.332162] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [   23.333011] CR2: 000000000064d448 CR3: 0000000101fd2000 CR4: 00000000000006f0
> > > > [   23.334051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [   23.334876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [   23.335777] Call Trace:
> > > > [   23.336027]  inet_csk_destroy_sock+0x4f/0x120
> > > > [   23.336542]  tcp_rcv_state_process+0xcee/0x1240
> > > > [   23.337010]  tcp_v4_do_rcv+0xb2/0x1e0
> > > > [   23.337541]  __release_sock+0x5c/0xf0
> > > > [   23.337956]  __tcp_close+0x1cb/0x4c0
> > > > [   23.338351]  tcp_close+0x20/0x70
> > > > [   23.338875]  inet_release+0x3c/0x70
> > > > [   23.339466]  __sock_release+0x37/0xa0
> > > > [   23.340003]  sock_close+0x14/0x20
> > > > [   23.340580]  __fput+0xb1/0x260
> > > > [   23.341091]  task_work_run+0x59/0xa0
> > > > [   23.341685]  exit_to_user_mode_prepare+0x152/0x160
> > > > [   23.342213]  syscall_exit_to_user_mode+0x38/0x240
> > > > [   23.342758]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [   23.343250] RIP: 0033:0x7f565348b797
> > > > [   23.343702] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00
> > > > 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00
> > > > 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb
> > > > ff ff
> > > > [   23.345817] RSP: 002b:00007ffd178c1b68 EFLAGS: 00000246 ORIG_RAX:
> > > > 0000000000000003
> > > > [   23.346972] RAX: 0000000000000000 RBX: 00007ffd178c1be0 RCX: 00007f565348b797
> > > > [   23.348069] RDX: 0000000000000078 RSI: 00007ffd178c1ae0 RDI: 0000000000000008
> > > > [   23.349139] RBP: 00007ffd178c1cd0 R08: 0000000000000017 R09: 0000000000000016
> > > > [   23.350210] R10: 0000000000000016 R11: 0000000000000246 R12: 00007ffd178c1c20
> > > > [   23.351442] R13: 0000000000000016 R14: 000000000000000f R15: 0000000000000017
> > > > [   23.352500] irq event stamp: 1775071
> > > > [   23.352999] hardirqs last  enabled at (1775079):
> > > > [<ffffffff810d0dbf>] console_unlock+0x48f/0x590
> > > > [   23.354360] hardirqs last disabled at (1775088):
> > > > [<ffffffff810d0d27>] console_unlock+0x3f7/0x590
> > > > [   23.355748] softirqs last  enabled at (1775026):
> > > > [<ffffffff81a00ebf>] asm_call_irq_on_stack+0xf/0x20
> > > > [   23.357071] softirqs last disabled at (1775021):
> > > > [<ffffffff81a00ebf>] asm_call_irq_on_stack+0xf/0x20
> > > > [   23.358497] ---[ end trace 1fe2d145c0b7718d ]---
> > > >
> > > > When trying to bisect, it turns out we get this warning even with
> > > > older versions of bpf-next tree, which didn't seem to trigger this
> > > > warning before. One thing that seems to have changed in our CI setup
> > > > is that we went from 4 CPU VM to 2 CPU VM (due to Travis CI
> > > > limitations). So that might have changed some timings and
> > > > interactions.
> > > >
> > > > I'm dealing with other fallouts of this 4 to 2 CPU reduction and this
> > > > warning itself doesn't cause the failure in libbpf CI (but it will in
> > >
> > > I take this back, it does return error from test_maps due to "Failed
> > > sockmap unexpected timeout", so both CIs are broken, I've just
> > > disabled test_maps in libbpf CI for now. And I'm sending a work-around
> > > for EBUSY errors from hashmap on update/delete in a separate patch.
> >
> > OK thanks Andrii, I'll take a look. If its not obvious to me what the problem
> > is though it might slip into the new year as I'm on PTO shortly.
> >
>
> Yeah, no worries. I just wanted to give a heads up and get it off my
> head :) Don't lose your sleep over it.
>
> > Also FWIW I'm sitting on some patches to start pulling the tests from
> > test_sockmap into the more normal test progs framework. Then we can just
> > keep test_sockmap around as a tool to test different setups or move it
> > out of tree. As far as I understand it CI doesn't run test_sockmap at the
> > moment.
>
> CI that tests all incoming patches is currently red due to this and
> the EBUSY problem I worked around in [0]. Libbpf CI doesn't run
> test_maps currently, so it's good. I hesitated to disable test_maps
> for patches CI, though, but we can do that temporarily as well, if
> necessary.
>
>   [0] https://patchwork.kernel.org/project/netdevbpf/patch/20201223200652.3417075-1-andrii@xxxxxxxxxx/
>
> >
> > Thanks,
> > John
> >
> > >
> > > > kernel-patches CI), but it would be great if you can take a look at
> > > > this and see if we can get it fixed soon-ish to get CI to a green
> > > > state.
> > > >
> > > > Thanks!
> > > >
> > > >
> > > > -- Andrii