Re: Warnings in test_maps selftests in test_sockmap parts

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 22 Dec 2020 11:47:14 -0800

On Tue, Dec 22, 2020 at 11:44 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> Hi John and Martin,
>
> I've noticed that all our CIs (libbpf and kernel-patches) started to
> emit kernel warning when running test_maps test. I've narrowed it down
> to test_sockmap() part of it. If I disable it, the warning goes away.
> The warning looks like this:
>
> Failed sockmap unexpected timeout
> [   23.316720] ------------[ cut here ]------------
> [   23.317615] WARNING: CPU: 0 PID: 93 at net/core/stream.c:208
> sk_stream_kill_queues+0x111/0x120
> [   23.319194] Modules linked in:
> [   23.319698] CPU: 0 PID: 93 Comm: test_maps Not tainted
> 5.10.0-rc7-02263-g0e12c0271887-dirty #51
> [   23.321297] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.3-1.el7.centos 04/01/2014
> [   23.322756] RIP: 0010:sk_stream_kill_queues+0x111/0x120
> [   23.323668] Code: 00 85 c0 75 1f 85 f6 75 21 5b 5d c3 48 89 df e8
> d5 fd fe ff 8b 83 68 02 00 00 8b b3 20 02 00 00 85 c0 74 e1 0f 0b 85
> f6 74 df <0f> 0b 5b 5d c3 0f 0b eb ac 66 0f 1f 44 00 00 0f 1f 44 00 00
> 41 57
> [   23.326251] RSP: 0018:ffffc900001d7d80 EFLAGS: 00010202
> [   23.326863] RAX: 0000000000000000 RBX: ffff88813a44a280 RCX: 0000000000000000
> [   23.327597] RDX: 0000000000000001 RSI: 0000000000000aec RDI: ffff88813a44a3d0
> [   23.328375] RBP: ffff88813a44a3d0 R08: 0000000000000000 R09: 0000000000000001
> [   23.329300] R10: 0000000000000050 R11: 0000000000000000 R12: ffff888100362200
> [   23.330281] R13: ffff88802564b8e0 R14: ffff8881006175f0 R15: 0000000000000000
> [   23.331106] FS:  00007f56532ad740(0000) GS:ffff88813bc00000(0000)
> knlGS:0000000000000000
> [   23.332162] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   23.333011] CR2: 000000000064d448 CR3: 0000000101fd2000 CR4: 00000000000006f0
> [   23.334051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   23.334876] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   23.335777] Call Trace:
> [   23.336027]  inet_csk_destroy_sock+0x4f/0x120
> [   23.336542]  tcp_rcv_state_process+0xcee/0x1240
> [   23.337010]  tcp_v4_do_rcv+0xb2/0x1e0
> [   23.337541]  __release_sock+0x5c/0xf0
> [   23.337956]  __tcp_close+0x1cb/0x4c0
> [   23.338351]  tcp_close+0x20/0x70
> [   23.338875]  inet_release+0x3c/0x70
> [   23.339466]  __sock_release+0x37/0xa0
> [   23.340003]  sock_close+0x14/0x20
> [   23.340580]  __fput+0xb1/0x260
> [   23.341091]  task_work_run+0x59/0xa0
> [   23.341685]  exit_to_user_mode_prepare+0x152/0x160
> [   23.342213]  syscall_exit_to_user_mode+0x38/0x240
> [   23.342758]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   23.343250] RIP: 0033:0x7f565348b797
> [   23.343702] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00
> 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 f3 fb
> ff ff
> [   23.345817] RSP: 002b:00007ffd178c1b68 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000003
> [   23.346972] RAX: 0000000000000000 RBX: 00007ffd178c1be0 RCX: 00007f565348b797
> [   23.348069] RDX: 0000000000000078 RSI: 00007ffd178c1ae0 RDI: 0000000000000008
> [   23.349139] RBP: 00007ffd178c1cd0 R08: 0000000000000017 R09: 0000000000000016
> [   23.350210] R10: 0000000000000016 R11: 0000000000000246 R12: 00007ffd178c1c20
> [   23.351442] R13: 0000000000000016 R14: 000000000000000f R15: 0000000000000017
> [   23.352500] irq event stamp: 1775071
> [   23.352999] hardirqs last  enabled at (1775079):
> [<ffffffff810d0dbf>] console_unlock+0x48f/0x590
> [   23.354360] hardirqs last disabled at (1775088):
> [<ffffffff810d0d27>] console_unlock+0x3f7/0x590
> [   23.355748] softirqs last  enabled at (1775026):
> [<ffffffff81a00ebf>] asm_call_irq_on_stack+0xf/0x20
> [   23.357071] softirqs last disabled at (1775021):
> [<ffffffff81a00ebf>] asm_call_irq_on_stack+0xf/0x20
> [   23.358497] ---[ end trace 1fe2d145c0b7718d ]---
>
> When trying to bisect, it turns out we get this warning even with
> older versions of bpf-next tree, which didn't seem to trigger this
> warning before. One thing that seems to have changed in our CI setup
> is that we went from 4 CPU VM to 2 CPU VM (due to Travis CI
> limitations). So that might have changed some timings and
> interactions.
>
> I'm dealing with other fallouts of this 4 to 2 CPU reduction and this
> warning itself doesn't cause the failure in libbpf CI (but it will in

I take this back, it does return error from test_maps due to "Failed
sockmap unexpected timeout", so both CIs are broken, I've just
disabled test_maps in libbpf CI for now. And I'm sending a work-around
for EBUSY errors from hashmap on update/delete in a separate patch.

> kernel-patches CI), but it would be great if you can take a look at
> this and see if we can get it fixed soon-ish to get CI to a green
> state.
>
> Thanks!
>
>
> -- Andrii