Re: REGRESSION: RIP: 0010:skb_release_data+0xb8/0x1e0 in vhost/tun

Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> · Tue, 19 Mar 2024 09:08:43 -0400

Igor Raits wrote:
> Hello,
> 
> We have started to observe kernel crashes on 6.7.y kernels (atm we
> have hit the issue 5 times on 6.7.5 and 6.7.10). On 6.6.9 where we
> have nodes of cluster it looks stable. Please see stacktrace below. If
> you need more information please let me know.
> 
> We do not have a consistent reproducer but when we put some bigger
> network load on a VM, the hypervisor's kernel crashes.
> 
> Help is much appreciated! We are happy to test any patches.
> 
> [62254.167584] stack segment: 0000 [#1] PREEMPT SMP NOPTI

Did you miss the first part of the Oops?

> [62254.173450] CPU: 63 PID: 11939 Comm: vhost-11890 Tainted: G
>    E      6.7.10-1.gdc.el9.x86_64 #1
> [62254.183743] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS
> 2.14.1 12/17/2023
> [62254.192083] RIP: 0010:skb_release_data+0xb8/0x1e0
> [62254.197357] Code: 48 83 c3 01 39 d8 7e 54 48 89 d8 48 c1 e0 04 41
> 80 7d 7e 00 49 8b 6c 04 30 79 0f 44 89 f6 48 89 ef e8 4c e4 ff ff 84
> c0 75 d0 <48> 8b 45 08 a8 01 0f 85 09 01 00 00 e9 d9 00 00 00 0f 1f 44
> 00 00
> [62254.217013] RSP: 0018:ffffa975a0247ba8 EFLAGS: 00010206
> [62254.222692] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000785
> [62254.230263] RDX: 0000000000000016 RSI: 0000000000000002 RDI: ffff989862b32b00
> [62254.237878] RBP: 4f2b318c69a8b0f9 R08: 000000000001fe4d R09: 000000000000003a
> [62254.245417] R10: 0000000000000000 R11: 0000000000001736 R12: ffff9880b819aec0
> [62254.252963] R13: ffff989862b32b00 R14: 0000000000000000 R15: 0000000000000002
> [62254.260591] FS:  00007f6cf388bf80(0000) GS:ffff98b85fbc0000(0000)
> knlGS:0000000000000000
> [62254.269061] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [62254.275170] CR2: 000000c002236020 CR3: 000000387d37a002 CR4: 0000000000770ef0
> [62254.282733] PKRU: 55555554
> [62254.285911] Call Trace:
> [62254.288884]  <TASK>
> [62254.291549]  ? die+0x33/0x90
> [62254.294769]  ? do_trap+0xe0/0x110
> [62254.298405]  ? do_error_trap+0x65/0x80
> [62254.302471]  ? exc_stack_segment+0x35/0x50
> [62254.306884]  ? asm_exc_stack_segment+0x22/0x30
> [62254.311637]  ? skb_release_data+0xb8/0x1e0
> [62254.316047]  kfree_skb_list_reason+0x6d/0x210
> [62254.320697]  ? free_unref_page_commit+0x80/0x2f0
> [62254.325700]  ? free_unref_page+0xe9/0x130
> [62254.330013]  skb_release_data+0xfc/0x1e0
> [62254.334261]  consume_skb+0x45/0xd0
> [62254.338077]  tun_do_read+0x68/0x1f0 [tun]
> [62254.342414]  tun_recvmsg+0x7e/0x160 [tun]
> [62254.346696]  handle_rx+0x3ab/0x750 [vhost_net]
> [62254.351488]  vhost_worker+0x42/0x70 [vhost]
> [62254.355934]  vhost_task_fn+0x4b/0xb0

Neither tun nor vhost_net saw significant changes between the two
reported kernels.

    $ git log --oneline v6.6..v6.7 -- drivers/net/tun.c drivers/vhost/net.c | wc -l
    0

    $ git log --oneline linux/v6.6.9..linux/v6.7.5 -- drivers/net/tun.c drivers/vhost/net.c
    6438382dd9f8 tun: add missing rx stats accounting in tun_xdp_act
    4efd09da0d49 tun: fix missing dropped counter in tun_xdp_act

So the cause is likely in the code that generated the skb or something
that modified it along the way.

It would help if you could bisect further. The odds are that the issue
was introduced between v6.6 and v6.7, not in the stable backports after
that, so it is a large target.
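
A rough sketch of how such a bisect could be started, assuming a mainline
git tree that has both the v6.6 and v6.7 tags. The helper name
start_bisect is made up for illustration. Since there is no consistent
reproducer, a kernel probably needs to survive the network load for a
good while before it can be marked good.

```shell
# Hypothetical helper, for illustration only: start a bisect between the
# last release believed good (v6.6) and the first believed bad (v6.7).
start_bisect() {
    good=v6.6
    bad=v6.7

    # Refuse to start unless both tags resolve to commits in this tree.
    for tag in "$good" "$bad"; do
        if ! git rev-parse --verify --quiet "$tag^{commit}" >/dev/null 2>&1; then
            echo "tag $tag not found; run this inside a mainline kernel tree" >&2
            return 1
        fi
    done

    # git bisect start takes the bad revision first, then the good one.
    git bisect start "$bad" "$good"
}

# After each build + boot + load test, record the verdict with one of:
#   git bisect good    # survived the load
#   git bisect bad     # crashed
# and restore the tree when finished with:
#   git bisect reset
```

Call start_bisect from the top of the kernel tree, then build and test
each revision that git checks out.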

Getting the exact line in skb_release_data that causes the Oops
would be helpful too, e.g.,

    $ gdb vmlinux
    (gdb) list *(skb_release_data+0xb8)
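
As an alternative to gdb, scripts/faddr2line in the kernel source tree
can resolve the same symbol+offset. A minimal sketch, assuming it is run
from the top of a kernel tree and that vmlinux comes from the exact
build that crashed and carries debug info. The helper name
resolve_oops_line is made up for illustration.

```shell
# Hypothetical helper, for illustration only: map the faulting RIP from
# the Oops to a source line using the kernel tree's scripts/faddr2line.
resolve_oops_line() {
    vmlinux=${1:-vmlinux}
    # Symbol, offset and size exactly as printed on the RIP line.
    sym='skb_release_data+0xb8/0x1e0'

    if [ ! -f "$vmlinux" ]; then
        echo "no vmlinux at $vmlinux (it must match the crashed build)" >&2
        return 1
    fi
    if [ ! -x ./scripts/faddr2line ]; then
        echo "scripts/faddr2line not found; run from the kernel source tree" >&2
        return 1
    fi

    ./scripts/faddr2line "$vmlinux" "$sym"
}
```

For example: resolve_oops_line ./vmlinux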