Igor Raits wrote:
> Hello Willem,
>
> On Tue, Mar 19, 2024 at 2:08 PM Willem de Bruijn
> <willemdebruijn.kernel@xxxxxxxxx> wrote:
> >
> > Igor Raits wrote:
> > > Hello,
> > >
> > > We have started to observe kernel crashes on 6.7.y kernels (atm we
> > > have hit the issue 5 times on 6.7.5 and 6.7.10). On 6.6.9 where we
> > > have nodes of cluster it looks stable. Please see stacktrace below. If
> > > you need more information please let me know.
> > >
> > > We do not have a consistent reproducer but when we put some bigger
> > > network load on a VM, the hypervisor's kernel crashes.
> > >
> > > Help is much appreciated! We are happy to test any patches.
> > >
> > > [62254.167584] stack segment: 0000 [#1] PREEMPT SMP NOPTI
> >
> > Did you miss the first part of the Oops?
>
> Actually I copied it as-is from our log system. As it is a physical
> server, such logs are sent via netconsole to another server. This is
> the first line I see in the log in the time segment.
>
> > >
> > > [62254.173450] CPU: 63 PID: 11939 Comm: vhost-11890 Tainted: G
> > > E 6.7.10-1.gdc.el9.x86_64 #1
> > > [62254.183743] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS
> > > 2.14.1 12/17/2023
> > > [62254.192083] RIP: 0010:skb_release_data+0xb8/0x1e0
> > > [62254.197357] Code: 48 83 c3 01 39 d8 7e 54 48 89 d8 48 c1 e0 04 41
> > > 80 7d 7e 00 49 8b 6c 04 30 79 0f 44 89 f6 48 89 ef e8 4c e4 ff ff 84
> > > c0 75 d0 <48> 8b 45 08 a8 01 0f 85 09 01 00 00 e9 d9 00 00 00 0f 1f 44
> > > 00 00
> > > [62254.217013] RSP: 0018:ffffa975a0247ba8 EFLAGS: 00010206
> > > [62254.222692] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000785
> > > [62254.230263] RDX: 0000000000000016 RSI: 0000000000000002 RDI: ffff989862b32b00
> > > [62254.237878] RBP: 4f2b318c69a8b0f9 R08: 000000000001fe4d R09: 000000000000003a
> > > [62254.245417] R10: 0000000000000000 R11: 0000000000001736 R12: ffff9880b819aec0
> > > [62254.252963] R13: ffff989862b32b00 R14: 0000000000000000 R15: 0000000000000002
> > > [62254.260591] FS: 00007f6cf388bf80(0000) GS:ffff98b85fbc0000(0000)
> > > knlGS:0000000000000000
> > > [62254.269061] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [62254.275170] CR2: 000000c002236020 CR3: 000000387d37a002 CR4: 0000000000770ef0
> > > [62254.282733] PKRU: 55555554
> > > [62254.285911] Call Trace:
> > > [62254.288884] <TASK>
> > > [62254.291549] ? die+0x33/0x90
> > > [62254.294769] ? do_trap+0xe0/0x110
> > > [62254.298405] ? do_error_trap+0x65/0x80
> > > [62254.302471] ? exc_stack_segment+0x35/0x50
> > > [62254.306884] ? asm_exc_stack_segment+0x22/0x30
> > > [62254.311637] ? skb_release_data+0xb8/0x1e0
> > > [62254.316047] kfree_skb_list_reason+0x6d/0x210
> > > [62254.320697] ? free_unref_page_commit+0x80/0x2f0
> > > [62254.325700] ? free_unref_page+0xe9/0x130
> > > [62254.330013] skb_release_data+0xfc/0x1e0
> > > [62254.334261] consume_skb+0x45/0xd0
> > > [62254.338077] tun_do_read+0x68/0x1f0 [tun]
> > > [62254.342414] tun_recvmsg+0x7e/0x160 [tun]
> > > [62254.346696] handle_rx+0x3ab/0x750 [vhost_net]
> > > [62254.351488] vhost_worker+0x42/0x70 [vhost]
> > > [62254.355934] vhost_task_fn+0x4b/0xb0
> >
> > Neither tun nor vhost_net saw significant changes between the two
> > reported kernels.
> >
> > $ git log --oneline v6.6..v6.7 -- drivers/net/tun.c drivers/vhost/net.c | wc -l
> > 0
> >
> > $ git log --oneline linux/v6.6.9..linux/v6.7.5 -- drivers/net/tun.c drivers/vhost/net.c
> > 6438382dd9f8 tun: add missing rx stats accounting in tun_xdp_act
> > 4efd09da0d49 tun: fix missing dropped counter in tun_xdp_act
> >
> > So the cause is likely in the code that generated the skb or something
> > that modified it along the way.
> >
> > It could be helpful if it is possible to bisect further. Though odds
> > are that the issue is between v6.6 and v6.7, not introduced in the
> > stable backports after that. So it is a large target.
>
> Yeah, as I replied later to my original message - we actually also see
> the issue on 6.6.9 as well but it looks slightly different.
>
> Actually while writing reply got 6.6.9 crashed too:
>
> [13330.391004] tun: unexpected GSO type: 0x4ec1c942, gso_size 20948,
> hdr_len 3072

This looks like memory corruption

> > Getting the exact line in skb_release_data that causes the Oops
> > would be helpful too, e.g.,
> >
> > gdb vmlinux
> > list *(skb_release_data+0xb8)
>
> Unfortunately we do not collect kdumps so this is not going to be easy
> :( We will investigate the possibility of getting the dump though.

No need for a kdump. As long as you have the vmlinux of the kernel.
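
To make the vmlinux-only route concrete, here is a rough sketch that takes the symbol+offset from the RIP line of the report and prints a non-interactive form of the gdb command suggested earlier. The variable names and the "vmlinux" path are placeholders of mine, not anything from your setup. It assumes an unstripped vmlinux with debug info, built from exactly the same source and config as the crashing 6.7.10-1.gdc.el9.x86_64 kernel.

```shell
#!/bin/sh
# RIP line copied verbatim from the Oops quoted above.
rip_line='[62254.192083] RIP: 0010:skb_release_data+0xb8/0x1e0'

# Keep "symbol+offset"; drop the code segment prefix and the "/size" suffix.
sym_off=$(printf '%s\n' "$rip_line" |
    sed -n 's|.*RIP: [0-9a-f]*:\([A-Za-z0-9_.]*+0x[0-9a-f]*\)/.*|\1|p')

# Build the batch-mode gdb invocation. "vmlinux" is a placeholder path
# (assumption): point it at the debug image matching the crashed kernel.
gdb_cmd="gdb -batch -ex 'list *($sym_off)' vmlinux"
printf '%s\n' "$gdb_cmd"
```

Running the printed command on a machine that has the matching vmlinux should show the source lines around the faulting instruction. If a kernel source tree is at hand, scripts/faddr2line takes the same vmlinux plus the full skb_release_data+0xb8/0x1e0 string and should give an equivalent answer.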