Re: REGRESSION: RIP: 0010:skb_release_data+0xb8/0x1e0 in vhost/tun


 



>
> On Mon, Mar 25, 2024 at 4:44 PM Igor Raits <igor@xxxxxxxxxxxx> wrote:
> >
> > Hello,
> >
> > On Fri, Mar 22, 2024 at 12:19 PM Igor Raits <igor@xxxxxxxxxxxx> wrote:
> > >
> > > Hi Jason,
> > >
> > > On Fri, Mar 22, 2024 at 9:39 AM Igor Raits <igor@xxxxxxxxxxxx> wrote:
> > > >
> > > > Hi Jason,
> > > >
> > > > On Fri, Mar 22, 2024 at 6:31 AM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 5:44 PM Igor Raits <igor@xxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hello Jason & others,
> > > > > >
> > > > > > On Wed, Mar 20, 2024 at 10:33 AM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 19, 2024 at 9:15 PM Igor Raits <igor@xxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hello Stefan,
> > > > > > > >
> > > > > > > > On Tue, Mar 19, 2024 at 2:12 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Mar 19, 2024 at 10:00:08AM +0100, Igor Raits wrote:
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > We have started to observe kernel crashes on 6.7.y kernels (atm we
> > > > > > > > > > have hit the issue 5 times on 6.7.5 and 6.7.10). On 6.6.9 where we
> > > > > > > > > > have nodes of cluster it looks stable. Please see stacktrace below. If
> > > > > > > > > > you need more information please let me know.
> > > > > > > > > >
> > > > > > > > > > We do not have a consistent reproducer but when we put some bigger
> > > > > > > > > > network load on a VM, the hypervisor's kernel crashes.
> > > > > > > > > >
> > > > > > > > > > Help is much appreciated! We are happy to test any patches.
> > > > > > > > >
> > > > > > > > > CCing Michael Tsirkin and Jason Wang for vhost_net.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [62254.167584] stack segment: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > > > > [62254.173450] CPU: 63 PID: 11939 Comm: vhost-11890 Tainted: G            E      6.7.10-1.gdc.el9.x86_64 #1
> > > > > > > > >
> > > > > > > > > Are there any patches in this kernel?
> > > > > > > >
> > > > > > > > Only one, unrelated to this part: removal of the pr_err("EEVDF scheduling
> > > > > > > > fail, picking leftmost\n"); line (reported a few months ago; removing it
> > > > > > > > was the suggested workaround until a proper fix lands).
> > > > > > >
> > > > > > > Btw, a bisection would help as well.
> > > > > >
> > > > > > In the end it seems we don't really have a "stable" setup, so
> > > > > > bisection looks to be useless, but we did find a few things in the meantime:
> > > > > >
> > > > > > 1. On 6.6.9 it crashes either with unexpected GSO type or usercopy:
> > > > > > Kernel memory exposure attempt detected from SLUB object
> > > > > > 'skbuff_head_cache'
> > > > >
> > > > > Do you have a full calltrace for this?
> > > >
> > > > I have shared it in one of the messages in this thread.
> > > > https://marc.info/?l=linux-virtualization&m=171085443512001&w=2
> > > >
> > > > > > 2. On 6.7.5, 6.7.10 and 6.8.1 it crashes with RIP:
> > > > > > 0010:skb_release_data+0xb8/0x1e0
> > > > >
> > > > > And for this?
> > > >
> > > > https://marc.info/?l=linux-netdev&m=171083870801761&w=2
> > > >
> > > > > > 3. It does NOT crash on 6.8.1 when VM does not have multi-queue setup
> > > > > >
> > > > > > Looks like the multi-queue setup (we have 2 interfaces × 3 virtio
> > > > > > queues for each) is causing problems as if we set only one queue for
> > > > > > each interface the issue is gone.
> > > > > > Maybe there is some race condition in __pfx_vhost_task_fn+0x10/0x10 or
> > > > > > somewhere around?
> > > > >
> > > > > I can't tell yet, but it seems not, because with 3 queue pairs we
> > > > > would have 3 vhost threads.
> > > > >
> > > > > > We have noticed that there are 3 of such functions
> > > > > > in the stacktrace that gave us hints about what we could try…
> > > > >
> > > > > Let's try to enable SLUB_DEBUG and KASAN to see if we can get
> > > > > something interesting.
> > > >
> > > > We were able to reproduce it even with 1 vhost queue... And now we
> > > > have slub_debug + kasan enabled, so hopefully I have more useful
> > > > data for you now. I have attached it for better readability.
> > >
> > > Looks like we have found a "stable" kernel, and that is 6.1.32. The
> > > 6.3.y series is broken, and we are testing 6.2.y now.
> > > My guess is it is related to the "virtio/vsock: replace virtio_vsock_pkt
> > > with sk_buff" change that landed around that time, but we are going to
> > > test, bisect and let you know more.
> >
> > So we have been trying to bisect it, but it is basically impossible
> > for us: the ice driver was quite broken for most of the release
> > cycle, so we have no networking on 99% of the builds and cannot test
> > such a setup.
> > More specifically, the bug was introduced between 6.2 and 6.3, but we
> > could not narrow it down much further. The last good commit we were
> > able to test was f18f9845f2f10d3d1fc63e4ad16ee52d2d9292fa, and after
> > 20 commits with no networking we gave up.
> >
> > If you have some suspicious commit(s) we could revert - happy to test.
>
> Here is the list of changes since f18f9845f2f10d3d1fc63e4ad16ee52d2d9292fa:
>
> cbfbfe3aee71 tun: prevent negative ifindex
> b2f8323364ab tun: add __exit annotations to module exit func tun_cleanup()
> 6231e47b6fad tun: avoid high-order page allocation for packet header
> 4d016ae42efb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> 59eeb2329405 drivers: net: prevent tun_build_skb() to exceed the packet size limit
> 35b1b1fd9638 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> ce7c7fef1473 net: tun: change tun_alloc_skb() to allow bigger paged allocations
> 9bc3047374d5 net: tun_chr_open(): set sk_uid from current_fsuid()
> 82b2bc279467 tun: Fix memory leak for detached NAPI queue.
> 6e98b09da931 Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> de4f5fed3f23 iov_iter: add iter_iovec() helper
> 438b406055cd tun: flag the device as supporting FMODE_NOWAIT
> de4287336794 Daniel Borkmann says:
> a096ccca6e50 tun: tun_chr_open(): correctly initialize socket uid
> 66c0e13ad236 drivers: net: turn on XDP features
>
> The commits that touch the datapath are:
>
> 6231e47b6fad tun: avoid high-order page allocation for packet header
> 59eeb2329405 drivers: net: prevent tun_build_skb() to exceed the packet size limit
> ce7c7fef1473 net: tun: change tun_alloc_skb() to allow bigger paged allocations
> 82b2bc279467 tun: Fix memory leak for detached NAPI queue.
> de4f5fed3f23 iov_iter: add iter_iovec() helper
>
> I assume you didn't use NAPI mode, so 82b2bc279467 ("tun: Fix memory
> leak for detached NAPI queue") shouldn't be relevant here.
>
> Any of the rest might be the bad commit, if this is caused by a change in tun itself.
>
> Btw, I vaguely remember that KASAN reports who allocated and who freed
> the object, but that information seems to be missing from your KASAN log.
>
> Thanks
>
> >
> > Thanks again.
> >
>
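Regarding the bisect dead-end above (no networking on most of the range because of the broken ice driver): this is the classic case for `git bisect skip`, or for `git bisect run` with a test script that returns exit code 125 for commits that cannot be built or tested. Below is a self-contained toy demonstration of that workflow; everything in it (the temp repo, the BROKEN marker file, the commit at which the "regression" lands) is synthetic, and on the real kernel tree the run script would instead build the kernel and boot-test it:

```shell
# Toy demo of `git bisect run` with automatic skipping of unbuildable
# commits (exit 125 = "cannot test this commit, skip it").
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

git commit -q --allow-empty -m "init"   # known-good starting point
for i in 1 2 3 4 5 6 7 8; do
    echo "change $i" > file
    # pretend commit 4 does not build (e.g. no networking, untestable)
    if [ "$i" -eq 4 ]; then touch BROKEN; else rm -f BROKEN; fi
    # the synthetic regression lands at commit 6
    if [ "$i" -ge 6 ]; then echo bug > status; else echo ok > status; fi
    git add -A
    git commit -qm "commit $i"
done

# bad = HEAD, good = root commit
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
git bisect run sh -c '
    [ -e BROKEN ] && exit 125     # unbuildable: skip this commit
    grep -q bug status && exit 1  # regression reproduces: bad
    exit 0' >/dev/null            # clean run: good

FIRST_BAD=$(git show -s --format=%s refs/bisect/bad)
echo "first bad commit: $FIRST_BAD"
cd / && rm -rf "$repo"
```

As long as the skipped commits are not immediately adjacent to the regression point, git can still narrow the range down to a single first-bad commit around the holes.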

Hello

We have one observation: whether the error occurs depends on the ring
buffer size of the physical network cards. We have two Intel E810
cards bonded via two interfaces (em1 + p3p2, ice driver) into a single
bond0. The bond0 is then Linux-bridged and/or OVS-switched to the VMs
via tun interfaces (both switching solutions have the same problem).
The VMs are qemu-kvm instances using vhost/virtio-net.

We see:
1/ The issue is triggered almost instantaneously when the tx/rx ring
buffers are set to 2048 (our default):
ethtool -G em1 rx 2048 tx 2048
ethtool -G p3p1 rx 2048 tx 2048

2/ A similar issue is triggered when the tx/rx ring buffers are set to
4096: the host does not crash immediately, but a trace shows up soon
after, and later the host gets into memory pressure and crashes.
ethtool -G em1 rx 4096 tx 4096
ethtool -G p3p1 rx 4096 tx 4096
See the attached ring_4096.kasan.txt (vanilla 6.8.1 with KASAN enabled)
and ring_4096.txt (vanilla 6.8.1 without KASAN).

3/ The system is stable, or we just cannot trigger the issue, if the
ring buffers are >= 6144:
ethtool -G em1 rx 7120 tx 7120
ethtool -G p3p1 rx 7120 tx 7120
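For anyone trying to reproduce this: it is worth verifying that the requested ring sizes actually took effect, since drivers may clamp them. The "Current hardware settings" section of `ethtool -g` shows the live values; a minimal parsing sketch follows (the sample text is illustrative, on a real host pipe the output of `ethtool -g em1` in instead):

```shell
# Extract the current RX/TX ring sizes from `ethtool -g`-style output.
# The sample below is illustrative; on a real host use:  ethtool -g em1
sample='Ring parameters for em1:
Pre-set maximums:
RX:             8160
TX:             8160
Current hardware settings:
RX:             2048
TX:             2048'

# Only read RX:/TX: lines after the "Current hardware settings" header,
# so the pre-set maximums above it are ignored.
rx=$(printf '%s\n' "$sample" | awk '/Current hardware settings/{c=1} c && /^RX:/{print $2; exit}')
tx=$(printf '%s\n' "$sample" | awk '/Current hardware settings/{c=1} c && /^TX:/{print $2; exit}')
echo "current rings: rx=$rx tx=$tx"
```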

Could it be influenced by the rate of dropped packets in the ring buffer?

# for i in em1 p3p1; do ethtool -S ${i} | grep dropped.nic; done
     rx_dropped.nic: 158225
     rx_dropped.nic: 74285
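A single counter snapshot does not say much; what would correlate with the crashes is the drop *rate*. A trivial sketch of the computation, with hard-coded illustrative sample values (on a real host each sample would come from an `ethtool -S` read, taken some fixed interval apart):

```shell
# Dropped-packets-per-second between two counter samples.
# On a real host each sample would come from something like:
#   ethtool -S em1 | awk -F': *' '/rx_dropped.nic/{print $2}'
prev=158225      # illustrative sample at t0
curr=158750      # illustrative sample 30 seconds later
interval=30      # seconds between the two samples
rate=$(( (curr - prev) / interval ))
echo "rx_dropped.nic rate: ${rate}/s"
```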

Best,
Jaroslav Pulchart
[ 1709.893147] ==================================================================
[ 1709.901084] BUG: KASAN: use-after-free in inet_gro_complete+0x352/0x3d0
[ 1709.908437] Read of size 2 at addr ffff88a02973e0d0 by task swapper/49/0
[ 1709.915684] 
[ 1709.917714] CPU: 49 PID: 0 Comm: swapper/49 Tainted: G            E      6.8.1-1.gdc.el9.kasan.x86_64 #1
[ 1709.927727] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS 2.14.1 12/17/2023
[ 1709.936226] Call Trace:
[ 1709.939364]  <IRQ>
[ 1709.941930]  dump_stack_lvl+0x33/0x50
[ 1709.946103]  print_address_description.constprop.0+0x2c/0x3e0
[ 1709.952608]  ? inet_gro_complete+0x352/0x3d0
[ 1709.957392]  print_report+0xb5/0x270
[ 1709.961850]  ? kasan_addr_to_slab+0x9/0xa0
[ 1709.966496]  kasan_report+0xac/0xe0
[ 1709.970501]  ? inet_gro_complete+0x352/0x3d0
[ 1709.975300]  inet_gro_complete+0x352/0x3d0
[ 1709.979909]  ? ice_alloc_rx_bufs+0x435/0x860 [ice]
[ 1709.985331]  napi_gro_complete.constprop.0+0x3cd/0x4c0
[ 1709.991094]  napi_gro_flush+0x1ad/0x370
[ 1709.995451]  napi_complete_done+0x433/0x710
[ 1710.000259]  ? __pfx_napi_complete_done+0x10/0x10
[ 1710.005509]  ice_napi_poll+0x23e/0x8b0 [ice]
[ 1710.010391]  ? __pfx_ice_napi_poll+0x10/0x10 [ice]
[ 1710.015790]  ? __pfx_load_balance+0x10/0x10
[ 1710.020587]  __napi_poll+0xa2/0x500
[ 1710.024581]  ? _raw_spin_trylock+0x6e/0x120
[ 1710.029267]  net_rx_action+0x421/0xb80
[ 1710.033551]  ? __pfx_net_rx_action+0x10/0x10
[ 1710.038331]  ? __pfx_sched_clock_cpu+0x10/0x10
[ 1710.043372]  ? _raw_spin_lock+0x81/0xe0
[ 1710.047711]  __do_softirq+0x19b/0x597
[ 1710.051977]  __irq_exit_rcu+0x125/0x170
[ 1710.056459]  common_interrupt+0x7d/0xa0
[ 1710.060796]  </IRQ>
[ 1710.063431]  <TASK>
[ 1710.066016]  asm_common_interrupt+0x22/0x40
[ 1710.071098] RIP: 0010:cpuidle_enter_state+0x176/0x300
[ 1710.076694] Code: 48 83 3c 03 00 0f 84 2c 01 00 00 83 e9 01 73 e4 48 83 c4 18 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc fb 45 85 e4 <0f> 89 5f ff ff ff 4b 8d 44 6d 00 48 c7 43 18 00 00 00 00 48 c1 e0
[ 1710.096873] RSP: 0018:ffff889002017d90 EFLAGS: 00000202
[ 1710.102657] RAX: 0000000000000000 RBX: ffff88b003622800 RCX: 000000000000001f
[ 1710.110321] RDX: 1ffff112004c8821 RSI: 0000000000000031 RDI: ffff889002644108
[ 1710.117982] RBP: ffffffffb0a77140 R08: 0000018e1d929ae5 R09: ffffed12004c7df5
[ 1710.125658] R10: ffff88900263efab R11: 071c71c71c71c71c R12: 0000000000000001
[ 1710.133332] R13: 0000000000000001 R14: 0000018e1d929ae5 R15: 0000000000000000
[ 1710.140952]  cpuidle_enter+0x4a/0xa0
[ 1710.145209]  ? tick_nohz_idle_stop_tick+0x108/0x290
[ 1710.150553]  cpuidle_idle_call+0x174/0x1d0
[ 1710.155360]  ? __pfx_cpuidle_idle_call+0x10/0x10
[ 1710.160430]  ? irqtime_account_process_tick+0x1e1/0x340
[ 1710.166360]  do_idle+0xe1/0x150
[ 1710.170127]  cpu_startup_entry+0x51/0x60
[ 1710.174482]  start_secondary+0x209/0x280
[ 1710.178830]  ? __pfx_start_secondary+0x10/0x10
[ 1710.183732]  ? soft_restart_cpu+0x15/0x15
[ 1710.188510]  secondary_startup_64_no_verify+0x184/0x18b
[ 1710.194424]  </TASK>
[ 1710.197090] 
[ 1710.199090] The buggy address belongs to the physical page:
[ 1710.205139] page:ffffea0080a5cf80 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x202973e
[ 1710.215053] flags: 0x57ffffc0000000(node=1|zone=2|lastcpupid=0x1fffff)
[ 1710.222219] page_type: 0xffffffff()
[ 1710.226182] raw: 0057ffffc0000000 ffffea0043c8c188 ffff889002649650 0000000000000000
[ 1710.234425] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1710.242687] page dumped because: kasan: bad access detected
[ 1710.248801] 
[ 1710.250748] Memory state around the buggy address:
[ 1710.256379]  ffff88a02973df80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1710.264067]  ffff88a02973e000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 1710.271771] >ffff88a02973e080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 1710.279675]                                                  ^
[ 1710.285978]  ffff88a02973e100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 1710.293870]  ffff88a02973e180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 1710.301579] ==================================================================
[ 1710.309286] Disabling lock debugging due to kernel taint
...
[ 2362.339090] general protection fault, probably for non-canonical address 0xdffffc000000885c: 0000 [#1] PREEMPT SMP KASAN NOPTI
[ 2362.347828] ------------[ cut here ]------------
[ 2362.350474] KASAN: probably user-memory-access in range [0x00000000000442e0-0x00000000000442e7]
[ 2362.355092] WARNING: CPU: 19 PID: 0 at kernel/sched/core.c:3389 set_task_cpu+0x62e/0x7c0
[ 2362.363786] CPU: 20 PID: 0 Comm: swapper/20 Tainted: G    B       E      6.8.1-1.gdc.el9.kasan.x86_64 #1
[ 2362.363839] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __schedule+0x14f0/0x1760


[ 1415.478903] Unloaded tainted modules: fjes(E):2 padlock_aes(E):3
[ 1415.529527] CPU: 32 PID: 11044 Comm: vhost-10998 Tainted: G            E      6.8.1-1.gdc.el9.x86_64 #1
[ 1415.539124] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS 2.14.1 12/17/2023
[ 1415.546995] RIP: 0010:kmem_cache_free+0x301/0x3d0
[ 1415.551921] Code: fd ff ff 80 3d 3a 9f f5 01 00 0f 85 bf fe ff ff 48 c7 c6 50 4d a7 8e 48 c7 c7 40 69 fd 8e c6 05 1f 9f f5 01 01 e8 1f 44 d3 ff <0f> 0b e9 9e fe ff ff 48 8d 42 ff e9 63 fd ff ff 4c 8d 68 ff e9 e3
[ 1415.571103] RSP: 0018:ffffab12de55bc80 EFLAGS: 00010282
[ 1415.576548] RAX: 0000000000000000 RBX: ffff91d2b4dd9800 RCX: 0000000000000000
[ 1415.583914] RDX: ffff91b27fcad800 RSI: ffff91b27fca0a40 RDI: ffff91b27fca0a40
[ 1415.591268] RBP: ffffab12de55bcc8 R08: 0000000000000000 R09: 00000000ffff7fff
[ 1415.598631] R10: ffffab12de55bb20 R11: ffffffff8fbe2968 R12: ffff91d334dd9800
[ 1415.606078] R13: 0000000000000000 R14: 0000000000000001 R15: 00000000000064e4
[ 1415.613445] FS:  00007f817fe32f80(0000) GS:ffff91b27fc80000(0000) knlGS:0000000000000000
[ 1415.621767] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1415.627747] CR2: 000000c1ad59b000 CR3: 0000014035e20001 CR4: 0000000000770ef0
[ 1415.635123] PKRU: 55555554
[ 1415.638069] Call Trace:
[ 1415.640757]  <TASK>
[ 1415.643098]  ? __warn+0x80/0x130
[ 1415.646560]  ? kmem_cache_free+0x301/0x3d0
[ 1415.650894]  ? report_bug+0x195/0x1a0
[ 1415.654793]  ? prb_read_valid+0x17/0x20
[ 1415.658868]  ? handle_bug+0x3c/0x70
[ 1415.662592]  ? exc_invalid_op+0x14/0x70
[ 1415.666665]  ? asm_exc_invalid_op+0x16/0x20
[ 1415.671089]  ? kmem_cache_free+0x301/0x3d0
[ 1415.675417]  ? skb_release_data+0x107/0x1e0
[ 1415.679843]  tun_do_read+0x68/0x1f0 [tun]
[ 1415.684105]  tun_recvmsg+0x7e/0x160 [tun]
[ 1415.688350]  handle_rx+0x3ab/0x750 [vhost_net]
[ 1415.693028]  ? init_numa_balancing+0xd7/0x1e0
[ 1415.697623]  vhost_worker+0x42/0x70 [vhost]
[ 1415.702042]  vhost_task_fn+0x4b/0xb0
[ 1415.705856]  ? finish_task_switch.isra.0+0x8f/0x2a0
[ 1415.710966]  ? __pfx_vhost_task_fn+0x10/0x10
[ 1415.715460]  ? __pfx_vhost_task_fn+0x10/0x10
[ 1415.719953]  ret_from_fork+0x2d/0x50
[ 1415.723740]  ? __pfx_vhost_task_fn+0x10/0x10
[ 1415.728215]  ret_from_fork_asm+0x1b/0x30
[ 1415.732347]  </TASK>
[ 1415.734730] ---[ end trace 0000000000000000 ]---
