6.9-rc2: Deadlock on unbinding network device from a driver (regression)

Hi,

After updating to 6.9-rc2 I can no longer unbind a device from the igc
driver. The "echo" into the "unbind" file hangs, and sysrq "w" shows this
call trace:

    [   84.553112] Call Trace:
    [   84.553118]  <TASK>
    [   84.553123]  __schedule+0x23b/0x5c0
    [   84.553134]  schedule+0x27/0xa0
    [   84.553142]  schedule_preempt_disabled+0x15/0x30
    [   84.553152]  __mutex_lock.constprop.0+0x34c/0x6a0
    [   84.553165]  unregister_netdevice_notifier+0x25/0xc0
    [   84.553178]  netdev_trig_deactivate+0x1e/0x60 [ledtrig_netdev]
    [   84.553195]  led_trigger_set+0x105/0x340
    [   84.553206]  led_classdev_unregister+0x4a/0x110
    [   84.553219]  release_nodes+0x3d/0xb0
    [   84.553229]  devres_release_all+0x8c/0xc0
    [   84.553238]  device_del+0x27a/0x3f0
    [   84.553248]  unregister_netdevice_many_notify+0x46a/0x6a0
    [   84.553260]  unregister_netdevice_queue+0xf0/0x130
    [   84.553271]  unregister_netdev+0x1c/0x30
    [   84.553280]  igc_remove+0xe3/0x1d0 [igc]
    [   84.553298]  pci_device_remove+0x3f/0xb0
    [   84.553308]  device_release_driver_internal+0x19f/0x200
    [   84.553320]  unbind_store+0xa1/0xb0
    [   84.553329]  kernfs_fop_write_iter+0x11f/0x200
    [   84.553341]  vfs_write+0x293/0x460
    [   84.553351]  ksys_write+0x6f/0xf0
    [   84.553360]  do_syscall_64+0x87/0x170
    [   84.553368]  ? syscall_exit_work+0xf3/0x120
    [   84.553378]  ? syscall_exit_to_user_mode+0x69/0x220
    [   84.553389]  ? do_syscall_64+0x96/0x170
    [   84.553397]  ? do_syscall_64+0x96/0x170
    [   84.553404]  ? do_syscall_64+0x96/0x170
    [   84.553412]  ? do_syscall_64+0x96/0x170
    [   84.553420]  ? __irq_exit_rcu+0x4b/0xb0
    [   84.553429]  entry_SYSCALL_64_after_hwframe+0x71/0x79
    [   84.553439] RIP: 0033:0x7b46ae7c5ee4
    [   84.553446] RSP: 002b:00007ffe580c2dd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
    [   84.553460] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007b46ae7c5ee4
    [   84.553474] RDX: 000000000000000d RSI: 00006458ac50b4b0 RDI: 0000000000000001
    [   84.553487] RBP: 00007ffe580c2e00 R08: 0000000000000073 R09: 0000000000000001
    [   84.553500] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000d
    [   84.553514] R13: 00006458ac50b4b0 R14: 00007b46ae8965c0 R15: 00007b46ae893f20
    [   84.553528]  </TASK>
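
One possible reading of this trace, and it is only an assumption since the
trace does not name the mutex: unregister_netdev() already holds rtnl_mutex,
and the LED teardown path then reaches unregister_netdevice_notifier(), which
tries to take the same non-recursive mutex again. The Python sketch below
models just that pattern. The function names are stand-ins borrowed from the
trace, not kernel code, and the timeout exists only so the demo terminates.

```python
import threading

# Stand-in for rtnl_mutex: threading.Lock is not re-entrant, like a kernel mutex.
rtnl_mutex = threading.Lock()

def unregister_netdevice_notifier(timeout=0.2):
    """Model of the innermost frame: needs the lock, gives up after `timeout`."""
    acquired = rtnl_mutex.acquire(timeout=timeout)
    if acquired:
        rtnl_mutex.release()
    return acquired

def unregister_netdev():
    """Model of the outer frame: takes the lock, then reaches the notifier path."""
    with rtnl_mutex:
        # device_del -> devres release -> LED unregister -> trigger deactivate
        return unregister_netdevice_notifier()

nested_ok = unregister_netdev()                  # lock already held by this task
standalone_ok = unregister_netdevice_notifier()  # lock is free
print(nested_ok, standalone_ok)
```

In the sketch the nested call times out while the standalone call succeeds.
The kernel mutex has no such timeout, so the task would simply stay blocked.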

Unbinding worked fine on 6.8.4.

A similar issue happens on a few other systems, including one with a Realtek
RTL8111/8168/8411 device, so it may not be specific to the igc driver
but to some common API (LED trigger?). The issue does not affect a system
with the e1000e driver.

Lockdep says:

    [   18.589322] ======================================================
    [   18.589329] WARNING: possible circular locking dependency detected
    [   18.589335] 6.9.0-rc2-1.qubes.fc32.x86_64 #378 Not tainted
    [   18.589340] ------------------------------------------------------
    [   18.589347] prepare-suspend/1145 is trying to acquire lock:
    [   18.589352] ffff897494bc37b8 (&led_cdev->trigger_lock){+.+.}-{3:3}, at: led_classdev_unregister+0x32/0x110
    [   18.589367]
    [   18.589367] but task is already holding lock:
    [   18.589373] ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
    [   18.589384]
    [   18.589384] which lock already depends on the new lock.
    [   18.589384]
    [   18.589391]
    [   18.589391] the existing dependency chain (in reverse order) is:
    [   18.589399]
    [   18.589399] -> #1 (rtnl_mutex){+.+.}-{3:3}:
    [   18.589407]        __mutex_lock+0xb2/0xbd0
    [   18.589413]        set_device_name+0x2d/0x140 [ledtrig_netdev]
    [   18.589423]        netdev_trig_activate+0x1a6/0x220 [ledtrig_netdev]
    [   18.589432]        led_trigger_set+0x20f/0x340
    [   18.589438]        led_trigger_register+0x16d/0x1a0
    [   18.589443]        do_one_initcall+0x6f/0x3d0
    [   18.589451]        do_init_module+0x60/0x240
    [   18.589459]        init_module_from_file+0x86/0xc0
    [   18.589465]        idempotent_init_module+0x126/0x2c0
    [   18.589471]        __x64_sys_finit_module+0x5a/0xb0
    [   18.589477]        do_syscall_64+0x96/0x190
    [   18.589482]        entry_SYSCALL_64_after_hwframe+0x71/0x79
    [   18.589490]
    [   18.589490] -> #0 (&led_cdev->trigger_lock){+.+.}-{3:3}:
    [   18.589498]        __lock_acquire+0x13e7/0x2180
    [   18.589505]        lock_acquire+0xd5/0x2f0
    [   18.589510]        down_write+0x2a/0xc0
    [   18.589515]        led_classdev_unregister+0x32/0x110
    [   18.589522]        devres_release_all+0xb5/0x110
    [   18.589530]        device_del+0x275/0x3f0
    [   18.589535]        unregister_netdevice_many_notify+0x5ba/0x870
    [   18.589543]        unregister_netdevice_queue+0xf3/0x130
    [   18.589549]        unregister_netdev+0x18/0x20
    [   18.589555]        igc_remove+0xe1/0x1c0 [igc]
    [   18.589566]        pci_device_remove+0x3b/0xb0
    [   18.589574]        device_release_driver_internal+0x1a5/0x210
    [   18.589581]        unbind_store+0x9d/0xb0
    [   18.589587]        kernfs_fop_write_iter+0x15b/0x210
    [   18.589595]        vfs_write+0x2bd/0x560
    [   18.589601]        ksys_write+0x71/0xf0
    [   18.589608]        do_syscall_64+0x96/0x190
    [   18.589614]        entry_SYSCALL_64_after_hwframe+0x71/0x79
    [   18.589620]
    [   18.589620] other info that might help us debug this:
    [   18.589620]
    [   18.589628]  Possible unsafe locking scenario:
    [   18.589628]
    [   18.589635]        CPU0                    CPU1
    [   18.589640]        ----                    ----
    [   18.589645]   lock(rtnl_mutex);
    [   18.589650]                                lock(&led_cdev->trigger_lock);
    [   18.589657]                                lock(rtnl_mutex);
    [   18.589664]   lock(&led_cdev->trigger_lock);
    [   18.589670]
    [   18.589670]  *** DEADLOCK ***
    [   18.589670]
    [   18.589676] 4 locks held by prepare-suspend/1145:
    [   18.589682]  #0: ffff8974873a7420 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x71/0xf0
    [   18.589693]  #1: ffff897495886288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x114/0x210
    [   18.589704]  #2: ffff8974820991b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x39/0x210
    [   18.589715]  #3: ffffffffb034dfa8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20
    [   18.589726]
    [   18.589726] stack backtrace:
    [   18.589731] CPU: 1 PID: 1145 Comm: prepare-suspend Not tainted 6.9.0-rc2-1.qubes.fc32.x86_64 #378
    [   18.589741] Hardware name: Xen HVM domU, BIOS 4.17.3 03/12/2024
    [   18.589748] Call Trace:
    [   18.589752]  <TASK>
    [   18.589755]  dump_stack_lvl+0x73/0xb0
    [   18.589761]  check_noncircular+0x148/0x160
    [   18.589766]  ? stack_trace_save+0x4a/0x70
    [   18.589773]  __lock_acquire+0x13e7/0x2180
    [   18.589780]  lock_acquire+0xd5/0x2f0
    [   18.589786]  ? led_classdev_unregister+0x32/0x110
    [   18.589793]  down_write+0x2a/0xc0
    [   18.589798]  ? led_classdev_unregister+0x32/0x110
    [   18.589804]  led_classdev_unregister+0x32/0x110
    [   18.589811]  devres_release_all+0xb5/0x110
    [   18.589816]  device_del+0x275/0x3f0
    [   18.589821]  unregister_netdevice_many_notify+0x5ba/0x870
    [   18.589829]  unregister_netdevice_queue+0xf3/0x130
    [   18.589835]  unregister_netdev+0x18/0x20
    [   18.589840]  igc_remove+0xe1/0x1c0 [igc]
    [   18.589850]  pci_device_remove+0x3b/0xb0
    [   18.589855]  device_release_driver_internal+0x1a5/0x210
    [   18.589861]  unbind_store+0x9d/0xb0
    [   18.589867]  kernfs_fop_write_iter+0x15b/0x210
    [   18.589874]  vfs_write+0x2bd/0x560
    [   18.589880]  ksys_write+0x71/0xf0
    [   18.589886]  do_syscall_64+0x96/0x190
    [   18.589891]  ? find_held_lock+0x2b/0x80
    [   18.589896]  ? lock_release+0x143/0x2c0
    [   18.589902]  ? do_user_addr_fault+0x354/0x8a0
    [   18.589909]  ? exc_page_fault+0x126/0x260
    [   18.589916]  entry_SYSCALL_64_after_hwframe+0x71/0x79
    [   18.589922] RIP: 0033:0x76426194fee4
    [   18.589927] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 85 74 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
    [   18.589946] RSP: 002b:00007ffe69a0ca98 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
    [   18.589955] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 000076426194fee4
    [   18.589963] RDX: 000000000000000d RSI: 000058ae60024480 RDI: 0000000000000001
    [   18.589971] RBP: 00007ffe69a0cac0 R08: 0000000000000000 R09: 0000000000000001
    [   18.589979] R10: 0000000000000004 R11: 0000000000000202 R12: 000000000000000d
    [   18.589987] R13: 000058ae60024480 R14: 0000764261a205c0 R15: 0000764261a1df20
    [   18.589997]  </TASK>
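
The inversion in the report above can be illustrated with a small lock-order
graph: record "B was taken while holding A" as an edge A -> B, and flag any
new acquisition that closes a cycle. This is only a toy model in Python, not
how lockdep is implemented. LockOrderGraph and its methods are made-up names
for illustration, and the two recorded orderings are taken from chains #1 and
#0 of the report.

```python
from collections import defaultdict

class LockOrderGraph:
    """Toy lock-order tracker (hypothetical, for illustration only)."""

    def __init__(self):
        # lock name -> set of locks that were acquired while holding it
        self.after = defaultdict(set)

    def _reachable(self, src, dst):
        """Depth-first search: is there a recorded path src -> ... -> dst?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.after[node])
        return False

    def acquire(self, held, new):
        """Record `new` taken while holding `held`.

        Returns True if the opposite ordering was already recorded,
        i.e. the new edge closes a cycle.
        """
        inversion = self._reachable(new, held)
        self.after[held].add(new)
        return inversion

graph = LockOrderGraph()
# Chain #1: netdev_trig_activate() takes rtnl_mutex under trigger_lock.
first = graph.acquire("trigger_lock", "rtnl_mutex")
# Chain #0: unregister_netdev() holds rtnl_mutex, then LED unregister
# takes trigger_lock.
second = graph.acquire("rtnl_mutex", "trigger_lock")
print(first, second)
```

The first ordering is accepted on its own. The second one is flagged because
the reverse ordering is already in the graph, which matches the CPU0/CPU1
scenario printed by lockdep.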


This is happening in an HVM domain on Xen, with PCI passthrough of the
relevant devices, but I don't think that's related to the issue.

There are some more details at
https://github.com/QubesOS/qubes-issues/issues/9096.


#regzbot introduced: v6.8.4..v6.9-rc2

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
