Deadlock between reset_lock and pci_slot_mutex

Ilkka Koskinen <ilkka@xxxxxxxxxxxxxxxxxxxxxx> · Mon, 9 Sep 2024 12:44:19 -0700 (PDT)

Hi all,

We are seeing a deadlock between reset_lock and pci_slot_mutex when 
injecting an error with an Intel PEI error injection card. It was initially 
reported on some older kernels, but it also reproduces on 6.11-rc6. 
Apparently, it requires the FW-first mode of AER handling to be enabled.

   Not tainted 6.11.0-rc6-orig+ #8
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/0:2 state:D stack:0 pid:1003 tgid:1003 ppid:2 flags:0x00000008
   Workqueue: events aer_recover_work_func
   Call trace:
    __switch_to+0xc4/0xe8
    __schedule+0x280/0x748
    schedule+0x3c/0xe0
    schedule_preempt_disabled+0x2c/0x50
    rwsem_down_write_slowpath+0x1ec/0x6f0
    down_write+0xac/0xb8
    pciehp_reset_slot+0x60/0x178 			<-- ctrl->reset_lock
    pci_reset_hotplug_slot+0x54/0x90
    pci_slot_reset+0x138/0x1a8
    pci_bus_error_reset+0x110/0x158			<-- pci_slot_mutex
    aer_root_reset+0xbc/0x298
    pcie_do_recovery+0x2a0/0x3b8
    aer_recover_work_func+0x144/0x150
    process_one_work+0x184/0x420
    worker_thread+0x250/0x360
    kthread+0xfc/0x110
    ret_from_fork+0x10/0x20

   INFO: task irq/78-pciehp:1497 blocked for more than 122 seconds.
       Not tainted 6.11.0-rc6-orig+ #8
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:irq/78-pciehp state:D stack:0 pid:1497 tgid:1497 ppid:2 flags:0x00000008
   Call trace:
    __switch_to+0xc4/0xe8
    __schedule+0x280/0x748
    schedule+0x3c/0xe0
    schedule_preempt_disabled+0x2c/0x50
    __mutex_lock.constprop.0+0x28c/0x960
    __mutex_lock_slowpath+0x1c/0x30
    mutex_lock+0x6c/0x88
    pci_dev_assign_slot+0x2c/0x88		<-- pci_slot_mutex
    pci_setup_device+0xfc/0x6f0
    pci_scan_single_device+0xd0/0x120
    pci_scan_slot+0x6c/0x200
    pciehp_configure_device+0x50/0x188
    pciehp_enable_slot+0x1b0/0x290
    pciehp_handle_presence_or_link_change+0xfc/0x208
    pciehp_ist+0x214/0x260
    irq_thread_fn+0x34/0xb8
    irq_thread+0x160/0x250			<-- ctrl->reset_lock
    kthread+0xfc/0x110
    ret_from_fork+0x10/0x20

I noticed Ian May reported two deadlocks a while ago [1]. The first issue 
got fixed, but I'm wondering whether the other one was patched as well and 
we're simply seeing a new, yet similar, one?

[1] https://lore.kernel.org/linux-pci/20200615143250.438252-1-ian.may@xxxxxxxxxxxxx/

Cheers, Ilkka