Hi Lukas, Thanks for your help on this issue. Here's the requested trace for the happy path (single device remove) and the instrumented code pieces for reference in case I'm doing something wrong with the instrumentation. There are 2 places where vfio tries to take the device lock. Once in pcie_try_reset_function and then later in pci_reset_bus. The number enclosed in {} is the thread id. The dev=<address> is the pointer to the struct device. As mentioned, this is the happy path with one device removal. When multiple devices are removed then execution piles up on the global remove_rescan lock and vfio most of the time gets the device lock first resulting in a dead lock. Let me know if you need more traces/details. -- Michael [ 667.067997] pcieport 0000:b4:0d.0: Slot(14): Card not present // pciehp_ist takes the device lock [ 667.079485] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] before device_lock [ 667.137592] Call Trace: [ 667.142448] dump_stack+0x5c/0x80 [ 667.149029] device_release_driver_internal.part.28+0x47/0x290 [ 667.160625] ? irq_thread_check_affinity+0xa0/0xa0 [ 667.170145] pci_stop_bus_device+0x69/0x90 [ 667.178281] pci_stop_and_remove_bus_device+0xe/0x20 [ 667.188145] pciehp_unconfigure_device+0x7c/0x130 [ 667.197492] __pciehp_disable_slot+0x48/0xd0 [ 667.205974] pciehp_handle_presence_or_link_change+0xdc/0x440 [ 667.217395] ? __synchronize_hardirq+0x43/0x50 [ 667.226228] pciehp_ist+0x1c9/0x1d0 [ 667.233159] ? irq_finalize_oneshot.part.45+0xf0/0xf0 [ 667.243199] irq_thread_fn+0x1f/0x60 [ 667.250300] ? irq_finalize_oneshot.part.45+0xf0/0xf0 [ 667.257356] { 8903} vfio_bar_restore: 0000:c2:00.0 reset recovery - restoring bars [ 667.260336] irq_thread+0x142/0x190 [ 667.260338] ? irq_forced_thread_fn+0x70/0x70 [ 667.260343] kthread+0x112/0x130 // vfio tries to take the lock as well [ 667.275760] vfio-pci 0000:c2:00.0: { 8903} @@@ pci_try_reset_function: dev=00000000bb2b1fe3 [0000:c2:00.0] before device_lock [ 667.282314] ? kthread_flush_work_fn+0x10/0x10 [ 667.282318] ret_from_fork+0x1f/0x40 // but pciehp_ist got it first [ 667.282337] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_lock // This is the stack trace from above pcie_try_reset_function [ 667.290979] Call Trace: [ 667.290983] dump_stack+0x5c/0x80 [ 667.290988] pci_try_reset_function+0x52/0xf0 [ 667.290992] vfio_pci_disable+0x3cd/0x480 [ 667.290994] vfio_pci_release+0x42/0x50 [ 667.290995] vfio_device_fops_release+0x1e/0x40 [ 667.290998] __fput+0xa5/0x1d0 [ 667.291001] task_work_run+0x8a/0xb0 [ 667.291004] exit_to_usermode_loop+0xd3/0xe0 [ 667.291005] do_syscall_64+0xe1/0x100 [ 667.291007] entry_SYSCALL_64_after_hwframe+0x65/0xca // vfio did not get the lock since pciehp_ist/ device_release_driver_internal got it first [ 667.291017] vfio-pci 0000:c2:00.0: { 8903} @@@ pci_try_reset_function: dev=00000000bb2b1fe3 [0000:c2:00.0] DID NOT GET device_lock // vfio does a pci_reset_bus which calls pci_slot_trylock that tries to take the device lockl [ 667.351615] { 8903} !!! pci_slot_trylock: dev=00000000bb2b1fe3 [0000:c2:00.0] before device_lock, slot=000000000dbce595 [14] [ 667.355872] vfio-pci 0000:c2:00.0: { 1917} No device request channel registered, blocked until released by user // Stack trace for right before pci_slot_trylock. [ 667.377124] Call Trace: [ 667.377126] dump_stack+0x5c/0x80 [ 667.723398] pci_reset_bus+0x128/0x160 [ 667.730843] vfio_pci_disable+0x37b/0x480 [ 667.738808] ? vfio_pci_rw+0x80/0x80 [ 667.745907] vfio_pci_release+0x42/0x50 [ 667.753525] vfio_device_fops_release+0x1e/0x40 [ 667.762524] __fput+0xa5/0x1d0 [ 667.768587] task_work_run+0x8a/0xb0 [ 667.775686] exit_to_usermode_loop+0xd3/0xe0 [ 667.784166] do_syscall_64+0xe1/0x100 [ 667.791438] entry_SYSCALL_64_after_hwframe+0x65/0xca // vfio: pci_slot_trylock did not get the lock [ 667.931879] { 8903} !!! pci_slot_trylock: dev=00000000bb2b1fe3 [0000:c2:00.0] DID NOT GET device_lock [ 667.962175] iommu: Removing device 0000:c2:00.0 from group 12 [ 667.985035] vfio-pci 0000:c2:00.0: Refused to change power state, currently in D3 // pciehp_ist is done and releases the lock [ 667.999956] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_unlock // not sure why we pciehp_ist is executed again... [ 668.020316] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] before device_lock [ 668.040381] CPU: 11 PID: 1917 Comm: irq/44-pciehp Kdump: loaded Tainted: G U O --------- - - 4.18.0 #104 [ 668.061476] Hardware name: Hewlett Packard Enterprise FALCON/FALCON, BIOS 66.14.01 2020_01_10 [ 668.078422] Call Trace: [ 668.083276] dump_stack+0x5c/0x80 [ 668.089857] device_release_driver_internal.part.28+0x47/0x290 [ 668.101450] ? irq_thread_check_affinity+0xa0/0xa0 [ 668.110969] bus_remove_device+0xf7/0x170 [ 668.118934] device_del+0x13b/0x340 [ 668.125861] pci_remove_bus_device+0x77/0x100 [ 668.134517] pciehp_unconfigure_device+0x7c/0x130 [ 668.143866] __pciehp_disable_slot+0x48/0xd0 [ 668.152348] pciehp_handle_presence_or_link_change+0xdc/0x440 [ 668.163766] ? __synchronize_hardirq+0x43/0x50 [ 668.172593] pciehp_ist+0x1c9/0x1d0 [ 668.179522] ? irq_finalize_oneshot.part.45+0xf0/0xf0 [ 668.189558] irq_thread_fn+0x1f/0x60 [ 668.196656] ? irq_finalize_oneshot.part.45+0xf0/0xf0 [ 668.206692] irq_thread+0x142/0x190 [ 668.213619] ? irq_forced_thread_fn+0x70/0x70 [ 668.222275] kthread+0x112/0x130 [ 668.228684] ? kthread_flush_work_fn+0x10/0x10 [ 668.237513] ret_from_fork+0x1f/0x40 [ 668.244627] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_lock [ 668.264642] { 1917} !!! device_release_driver_internal: dev=00000000bb2b1fe3: [0000:c2:00.0] after device_unlock Instrumented code: int pci_try_reset_function(struct pci_dev *dev) { int rc; if (!dev->reset_fn) { return -ENOTTY; } pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] before device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name); dump_stack(); if (!pci_dev_trylock(dev)) { pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] DID NOT GET device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name); return -EAGAIN; } pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] after device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name); pci_dev_save_and_disable(dev); rc = __pci_reset_function_locked(dev); pci_dev_restore(dev); pci_dev_unlock(dev); pci_info(dev, "{%5d} @@@ pci_try_reset_function: dev=%p [%s] after device_unlock\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name); return rc; } void device_release_driver_internal(struct device *dev, struct device_driver *drv, struct device *parent) { if (parent && dev->bus->need_parent_lock) { device_lock(parent); } printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] before device_lock\n",task_pid_nr(current), dev, dev->kobj.name); dump_stack(); device_lock(dev); printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] after device_lock\n",task_pid_nr(current), dev, dev->kobj.name); if (!drv || drv == dev->driver) { __device_release_driver(dev, parent); } device_unlock(dev); printk(KERN_ERR "{%5d} !!! device_release_driver_internal: dev=%p: [%s] after device_unlock\n",task_pid_nr(current), dev, dev->kobj.name); if (parent && dev->bus->need_parent_lock) device_unlock(parent); } static int pci_slot_trylock(struct pci_slot *slot) { struct pci_dev *dev; list_for_each_entry(dev, &slot->bus->devices, bus_list) { if (!dev->slot || dev->slot != slot) continue; printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] before device_lock, slot=%p [%s]\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name, slot, slot->kobj.name); dump_stack(); if (!pci_dev_trylock(dev)) { printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] DID NOT GET device_lock\n",task_pid_nr(current), &dev->dev, dev->dev.kobj.name); goto unlock; } printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] after device_lock\n", task_pid_nr(current),&dev->dev, dev->dev.kobj.name); if (dev->subordinate) { if (!pci_bus_trylock(dev->subordinate)) { pci_dev_unlock(dev); goto unlock; } } } printk(KERN_INFO "{%5d} !!! pci_slot_trylock: dev=%p [%s] success\n", task_pid_nr(current), &dev->dev, dev->dev.kobj.name); return 1; unlock: list_for_each_entry_continue_reverse(dev, &slot->bus->devices, bus_list) { if (!dev->slot || dev->slot != slot) continue; if (dev->subordinate) pci_bus_unlock(dev->subordinate); pci_dev_unlock(dev); } return 0; } -----Original Message----- From: Lukas Wunner [mailto:lukas@xxxxxxxxx] Sent: Sunday, March 29, 2020 9:44 AM To: Haeuptle, Michael <michael.haeuptle@xxxxxxx> Cc: linux-pci@xxxxxxxxxxxxxxx; michaelhaeuptle@xxxxxxxxx; Christoph Hellwig <hch@xxxxxx> Subject: Re: Deadlock during PCIe hot remove On Tue, Mar 24, 2020 at 03:21:52PM +0000, Haeuptle, Michael wrote: > I'm running into a deadlock scenario between the hotplug, pcie and > vfio_pci driver when removing multiple devices in parallel. > This is happening on CentOS8 (4.18) with SPDK (spdk.io). I'm using the > latest pciehp code, the rest is all 4.18. > > The sequence that leads to the deadlock is as follows: > > The pciehp_ist() takes the reset_lock early in its processing. > While the pciehp_ist processing is progressing, vfio_pci calls > pci_try_reset_function() as part of vfio_pci_release or open. > The pci_try_reset_function() takes the device lock. > > Eventually, pci_try_reset_function() calls pci_reset_hotplug_slot() > which calls pciehp_reset_slot(). The pciehp_reset_slot() tries to take > the reset_lock but has to wait since it is already taken by > pciehp_ist(). > > Eventually pciehp_ist calls pcie_stop_device() which calls > device_release_driver_internal(). This function also tries to take > device_lock causing the dead lock. > > Here's the kernel stack trace when the deadlock occurs: > > [root@localhost ~]# cat /proc/8594/task/8598/stack [<0>] > pciehp_reset_slot+0xa5/0x220 [<0>] > pci_reset_hotplug_slot.cold.72+0x20/0x36 > [<0>] pci_dev_reset_slot_function+0x72/0x9b > [<0>] __pci_reset_function_locked+0x15b/0x190 > [<0>] pci_try_reset_function.cold.77+0x9b/0x108 > [<0>] vfio_pci_disable+0x261/0x280 > [<0>] vfio_pci_release+0xcb/0xf0 > [<0>] vfio_device_fops_release+0x1e/0x40 > [<0>] __fput+0xa5/0x1d0 > [<0>] task_work_run+0x8a/0xb0 > [<0>] exit_to_usermode_loop+0xd3/0xe0 > [<0>] do_syscall_64+0xe1/0x100 > [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca > [<0>] 0xffffffffffffffff There's something I don't understand here: The device_lock exists per-device. The reset_lock exists per hotplug-capable downstream port. Now if I understand correctly, in the stacktrace above, the device_lock of the *downstream port* is acquired and then its reset_lock is tried to acquire, which however is already held by pciehp_ist(). You're saying that pciehp_ist() is handling removal of the endpoint device *below* the downstream port, which means that device_release_driver_internal() tries to acquire the device_lock of the *endpoint device*. That's a separate lock than the one acquired by vfio_pci_disable() before calling pci_try_reset_function()! So I don't quite understand how there can be a deadlock. Could you instrument the code with a few printk()'s and dump_stack()'s to show exactly which device's device_lock is acquired from where? Note that device_release_driver_internal() also acquires the parent's device_lock and this would indeed be the one of the downstream port. However commit 8c97a46af04b ("driver core: hold dev's parent lock when needed") constrained that to USB devices. So the parent lock shouldn't be taken for PCI devices. That commit went into v4.18, please double-check that you have it in your tree. Thanks, Lukas