[cc += Christoph]

On Tue, Mar 24, 2020 at 03:21:52PM +0000, Haeuptle, Michael wrote:
> I'm running into a deadlock scenario between the hotplug, pcie and
> vfio_pci driver when removing multiple devices in parallel.
> This is happening on CentOS8 (4.18) with SPDK (spdk.io). I'm using the
> latest pciehp code, the rest is all 4.18.
>
> The sequence that leads to the deadlock is as follows:
>
> The pciehp_ist() takes the reset_lock early in its processing. While
> the pciehp_ist processing is progressing, vfio_pci calls
> pci_try_reset_function() as part of vfio_pci_release or open.
> The pci_try_reset_function() takes the device lock.
>
> Eventually, pci_try_reset_function() calls pci_reset_hotplug_slot()
> which calls pciehp_reset_slot(). The pciehp_reset_slot() tries to take
> the reset_lock but has to wait since it is already taken by pciehp_ist().
>
> Eventually pciehp_ist calls pcie_stop_device() which calls
> device_release_driver_internal(). This function also tries to take
> device_lock causing the dead lock.

The pci_dev_trylock() in pci_try_reset_function() looks questionable
to me.  It was added by commit b014e96d1abb ("PCI: Protect
pci_error_handlers->reset_notify() usage with device_lock()") with
the following rationale:

    Every method in struct device_driver or structures derived from it
    like struct pci_driver MUST provide exclusion vs the driver's
    ->remove() method, usually by using device_lock().
    [...]
    Without this, ->reset_notify() may race with ->remove() calls, which
    can be easily triggered in NVMe.

The intersection of drivers defining a ->reset_notify() hook and files
invoking pci_try_reset_function() appears to be empty.  So I don't quite
understand the problem the commit sought to address.  What am I missing?

Thanks,

Lukas