On Mon, Sep 10, 2018 at 07:08:48PM +0200, Lukas Wunner wrote: > On Mon, Sep 10, 2018 at 10:45:28AM -0600, Keith Busch wrote: > > On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote: > > > On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote: > > > > The sysfs entries still function. Their actions are only temporarily > > > > stalled during error handling. Once the slot reset is called, the > > > > ctrl->pending_events is queried to take requested actions. > > > > > > Okay I see. Still, releasing the IRQ and requesting it again seems fairly > > > heavy-wheight. Why not just acquire ctrl->reset_lock? > > > > That was looking like a nice way to handle it, but it introduces > > circular locking between ctrl->reset_lock and pci_bus_sem: > > > > CPU A CPU B > > --------------------------------- ------------------------ > > pci_walk_bus pciehp_ist > > down_read(&pci_bus_sem) down_read(&ctrl->reset_lock); > > pcie_portdrv_error_detected pciehp_handle_presence_or_link_change > > pciehp_error_detected pciehp_unconfigure_device > > down_write(&ctrl->reset_lock) pci_stop_and_remove_bus_devicea > > down_write(&pci_bus_sem); > > Why is pciehp bringing down the slot? Is that in reaction to a > Link Down caused by the error? This could just be something that happens if a hotplug and error event occur about the same time. > Can this be solved with Sinan's > approach to check in pciehp whether PCI_EXP_DEVSTA_FED is set > and if so, waiting for it to be handled? That only helps if the downstream port detected a fatal error, but error handling happens for any device reported error. > FWIW you can use synchronize_irq() in pciehp_error_detected() > if you need to wait for the IRQ thread to stop before taking > the reset_lock. That would just be a different dead lock: pciehp_error_detected is holding a read lock on pci_bus_sem, and irq thread may request a write lock, so both threads are dead locked on each other.