Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks

Keith Busch <keith.busch@xxxxxxxxx> · Mon, 10 Sep 2018 11:22:46 -0600

On Mon, Sep 10, 2018 at 07:08:48PM +0200, Lukas Wunner wrote:
> On Mon, Sep 10, 2018 at 10:45:28AM -0600, Keith Busch wrote:
> > On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote:
> > > On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> > > > The sysfs entries still function. Their actions are only temporarily
> > > > stalled during error handling. Once the slot reset is called, the
> > > > ctrl->pending_events is queried to take requested actions.
> > > 
> > > Okay I see.  Still, releasing the IRQ and requesting it again seems fairly
> > > heavy-wheight.  Why not just acquire ctrl->reset_lock?
> > 
> > That was looking like a nice way to handle it, but it introduces
> > circular locking between ctrl->reset_lock and pci_bus_sem:
> > 
> > CPU A                               CPU B
> > ---------------------------------   ------------------------
> > pci_walk_bus                        pciehp_ist
> >  down_read(&pci_bus_sem)             down_read(&ctrl->reset_lock);
> >   pcie_portdrv_error_detected         pciehp_handle_presence_or_link_change
> >    pciehp_error_detected               pciehp_unconfigure_device
> >     down_write(&ctrl->reset_lock)       pci_stop_and_remove_bus_devicea
> >                                          down_write(&pci_bus_sem);
> 
> Why is pciehp bringing down the slot?  Is that in reaction to a
> Link Down caused by the error?  

This could just be something that happens if a hotplug and error event
occur about the same time.

> Can this be solved with Sinan's
> approach to check in pciehp whether PCI_EXP_DEVSTA_FED is set
> and if so, waiting for it to be handled?

That only helps if the downstream port detected a fatal error, but
error handling happens for any device reported error.

> FWIW you can use synchronize_irq() in pciehp_error_detected()
> if you need to wait for the IRQ thread to stop before taking
> the reset_lock.

That would just be a different dead lock: pciehp_error_detected is
holding a read lock on pci_bus_sem, and irq thread may request a
write lock, so both threads are dead locked on each other.