Re: Deadlock during PCIe hot remove

Lukas Wunner <lukas@xxxxxxxxx> · Sun, 29 Mar 2020 15:04:20 +0200

On Wed, Mar 25, 2020 at 11:40:18AM +0100, Christoph Hellwig wrote:
> On Tue, Mar 24, 2020 at 05:15:34PM +0100, Lukas Wunner wrote:
> > The pci_dev_trylock() in pci_try_reset_function() looks questionable
> > to me.  It was added by commit b014e96d1abb ("PCI: Protect
> > pci_error_handlers->reset_notify() usage with device_lock()")
> > with the following rationale:
> > 
> >     Every method in struct device_driver or structures derived from it like
> >     struct pci_driver MUST provide exclusion vs the driver's ->remove()
> >     method, usually by using device_lock().
> >     [...]
> >     Without this, ->reset_notify() may race with ->remove() calls, which
> >     can be easily triggered in NVMe.
> > 
> > The intersection of drivers defining a ->reset_notify() hook and files
> > invoking pci_try_reset_function() appears to be empty.  So I don't quite
> > understand the problem the commit sought to address.  What am I missing?
> 
> No driver defines ->reset_notify as that has been split into
> ->reset_prepare and ->reset_done a while ago, and plenty of drivers
> define those.  And we can't call into drivers unless we know the driver
> actually still is bound to the device, which is why we need the locking.

Sure, you need to hold the driver in place while you're invoking one of
its callbacks.  But is it really necessary to hold the device lock while
performing the actual reset?  That locking seems awfully coarse-grained.

Do you see any potential problem in pushing down the pci_dev_lock() and
pci_dev_unlock() calls into pci_dev_save_and_disable() and
pci_dev_restore()?  I.e, acquire the lock for the invocation of
->reset_prepare() and ->reset_done() and release it immediately
afterwards?

That would seem to fix the deadlock Michael reported.

Of course that could result in ->reset_prepare() being invoked but
->reset_done() being not invoked if the driver is no longer bound.
Or in ->reset_done() being called for a different driver if the
device was rebound in the meantime.  Would this cause issues?

Thanks,

Lukas