Re: [PATCH] PCI: Device removal rescan deadlock

Lukas Wunner <lukas@xxxxxxxxx> · Sat, 22 Sep 2018 10:11:17 +0200

On Fri, Sep 21, 2018 at 02:57:52PM -0600, Keith Busch wrote:
> The pci_rescan_remove_lock provides single-threaded enumeration. This
> can deadlock if a task requests this lock to enumerate below a device
> while another task wants to remove that device.
> 
> A simple example to create this dead lock from sysfs looks something
> like the following:
> 
>   # echo 1 > /sys/bus/pci/devices/<parent>/remove & \
>     echo 1 > /sys/bus/pci/devices/<child>/rescan &
> 
> If the parent removal locks the pci_rescan_remove_lock first, the child
> thread is blocked from forward progress, but the parent can't release
> the lock until the child thread releases the device.
> 
> Similar deadlocks are possible in the surprise removal path due to
> deadlocked interrupt handlers.
> 
> This patch provides an alternate locking method that exits in failure
> if the caller wants to scan below a device being removed.

For context, alternative patches to tackle this problem were submitted
earlier by Xiongfeng Wang and myself:

https://patchwork.ozlabs.org/patch/877835/
https://patchwork.ozlabs.org/patch/930403/

The solution you've found lools good in principle, this might work.

However.  I had withdrawn my solution because I came to the conclusion
that pci_lock_rescan_remove() is basically the equivalent to the
Big Kernel Lock in the PCI subsystem.  It's way too coarse-grained and
the proper solution, it seems, is to move to more fine-grained locking.

Specifically, device removal comprises two phases, unbinding of the
driver (pci_stop_bus_device()) and destruction of the pci_dev
(pci_remove_bus_device()).  Currently pci_lock_rescan_remove()
protects both phases.  If we make it so that it only protects the
second phase, such that the first phase runs lockless, those deadlocks
automatically go away.  I've explained this in a little more detail
in this reply to Mika:
https://www.spinics.net/lists/linux-pci/msg75960.html

Making unbinding of the driver lockless requires careful examination
of core PCI enumeration code.  E.g., you've moved the call to
pci_dev_assign_added() up in pci_stop_dev().  I don't know if that's
safe.  Changing the "added" flag has consequences which I haven't
fully understood yet, e.g. it's used to determine newly discovered
devices during rescan, but I recall noticing that it also has other
meanings.  This is historically grown code which needs to be untangled
to make it run lockless, but that takes time.

The problem is real, so the conundrum is whether to apply a fix now
which essentially amounts to duct tape, but helps users, or fixing
the root cause, which takes more time and would leave users exposed
to the problem for now.  Applying duct tape reduces pressure to
address the root cause and makes untangling more difficult
in the long run.

Thanks,

Lukas