Re: Deadlock during PCIe hot remove and SPDK exit

fengnan chang <fengnanchang@xxxxxxxxx> · Tue, 3 Dec 2024 17:31:30 +0800

On Tue, Dec 3, 2024 at 1:31 PM Yi Liu <yi.l.liu@xxxxxxxxx> wrote:
>
> On 2024/12/2 11:25, fengnan chang wrote:
> > On Fri, Nov 29, 2024 at 4:00 PM Yi Liu <yi.l.liu@xxxxxxxxx> wrote:
> >>
> >> On 2024/11/28 11:50, fengnan chang wrote:
> >>>
> >>>
> >>>> 2024年11月27日 14:56，fengnan chang <fengnanchang@xxxxxxxxx> 写道：
> >>>>
> >>>> Dear PCI maintainers:
> >>>>     I'm having a deadlock issue, somewhat similar to a previous one https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/#t，; but my kernel (6.6.40) already included the fix f5eff55.
> >>
> >> The previous bug was solved by the below commit.
> >>
> >> commit f5eff5591b8f9c5effd25c92c758a127765f74c1
> >> Author: Lukas Wunner <lukas@xxxxxxxxx>
> >> Date:   Tue Apr 11 08:21:02 2023 +0200
> >>
> >>       PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock
> >>
> >
> > Yes, my kernel version included this fix.
> >
> >>       In 2013, commits
> >>
> >>         2e35afaefe64 ("PCI: pciehp: Add reset_slot() method")
> >>         608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()")
> >>
> >>       amended PCIe hotplug to mask Presence Detect Changed events during a
> >>       Secondary Bus Reset.  The reset thus no longer causes gratuitous slot
> >>       bringdown and bringup.
> >>
> >>       However the commits neglected to serialize reset with code paths reading
> >>       slot registers.  For instance, a slot bringup due to an earlier hotplug
> >>       event may see the Presence Detect State bit cleared during a concurrent
> >>       Secondary Bus Reset.
> >>
> >>       In 2018, commit
> >>
> >>         5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset")
> >>
> >>       retrofitted the missing locking.  It introduced a reset_lock which
> >>       serializes a Secondary Bus Reset with other parts of pciehp.
> >>
> >>       Unfortunately the locking turns out to be overzealous:  reset_lock is
> >>       held for the entire enumeration and de-enumeration of hotplugged devices,
> >>       including driver binding and unbinding.
> >>
> >>       Driver binding and unbinding acquires device_lock while the reset_lock
> >>       of the ancestral hotplug port is held.  A concurrent Secondary Bus Reset
> >>       acquires the ancestral reset_lock while already holding the device_lock.
> >>       The asymmetric locking order in the two code paths can lead to AB-BA
> >>       deadlocks.
> >>
> >>       Michael Haeuptle reports such deadlocks on simultaneous hot-removal and
> >>       vfio release (the latter implies a Secondary Bus Reset):
> >>
> >>         pciehp_ist()                                    # down_read(reset_lock)
> >>           pciehp_handle_presence_or_link_change()
> >>             pciehp_disable_slot()
> >>               __pciehp_disable_slot()
> >>                 remove_board()
> >>                   pciehp_unconfigure_device()
> >>                     pci_stop_and_remove_bus_device()
> >>                       pci_stop_bus_device()
> >>                         pci_stop_dev()
> >>                           device_release_driver()
> >>                             device_release_driver_internal()
> >>                               __device_driver_lock()    # device_lock()
> >>
> >>         SYS_munmap()
> >>           vfio_device_fops_release()
> >>             vfio_device_group_close()
> >>               vfio_device_close()
> >>                 vfio_device_last_close()
> >>
> >>
> >>>>     Here is my test process, I’m running kernel with 6.6.40 and SPDK v22.05:
> >>>>     1. SPDK use vfio driver to takeover two nvme disks, running some io in nvme.
> >>>>     2. pull out two nvme disks
> >>>>     3. Try to kill -9 SPDK process.
> >>>>     Then deadlock issue happened. For now I can 100% reproduce this problem. I’m not an export in PCI, but I did a brief analysis:
> >>>>     irq 149 thread take pci_rescan_remove_lock mutex lock, and wait for SPDK to release vfio.
> >>>>     irq 148 thread take reset_lock of ctrl A, and wait for psi_rescan_remove_lock
> >>>>     SPDK process try to release vfio driver, but wait for reset_lock of ctrl A.
> >>>>
> >>>>
> >>>> irq/149-pciehp stack, cat /proc/514/stack,
> >>>> [<0>] pciehp_unconfigure_device+0x48/0x160 // wait for pci_rescan_remove_lock
> >>>> [<0>] pciehp_disable_slot+0x6b/0x130       // hold reset_lock of ctrl A
> >>>> [<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
> >>>> [<0>] pciehp_ist+0x236/0x260
> >>>> [<0>] irq_thread_fn+0x1b/0x60
> >>>> [<0>] irq_thread+0xed/0x190
> >>>> [<0>] kthread+0xe4/0x110
> >>>> [<0>] ret_from_fork+0x2d/0x50
> >>>> [<0>] ret_from_fork_asm+0x11/0x20
> >>>>
> >>>>
> >>>> irq/148-pciehp stack, cat /proc/513/stack
> >>>> [<0>] vfio_unregister_group_dev+0x97/0xe0 [vfio]     //wait for
> >>>
> >>> My mistake, this is wait for SPDK to release vfio device. This problem can reproduce in 6.12.
> >>> Besides, My college give me an idea, we can make vfio_device_fops_release be async, so when we close fd, it
> >>> won’t block, and when we close another fd, it will release  vfio device, this stack will not block too, then the deadlock disappears.
> >>>
> >>
> >> In the hotplug path, vfio needs to notify userspace to stop the usage of
> >> this device and release reference on the vfio_device. When the last
> >> refcount is released, the wait in the vfio_unregister_group_dev() will be
> >> unblocked. It is the vfio_device_fops_release() either userspace exits or
> >> userspace explicitly close the vfio device fd. Your below test patch moves
> >> the majority of the vfio_device_fops_release() out of the existing path.
> >> I don't see a reason why it can work so far.
> >
> > Forgive my poor English. Let me explain in more detail.
> > irq 149 thread take pci_rescan_remove_lock mutex lock, take reset_lock
> > of ctrl B and release reset_lock of ctrl B. Then wait for SPDK to
> > release vfio refcount connected to ctrl B.
> > irq 148 thread take reset_lock of ctrl A, and wait for psi_rescan_remove_lock.
>
> it's 148 which holds pci_rescan_remove_lock and wait for the last refcount
> of vfio_device.
>
> > SPDK process exit, and try to release vfio refcount connected to ctrl
> > A, but need to wait for reset_lock of ctrl A.  Since SPDK is blocked
> > here, it won’t release vfio refcount connected to ctrl B. So the
> > deadlock is happening.
>
> but I got you now. So the SPDK process exit thread closes the vfio_device
> A and B sequentially. The vfio_device A file release is blocked by the 149
> while 149 is blocked by the pci_rescan_remove_lock which is held by 148.
> 148 is waiting for the vfio_device B's refcount which is done in
> vfio_device B's file release.  The vfio_device B file release is not done
> because of the exit thread is dealing with vfio_device A. So this should be
> related to timing. Did you see any difference w.r.t. the order of plugging
> out the devices? Maybe you can try something different to double confirm
> it. e.g. plugging out the devices as the order of how they are listed in
> the QEMU cmdline.

Yes, if the order of plugging out the device changed, this problem
would also disappear.
But in the real world, we encounter problems because the ssd disk is
broken, so I couldn't
change the order.

>
> > After my patch. When SPDK exit it’s async to release vfio refcount
> > connected to ctrl A, it still will block, but SPDK can still can
> > release vfio refcount connected to ctrl B. Since irq 149 release
> > reset_lock of ctrl B, it won’t Block. After release vfio refcount
> > connected to ctrl B, irq 149 can finish it’s work, and will release
> > pci_rescan_remove_lock mutex lock, and then irq 148 can take
> > pci_rescan_remove_lock, then all is same as ctrl A.
> >
> >
> >>
> >> As the locking issue has been solved in the above commit, seems there is
> >> no deadlock with the reset_lock and device_lock. Can you confirm if the
> >> scenario can be reproduced with one device? Also, even with two devices,
> >> does killing the process matters or not?
> >
> > This problem is not  a deadlock with the reset_lock and device_lock.
> > The scenario can’t be reproduced with one
> > device, only two devices can reproduce this problem. Kill the process
> > may not be important, but
> > the interrupt thread is blocked too, and all lspci commands will
> > block, this affects some monitoring
> > programs.
>
> yes, this issue is not related to no response from userspace.
>
>
> --
> Regards,
> Yi Liu