Re: Deadlock during PCIe hot remove and SPDK exit

Yi Liu <yi.l.liu@xxxxxxxxx> · Fri, 29 Nov 2024 16:05:18 +0800

On 2024/11/28 11:50, fengnan chang wrote:


2024年11月27日 14:56，fengnan chang <fengnanchang@xxxxxxxxx> 写道：

Dear PCI maintainers:
   I'm having a deadlock issue, somewhat similar to a previous one https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/#t，; but my kernel (6.6.40) already included the fix f5eff55.

The previous bug was solved by the below commit.

commit f5eff5591b8f9c5effd25c92c758a127765f74c1
Author: Lukas Wunner <lukas@xxxxxxxxx>
Date:   Tue Apr 11 08:21:02 2023 +0200

    PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock

    In 2013, commits

      2e35afaefe64 ("PCI: pciehp: Add reset_slot() method")
      608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()")

    amended PCIe hotplug to mask Presence Detect Changed events during a
    Secondary Bus Reset.  The reset thus no longer causes gratuitous slot
    bringdown and bringup.

    However the commits neglected to serialize reset with code paths reading
    slot registers.  For instance, a slot bringup due to an earlier hotplug
    event may see the Presence Detect State bit cleared during a concurrent
    Secondary Bus Reset.

    In 2018, commit

      5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset")

    retrofitted the missing locking.  It introduced a reset_lock which
    serializes a Secondary Bus Reset with other parts of pciehp.

    Unfortunately the locking turns out to be overzealous:  reset_lock is
    held for the entire enumeration and de-enumeration of hotplugged devices,
    including driver binding and unbinding.

    Driver binding and unbinding acquires device_lock while the reset_lock
    of the ancestral hotplug port is held.  A concurrent Secondary Bus Reset
    acquires the ancestral reset_lock while already holding the device_lock.
    The asymmetric locking order in the two code paths can lead to AB-BA
    deadlocks.

    Michael Haeuptle reports such deadlocks on simultaneous hot-removal and
    vfio release (the latter implies a Secondary Bus Reset):

      pciehp_ist()                                    # down_read(reset_lock)
        pciehp_handle_presence_or_link_change()
          pciehp_disable_slot()
            __pciehp_disable_slot()
              remove_board()
                pciehp_unconfigure_device()
                  pci_stop_and_remove_bus_device()
                    pci_stop_bus_device()
                      pci_stop_dev()
                        device_release_driver()
                          device_release_driver_internal()
                            __device_driver_lock()    # device_lock()

      SYS_munmap()
        vfio_device_fops_release()
          vfio_device_group_close()
            vfio_device_close()
              vfio_device_last_close()


   Here is my test process, I’m running kernel with 6.6.40 and SPDK v22.05:
   1. SPDK use vfio driver to takeover two nvme disks, running some io in nvme.
   2. pull out two nvme disks
   3. Try to kill -9 SPDK process.
   Then deadlock issue happened. For now I can 100% reproduce this problem. I’m not an export in PCI, but I did a brief analysis:
   irq 149 thread take pci_rescan_remove_lock mutex lock, and wait for SPDK to release vfio.
   irq 148 thread take reset_lock of ctrl A, and wait for psi_rescan_remove_lock
   SPDK process try to release vfio driver, but wait for reset_lock of ctrl A.


irq/149-pciehp stack, cat /proc/514/stack,
[<0>] pciehp_unconfigure_device+0x48/0x160 // wait for pci_rescan_remove_lock
[<0>] pciehp_disable_slot+0x6b/0x130       // hold reset_lock of ctrl A
[<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
[<0>] pciehp_ist+0x236/0x260
[<0>] irq_thread_fn+0x1b/0x60
[<0>] irq_thread+0xed/0x190
[<0>] kthread+0xe4/0x110
[<0>] ret_from_fork+0x2d/0x50
[<0>] ret_from_fork_asm+0x11/0x20


irq/148-pciehp stack, cat /proc/513/stack
[<0>] vfio_unregister_group_dev+0x97/0xe0 [vfio]     //wait for

My mistake, this is wait for SPDK to release vfio device. This problem can reproduce in 6.12.
Besides, My college give me an idea, we can make vfio_device_fops_release be async, so when we close fd, it
won’t block, and when we close another fd, it will release  vfio device, this stack will not block too, then the deadlock disappears.


In the hotplug path, vfio needs to notify userspace to stop the usage of
this device and release reference on the vfio_device. When the last
refcount is released, the wait in the vfio_unregister_group_dev() will be
unblocked. It is the vfio_device_fops_release() either userspace exits or
userspace explicitly close the vfio device fd. Your below test patch moves
the majority of the vfio_device_fops_release() out of the existing path.
I don't see a reason why it can work so far.

As the locking issue has been solved in the above commit, seems there is
no deadlock with the reset_lock and device_lock. Can you confirm if the
scenario can be reproduced with one device? Also, even with two devices,
does killing the process matters or not?

Here is my test patch, cc some vfio guys:

diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 50128da18bca..4ebe154a4ae5 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -19,6 +19,7 @@ struct vfio_container;
  struct vfio_device_file {
         struct vfio_device *device;
         struct vfio_group *group;
+       struct work_struct      release_work;

         u8 access_granted;
         u32 devid; /* only valid when iommufd is valid */
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index a5a62d9d963f..47e3e3f73d70 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -487,6 +487,22 @@ static bool vfio_assert_device_open(struct vfio_device *device)
         return !WARN_ON_ONCE(!READ_ONCE(device->open_count));
  }

+static void vfio_fops_release_work(struct work_struct *work)
+{
+       struct vfio_device_file *df =
+               container_of(work, struct vfio_device_file, release_work);
+       struct vfio_device *device = df->device;
+
+       if (df->group)
+               vfio_df_group_close(df);
+       else
+               vfio_df_unbind_iommufd(df);
+
+       vfio_device_put_registration(device);
+
+       kfree(df);
+}
+
  struct vfio_device_file *
  vfio_allocate_device_file(struct vfio_device *device)
  {
@@ -497,6 +513,7 @@ vfio_allocate_device_file(struct vfio_device *device)
                 return ERR_PTR(-ENOMEM);

         df->device = device;
+       INIT_WORK(&df->release_work, vfio_fops_release_work);
         spin_lock_init(&df->kvm_ref_lock);

         return df;
@@ -628,16 +645,8 @@ static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
  static int vfio_device_fops_release(struct inode *inode, struct file *filep)
  {
         struct vfio_device_file *df = filep->private_data;
-       struct vfio_device *device = df->device;

-       if (df->group)
-               vfio_df_group_close(df);
-       else
-               vfio_df_unbind_iommufd(df);
-
-       vfio_device_put_registration(device);
-
-       kfree(df);
+       schedule_work(&df->release_work);

         return 0;
  }


[<0>] vfio_pci_core_unregister_device+0x19/0x80 [vfio_pci_core]
[<0>] vfio_pci_remove+0x15/0x20 [vfio_pci]
[<0>] pci_device_remove+0x39/0xb0
[<0>] device_release_driver_internal+0xad/0x120
[<0>] pci_stop_bus_device+0x5d/0x80
[<0>] pci_stop_and_remove_bus_device+0xe/0x20
[<0>] pciehp_unconfigure_device+0x91/0x160   //hold pci_rescan_remove_lock, release reset_lock of ctrl B
[<0>] pciehp_disable_slot+0x6b/0x130
[<0>] pciehp_handle_presence_or_link_change+0x7d/0x4d0
[<0>] pciehp_ist+0x236/0x260             //hold reset_lock of ctrl B
[<0>] irq_thread_fn+0x1b/0x60
[<0>] irq_thread+0xed/0x190
[<0>] kthread+0xe4/0x110
[<0>] ret_from_fork+0x2d/0x50
[<0>] ret_from_fork_asm+0x11/0x20


SPDK stack, cat /proc/166634/task/167181/stack
[<0>] down_write_nested+0x1b7/0x1c0            //wait for reset_lock of ctrl A.
[<0>] pciehp_reset_slot+0x58/0x160
[<0>] pci_reset_hotplug_slot+0x3b/0x60
[<0>] pci_reset_bus_function+0x3b/0xb0
[<0>] __pci_reset_function_locked+0x3e/0x60
[<0>] vfio_pci_core_disable+0x3ce/0x400 [vfio_pci_core]
[<0>] vfio_pci_core_close_device+0x67/0xc0 [vfio_pci_core]
[<0>] vfio_df_close+0x79/0xd0 [vfio]
[<0>] vfio_df_group_close+0x36/0x70 [vfio]
[<0>] vfio_device_fops_release+0x20/0x40 [vfio]
[<0>] __fput+0xec/0x290
[<0>] task_work_run+0x61/0x90
[<0>] do_exit+0x39c/0xc40
[<0>] do_group_exit+0x33/0xa0
[<0>] get_signal+0xd84/0xd90
[<0>] arch_do_signal_or_restart+0x2a/0x260
[<0>] exit_to_user_mode_prepare+0x1c7/0x240
[<0>] syscall_exit_to_user_mode+0x2a/0x60
[<0>] do_syscall_64+0x3e/0x90




--
Regards,
Yi Liu