On Thu, 7 Dec 2023 22:38:23 +0000
Jim Harris <jim.harris@xxxxxxxxxxx> wrote:

> I am seeing a deadlock using SPDK with hotplug detection using vfio-pci
> and an SR-IOV enabled NVMe SSD. It is not clear if this deadlock is intended
> or if it's a kernel bug.
>
> Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
> both SPDK and DPDK in this description.
>
> DPDK registers an eventfd with vfio for hotplug notifications. If the associated
> device is removed (i.e. write 1 to its pci sysfs remove entry), vfio
> writes to the eventfd, requesting DPDK to release the device. It does this
> while holding the device_lock(), and then waits for completion.
>
> DPDK gets the notification, and passes it up to SPDK. SPDK does not release
> the device immediately. It has some asynchronous operations that need to be
> performed first, so it will release the device a bit later.
>
> But before the device is released, SPDK also triggers DPDK to do a sysfs scan
> looking for newly inserted devices. Note that the removed device is not
> completely removed yet from kernel PCI perspective - all of its sysfs entries
> are still available, including sriov_numvfs.
>
> DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
> SPDK itself doesn't actually use this value, but it is part of the scan
> triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
> deadlocks because it tries to hold device_lock() while reading the pci
> device's pdev->sriov->num_VFs.
>
> We're able to work around this in SPDK by deferring the sysfs scan if
> a device removal is in process. And maybe that is what we are supposed to
> be doing, to avoid this deadlock?
>
> Reference to SPDK issue, for some more details (plus simple repro steps for
> anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205

device_lock() has been a recurring problem.
We don't have a lot of leeway in how we support the driver remove
callback; the device needs to be released. We can't return -EBUSY and I
don't think we can drop the mutex while we're waiting on userspace.

I've done some fix-ups in the past to use device_trylock() to avoid
deadlocks, which might be an option here, e.g. reading sriov_numvfs could
return -EBUSY in this scenario.

We keep running into these scenarios though, and we might just need to
pick a point at which we kill the user process holding the device. I'm
open to suggestions. Thanks,

Alex
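
For illustration only, here is a minimal userspace sketch of the trylock idea mentioned above. It is not kernel code and not a proposed patch: a pthread mutex stands in for device_lock(), and struct fake_dev and fake_numvfs_show() are hypothetical names invented for this example. The point is just the shape of the pattern: a reader that cannot take the lock immediately returns -EBUSY instead of blocking behind the remove path.

```c
/* Userspace analogy only: a pthread mutex stands in for the kernel's
 * device_lock(). None of these names are real kernel or vfio APIs. */
#include <errno.h>
#include <pthread.h>

struct fake_dev {
	pthread_mutex_t lock;	/* stands in for device_lock() */
	int num_vfs;		/* stands in for pdev->sriov->num_VFs */
};

/* Hypothetical reader modeled on a sysfs show callback: rather than
 * blocking on the lock, fail fast with -EBUSY if it is already held. */
static int fake_numvfs_show(struct fake_dev *dev, int *out)
{
	if (pthread_mutex_trylock(&dev->lock) != 0)
		return -EBUSY;	/* lock is held elsewhere; do not wait */

	*out = dev->num_vfs;
	pthread_mutex_unlock(&dev->lock);
	return 0;
}
```

With this shape, a scan that reads the value while a removal holds the lock would see an error and could retry later, rather than sleeping until userspace releases the device.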