On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote: > On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote: > > On Thu, 7 Dec 2023 22:38:23 +0000 > > Jim Harris <jim.harris@xxxxxxxxxxx> wrote: > > > > device_lock() has been a recurring problem. We don't have a lot of > > leeway in how we support the driver remove callback, the device needs > > to be released. We can't return -EBUSY and I don't think we can drop > > the mutex while we're waiting on userspace. > > The mechanism of waiting in remove for userspace is inherently flawed, > it can never work fully correctly. :( I've hit this many times. > > Upon remove VFIO should immediately remove itself and leave behind a > non-functional file descriptor. Userspace should catch up eventually > and see it is toast. > > The kernel locking model just cannot support userspace delaying this > process. > > Jason Maybe for now we just whack this specific mole with a separate mutex for synchronizing access to sriov->num_VFs in the sysfs paths? Something like this (tested on my system): --- Author: Jim Harris <jim.harris@xxxxxxxxxxx> pci: sync sriov->num_VFs sysfs access with its own mutex If SR-IOV enabled device is held by vfio, and device is removed, vfio will hold device lock and notify userspace of the removal. If userspace reads sriov_numvfs sysfs entry, that thread will be blocked since sriov_numvfs_show() also tries to acquire the device lock. If that same thread is responsible for releasing the device to vfio, it results in a deadlock. So add a separate mutex, specifically for struct pci_sriov. Use this to synchronize accesses to sriov_numvfs in the sysfs paths. sriov_numvfs_store() will also still hold the device lock while configuring sriov. Fixes: 35ff867b765 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") Signed-off-by: Jim Harris <jim.harris@xxxxxxxxxxx> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 25dbe85c4217..8910cf6c97be 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -398,9 +398,9 @@ static ssize_t sriov_numvfs_show(struct device *dev, u16 num_vfs; /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ - device_lock(&pdev->dev); + mutex_lock(&pdev->sriov->lock); num_vfs = pdev->sriov->num_VFs; - device_unlock(&pdev->dev); + mutex_unlock(&pdev->sriov->lock); return sysfs_emit(buf, "%u\n", num_vfs); } @@ -427,6 +427,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, return -ERANGE; device_lock(&pdev->dev); + mutex_lock(&pdev->sriov->lock); if (num_vfs == pdev->sriov->num_VFs) goto exit; @@ -468,6 +469,7 @@ static ssize_t sriov_numvfs_store(struct device *dev, num_vfs, ret); exit: + mutex_unlock(&pdev->sriov->lock); device_unlock(&pdev->dev); if (ret < 0) @@ -808,6 +810,7 @@ static int sriov_init(struct pci_dev *dev, int pos) nres++; } + mutex_init(&iov->lock); iov->pos = pos; iov->nres = nres; iov->ctrl = ctrl; diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 5ecbcf041179..04e636ab50e5 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -313,6 +313,7 @@ struct pci_sriov { u16 subsystem_device; /* VF subsystem device */ resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */ bool drivers_autoprobe; /* Auto probing of VFs by driver */ + struct mutex lock; /* for synchronizing num_VFs sysfs accesses */ }; #ifdef CONFIG_PCI_DOE