On Wed, 31 Oct 2018 14:04:46 +0100
Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:

...

> With that in place I went back to my 4 modes of the simplified testcase.
> - Threaded FD closes were sped up to ~1.3 seconds
> - Forked FD closes were sped up to ~1.4 seconds
> - Sequential close and exiting the process still took ~18 seconds.
>
> It is clear why the sequential loop takes that time, as each of the
> close calls will clear the FD when returning to userspace.
>
> The process exit (which also matches sending a SIGKILL to the process,
> as libvirt does) will also take the full 18 seconds, as it seems to
> serialize the FD cleanup from task_work_run.
>
> At some point in the future we might make qemu smart enough to drop
> devices concurrently, as in my test, but until there is some kernel
> change allowing the waits to be de-serialized, userspace efforts are
> in vain for now.
>
> I wanted to ask all of you, but mostly Alex TBH, whether there is an
> idea how to allow the resets to be concurrent?
> Or, if that is impossible due to some part of the PCI spec, whether
> there would be a way to coalesce multiple resets into just one wait.

We can certainly allow concurrent resets, there's nothing in the PCI
spec preventing that.  This serialization came about because vfio-pci
initially used an atomic for reference counting of devices, but that
was racy as it doesn't really prevent the device from being re-opened
while we're doing the final release cleanup, or opened again before
we've completed initialization.  The global mutex was a simple
solution, without really considering that resets could be slow.

There was, however, a bit of a benefit to the global mutex for cases
where we needed to do a bus reset with multiple affected devices, i.e.
where we call vfio_pci_try_bus_reset().  Serializing everyone there
means that we don't have, for example, device A and device B, which
are both affected by the same bus reset, trying to perform the
operation at the same time and neither succeeding.

Based on your patch, I take it that you're mostly concerned about the
case of only having a single device on the bus, such that
pci_try_reset_function() takes the bus reset path, as I'd expect with
Tesla GPUs which don't have an audio function on the same bus.  That
patch of course allows a new open of the device while it's being
reset, which is not good.

There's an untested patch below that removes driver_lock and instead
replaces it with a per-device reflck mutex, which seems like a step in
the right direction.  I'll do some testing on it, but maybe you'll
have an opportunity first.  For the case of a single device per bus
where we can use pci_try_reset_function(), I think it should resolve
the issue.  The vfio_pci_try_bus_reset() path is updated as well, but
I'm afraid it just exposes us to the race I describe above.  We're
perhaps optimizing the parallelizing of the reset ahead of the
effectiveness of the reset, and I'm not sure yet whether that's the
best solution.  I'll keep thinking about it.  Thanks,

Alex

commit d763ff034e08f6f7146a2a5bb90aacb66dcb3c8c
Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
Date:   Wed Oct 31 13:22:33 2018 -0600

    vfio/pci: Parallelize device open and release

    In commit 61d792562b53 ("vfio-pci: Use mutex around open, release,
    and remove") a mutex was added to freeze the refcnt for a device so
    that we can handle errors and perform bus resets on final close.
    However, bus resets can be rather slow and a global mutex here is
    undesirable.
    Convert to a per-device mutex, acquiring each necessary mutex for
    all affected devices.

    Reported-by: Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx>
    Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 50cdedfca9fe..66274adb407e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
		 "Disable using the PCI D3 low power state for idle, unused devices");

-static DEFINE_MUTEX(driver_lock);
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
 {
	struct vfio_pci_device *vdev = device_data;

-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck);

	if (!(--vdev->refcnt)) {
		vfio_spapr_pci_eeh_release(vdev->pdev);
		vfio_pci_disable(vdev);
	}

-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck);

	module_put(THIS_MODULE);
 }
@@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;

-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck);

	if (!vdev->refcnt) {
		ret = vfio_pci_enable(vdev);
@@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
	}
	vdev->refcnt++;
 error:
-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck);
	if (ret)
		module_put(THIS_MODULE);
	return ret;
 }
@@ -1222,6 +1220,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	vdev->pdev = pdev;
	vdev->irq_type = VFIO_PCI_NUM_IRQS;
	mutex_init(&vdev->igate);
+	mutex_init(&vdev->reflck);
	spin_lock_init(&vdev->irqlock);
	mutex_init(&vdev->ioeventfds_lock);
	INIT_LIST_HEAD(&vdev->ioeventfds_list);
@@ -1322,14 +1321,16 @@ static struct pci_driver vfio_pci_driver = {

 struct vfio_devices {
	struct vfio_device **devices;
+	struct vfio_pci_device *base_vdev;
	int cur_index;
	int max_index;
 };

-static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
	struct vfio_devices *devs = data;
	struct vfio_device *device;
+	struct vfio_pci_device *vdev;

	if (devs->cur_index == devs->max_index)
		return -ENOSPC;
@@ -1343,16 +1344,33 @@ static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
		return -EBUSY;
	}

+	vdev = vfio_device_data(device);
+
+	/* reflck is already held on base_vdev */
+	if (vdev != devs->base_vdev && !mutex_trylock(&vdev->reflck)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	/* Only collect unused devices */
+	if (vdev->refcnt) {
+		if (vdev != devs->base_vdev)
+			mutex_unlock(&vdev->reflck);
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
	devs->devices[devs->cur_index++] = device;
	return 0;
 }

 /*
- * Attempt to do a bus/slot reset if there are devices affected by a reset for
- * this device that are needs_reset and all of the affected devices are unused
- * (!refcnt).  Callers are required to hold driver_lock when calling this to
- * prevent device opens and concurrent bus reset attempts.  We prevent device
- * unbinds by acquiring and holding a reference to the vfio_device.
+ * Attempt to do a bus/slot reset if there are devices affected by a reset
+ * for this device that are "needs_reset" and all of the affected devices
+ * are unused (!refcnt).  Callers are required to hold vdev->reflck when
+ * calling this to prevent concurrent device opens.  reflck is acquired and
+ * released on all collateral devices.  A vfio_device reference is also
+ * acquired for each affected device to prevent unbinds.
  *
  * NB: vfio-core considers a group to be viable even if some devices are
  * bound to drivers like pci-stub or pcieport.  Here we require all devices
@@ -1361,9 +1379,9 @@ static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
  */
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 {
-	struct vfio_devices devs = { .cur_index = 0 };
+	struct vfio_devices devs = { .cur_index = 0, .base_vdev = vdev };
	int i = 0, ret = -EINVAL;
-	bool needs_reset = false, slot = false;
+	bool slot = false;
	struct vfio_pci_device *tmp;

	if (!pci_probe_reset_slot(vdev->pdev->slot))
@@ -1381,28 +1399,40 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
		return;

	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_devs, &devs, slot))
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
		goto put_devs;

+	/* Does at least one need a reset? */
	for (i = 0; i < devs.cur_index; i++) {
		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset)
-			needs_reset = true;
-		if (tmp->refcnt)
-			goto put_devs;
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
	}

-	if (needs_reset)
-		ret = pci_reset_bus(vdev->pdev);
-
 put_devs:
	for (i = 0; i < devs.cur_index; i++) {
		tmp = vfio_device_data(devs.devices[i]);
-		if (!ret)
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
			tmp->needs_reset = false;
-		if (!tmp->refcnt && !disable_idle_d3)
-			pci_set_power_state(tmp->pdev, PCI_D3hot);
+
+			if (tmp != vdev && !disable_idle_d3)
+				pci_set_power_state(tmp->pdev, PCI_D3hot);
+		}
+
+		/* Release reflck only on collateral devices */
+		if (tmp != vdev)
+			mutex_unlock(&tmp->reflck);

		vfio_device_put(devs.devices[i]);
	}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d3441a..666f6085bb72 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -104,6 +104,7 @@ struct vfio_pci_device {
	bool			needs_reset;
	bool			nointx;
	struct pci_saved_state	*pci_saved_state;
+	struct mutex		reflck;
	int			refcnt;
	int			ioeventfds_nr;
	struct eventfd_ctx	*err_trigger;
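
For what it's worth, here is a small userspace model (plain pthreads,
not the kernel code above) of the locking scheme the patch follows: the
device driving the reset holds its own reflck, trylocks the reflck of
every collateral device on the bus, and backs off if any of them is
busy or still open.  The struct and function names (model_dev,
try_collect_unused) and the calling convention are invented for this
sketch only.

/*
 * Userspace model of the per-device reflck scheme -- illustrative only.
 * Each device has a per-device lock (like vdev->reflck) and an open
 * count (like vdev->refcnt).
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
	pthread_mutex_t reflck;		/* per-device lock */
	int refcnt;			/* open count */
};

/*
 * Called with base->reflck already held.  Trylock every other device on
 * the "bus"; succeed only if all of them could be locked and none is in
 * use.  On failure, drop whatever collateral locks were taken so that a
 * concurrent attempt from another device can make progress instead.
 */
static bool try_collect_unused(struct model_dev *base, struct model_dev **bus,
			       size_t n, struct model_dev **locked,
			       size_t *nlocked)
{
	*nlocked = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_dev *dev = bus[i];

		/* pthread_mutex_trylock() returns non-zero when busy */
		if (dev != base && pthread_mutex_trylock(&dev->reflck))
			goto unwind;

		if (dev->refcnt) {		/* device still open, bail */
			if (dev != base)
				pthread_mutex_unlock(&dev->reflck);
			goto unwind;
		}

		if (dev != base)
			locked[(*nlocked)++] = dev;
	}
	return true;			/* caller may now reset the bus */

unwind:
	while (*nlocked)
		pthread_mutex_unlock(&locked[--(*nlocked)]->reflck);
	return false;
}

Backing off on a failed trylock is what keeps two devices on the same
bus from blocking on each other's reflck, but it also means concurrent
attempts can both give up, which is the race mentioned above.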