On Thu, 1 Nov 2018 14:20:31 -0600
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Wed, 31 Oct 2018 15:37:06 -0600
> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> 
> > On Wed, 31 Oct 2018 14:04:46 +0100
> > Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:
> > ...
> > > With that in place I went back to my 4 modes of the simplified
> > > testcase:
> > > - Threaded FD closes were sped up to ~1.3 seconds
> > > - Forked FD closes were sped up to ~1.4 seconds
> > > - Sequential close and exiting the process still took ~18 seconds
> > >
> > > It is clear why the sequential loop takes that time, as each of the
> > > close calls will clear the FD when returning to userspace.
> > >
> > > The process exit (which also matches libvirt sending SIGKILL to the
> > > process) will also take the full 18 seconds, as it seems to
> > > serialize the FD cleanup from task_work_run.
> > >
> > > At some point in the future we might make qemu smart enough to drop
> > > devices concurrently, as in my test, but until there is a kernel
> > > change that de-serializes the waits, userspace efforts are in vain.
> > >
> > > I wanted to ask all of you, but mostly Alex TBH, whether there is an
> > > idea for allowing the resets to be concurrent, or, if that is
> > > impossible due to some part of the PCI spec, whether there is a way
> > > to coalesce multiple resets into just one wait.
> > 
> > We can certainly allow concurrent resets, there's nothing in the PCI
> > spec preventing that.  This serialization came about because vfio-pci
> > initially used an atomic for reference counting of devices, but that
> > was racy as it doesn't really prevent the device from being re-opened
> > while we're doing the final release cleanup, or opened again before
> > we've completed initialization.  The global mutex was a simple
> > solution without really considering that resets could be slow.  There
> > was, however, a bit of a benefit to the global mutex for cases where
> > we needed to do a bus reset with multiple affected devices, i.e. where
> > we call vfio_pci_try_bus_reset().  Serializing everyone there means
> > that we don't have, for example, device A and device B, both affected
> > by the same bus reset, trying to perform the operation at the same
> > time with neither succeeding.  Based on your patch, I take it that
> > you're mostly concerned about the case of only having a single device
> > on the bus, such that pci_try_reset_function() takes the bus reset
> > path, as I'd expect using Tesla GPUs which don't have an audio
> > function on the same bus.  That patch of course allows a new open of
> > the device while it's being reset, which is not good.
> > 
> > There's an untested patch below that removes driver_lock and instead
> > replaces it with a per-device reflck mutex, which seems like a step in
> > the right direction.  I'll do some testing on it, but maybe you'll
> > have an opportunity first.  For the case of a single device per bus
> > where we can use pci_try_reset_function(), I think it should resolve
> > the issue.  The vfio_pci_try_bus_reset() path is updated as well, but
> > I'm afraid it just exposes us to the race I describe above.  We're
> > optimizing the parallelizing of the reset perhaps ahead of the
> > effectiveness of the reset, and I'm not sure yet if that's the best
> > solution.  I'll keep thinking about it.  Thanks,
> 
> I modified your test program so that I could use a device with five
> functions, all in the same group, none of which support a function-level
> reset.  This is the scenario where using a per-device mutex fails,
> because no individual function can ever acquire the mutexes of all the
> other functions, and we never perform the bus reset when they're all
> released.  In the existing implementation, the release of the last
> device on the bus is able to perform the reset, as the other devices
> are blocked from being re-opened by that global mutex.  Thus the patch
> I proposed works for your case, but it's a regression where devices
> depend on one another to perform that bus reset.  I'm not sure how to
> resolve that.  Thanks,

One more try, lightly tested; hopefully the best of both worlds, with a
mutex per bus/slot so that multiple buses/slots can be reset in parallel,
while the devices affected by a given reset are serialized per bus/slot.
Now my multi-function device gets a bus reset with the parallel unit
tests.  Thanks,

Alex

commit dc05ffc2f5a666a68b547f326e4c6061abf9f89a
Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
Date:   Wed Oct 31 13:22:33 2018 -0600

    vfio/pci: Parallelize device open and release

    In commit 61d792562b53 ("vfio-pci: Use mutex around open, release, and
    remove") a mutex was added to freeze the refcnt for a device so that
    we can handle errors and perform bus resets on final close.  However,
    bus resets can be rather slow, and a global mutex here is undesirable.
    A per-device mutex provides the best granularity, but then our chances
    of triggering a bus/slot reset with multiple affected devices are slim
    when devices are released in parallel.  Instead, create a reflck
    object shared among all devices under the same bus or slot, allowing
    devices on independent buses to be released in parallel while
    serializing per bus/slot.
    Reported-by: Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx>
    Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 50cdedfca9fe..8751ea302466 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
		 "Disable using the PCI D3 low power state for idle, unused devices");
 
-static DEFINE_MUTEX(driver_lock);
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
 {
 	struct vfio_pci_device *vdev = device_data;
 
-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck->lock);
 
 	if (!(--vdev->refcnt)) {
 		vfio_spapr_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);
 	}
 
-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck->lock);
 
 	module_put(THIS_MODULE);
 }
@@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
 
-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck->lock);
 
 	if (!vdev->refcnt) {
 		ret = vfio_pci_enable(vdev);
@@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
 	}
 	vdev->refcnt++;
 error:
-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck->lock);
 	if (ret)
 		module_put(THIS_MODULE);
 	return ret;
@@ -1187,6 +1185,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.request	= vfio_pci_request,
 };
 
+static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev);
+static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck);
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_device *vdev;
@@ -1233,6 +1234,14 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return ret;
 	}
 
+	ret = vfio_pci_reflck_attach(vdev);
+	if (ret) {
+		vfio_del_group_dev(&pdev->dev);
+		vfio_iommu_group_put(group, &pdev->dev);
+		kfree(vdev);
+		return ret;
+	}
+
 	if (vfio_pci_is_vga(pdev)) {
 		vga_client_register(pdev, vdev, NULL, vfio_pci_set_vga_decode);
 		vga_set_legacy_decoding(pdev,
@@ -1264,6 +1273,8 @@ static void vfio_pci_remove(struct pci_dev *pdev)
 	if (!vdev)
 		return;
 
+	vfio_pci_reflck_put(vdev->reflck);
+
 	vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
 	kfree(vdev->region);
 	mutex_destroy(&vdev->ioeventfds_lock);
@@ -1320,16 +1331,97 @@ static struct pci_driver vfio_pci_driver = {
 	.err_handler	= &vfio_err_handlers,
 };
 
+static DEFINE_MUTEX(reflck_lock);
+
+static struct vfio_pci_reflck *vfio_pci_reflck_alloc(void)
+{
+	struct vfio_pci_reflck *reflck;
+
+	reflck = kzalloc(sizeof(*reflck), GFP_KERNEL);
+	if (!reflck)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&reflck->kref);
+	mutex_init(&reflck->lock);
+
+	return reflck;
+}
+
+static void vfio_pci_reflck_get(struct vfio_pci_reflck *reflck)
+{
+	kref_get(&reflck->kref);
+}
+
+static int vfio_pci_reflck_find(struct pci_dev *pdev, void *data)
+{
+	struct vfio_pci_reflck **preflck = data;
+	struct vfio_device *device;
+	struct vfio_pci_device *vdev;
+
+	device = vfio_device_get_from_dev(&pdev->dev);
+	if (!device)
+		return 0;
+
+	if (pci_dev_driver(pdev) != &vfio_pci_driver) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	vdev = vfio_device_data(device);
+
+	if (vdev->reflck) {
+		vfio_pci_reflck_get(vdev->reflck);
+		*preflck = vdev->reflck;
+		vfio_device_put(device);
+		return 1;
+	}
+
+	vfio_device_put(device);
+	return 0;
+}
+
+static int vfio_pci_reflck_attach(struct vfio_pci_device *vdev)
+{
+	bool slot = !pci_probe_reset_slot(vdev->pdev->slot);
+
+	mutex_lock(&reflck_lock);
+
+	if (pci_is_root_bus(vdev->pdev->bus) ||
+	    vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_reflck_find,
+					  &vdev->reflck, slot) <= 0)
+		vdev->reflck = vfio_pci_reflck_alloc();
+
+	mutex_unlock(&reflck_lock);
+
+	return IS_ERR(vdev->reflck) ? PTR_ERR(vdev->reflck) : 0;
+}
+
+static void vfio_pci_reflck_release(struct kref *kref)
+{
+	struct vfio_pci_reflck *reflck = container_of(kref,
+						      struct vfio_pci_reflck,
+						      kref);
+
+	kfree(reflck);
+	mutex_unlock(&reflck_lock);
+}
+
+static void vfio_pci_reflck_put(struct vfio_pci_reflck *reflck)
+{
+	kref_put_mutex(&reflck->kref, vfio_pci_reflck_release, &reflck_lock);
+}
+
 struct vfio_devices {
 	struct vfio_device **devices;
 	int cur_index;
 	int max_index;
 };
 
-static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_devices *devs = data;
 	struct vfio_device *device;
+	struct vfio_pci_device *vdev;
 
 	if (devs->cur_index == devs->max_index)
 		return -ENOSPC;
@@ -1343,16 +1435,25 @@ static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
 		return -EBUSY;
 	}
 
+	vdev = vfio_device_data(device);
+
+	/* Only collect unused devices */
+	if (vdev->refcnt) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
 	devs->devices[devs->cur_index++] = device;
 	return 0;
 }
 
 /*
- * Attempt to do a bus/slot reset if there are devices affected by a reset for
- * this device that are needs_reset and all of the affected devices are unused
- * (!refcnt).  Callers are required to hold driver_lock when calling this to
- * prevent device opens and concurrent bus reset attempts.  We prevent device
- * unbinds by acquiring and holding a reference to the vfio_device.
+ * Attempt to do a bus/slot reset if there are devices affected by a reset
+ * for this device that are "needs_reset" and all of the affected devices
+ * are unused (!refcnt).  Callers are required to hold vdev->reflck when
+ * calling this to prevent concurrent device opens.  reflck is acquired and
+ * released on all collateral devices.  A vfio_device reference is also
+ * acquired for each affected device to prevent unbinds.
  *
  * NB: vfio-core considers a group to be viable even if some devices are
  * bound to drivers like pci-stub or pcieport.  Here we require all devices
@@ -1363,7 +1464,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 {
 	struct vfio_devices devs = { .cur_index = 0 };
 	int i = 0, ret = -EINVAL;
-	bool needs_reset = false, slot = false;
+	bool slot = false;
 	struct vfio_pci_device *tmp;
 
 	if (!pci_probe_reset_slot(vdev->pdev->slot))
@@ -1381,32 +1482,42 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 		return;
 
 	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_devs, &devs, slot))
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
 		goto put_devs;
 
+	/* Does at least one need a reset? */
 	for (i = 0; i < devs.cur_index; i++) {
 		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset)
-			needs_reset = true;
-		if (tmp->refcnt)
-			goto put_devs;
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
 	}
 
-	if (needs_reset)
-		ret = pci_reset_bus(vdev->pdev);
-
 put_devs:
 	for (i = 0; i < devs.cur_index; i++) {
 		tmp = vfio_device_data(devs.devices[i]);
-		if (!ret)
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
 			tmp->needs_reset = false;
-		if (!tmp->refcnt && !disable_idle_d3)
-			pci_set_power_state(tmp->pdev, PCI_D3hot);
+			if (tmp != vdev && !disable_idle_d3)
+				pci_set_power_state(tmp->pdev, PCI_D3hot);
+		}
 
 		vfio_device_put(devs.devices[i]);
 	}
 
+	printk("%s: %d\n", __func__, ret); // DEBUG REMOVE ME
+
 	kfree(devs.devices);
 }
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d3441a..aa2355e67340 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -76,6 +76,11 @@ struct vfio_pci_dummy_resource {
 	struct list_head	res_next;
 };
 
+struct vfio_pci_reflck {
+	struct kref		kref;
+	struct mutex		lock;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
@@ -104,6 +109,7 @@ struct vfio_pci_device {
 	bool			needs_reset;
 	bool			nointx;
 	struct pci_saved_state	*pci_saved_state;
+	struct vfio_pci_reflck	*reflck;
 	int			refcnt;
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
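
For reference, below is a rough sketch of the kind of parallel-close
timing test discussed above.  This is not the actual test program from
this thread, just a hypothetical illustration: it assumes a single VFIO
group (group number and member PCI addresses passed on the command line),
sets up a type1 container, grabs a device fd for every function, and then
closes all of the device fds from separate threads so that the final
vfio_pci_release() calls, and therefore the resets, can overlap.  Build
with "gcc -o vfio-close-test vfio-close-test.c -lpthread".

/*
 * vfio-close-test.c - hypothetical sketch, not the test program used in
 * this thread.  Usage (group number, then the PCI addresses in it):
 *   ./vfio-close-test 42 0000:3b:00.0 0000:3b:00.1 ...
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

#define MAX_DEVS 16

static void *close_fd(void *arg)
{
	/* Last reference to the device fd; this is what ends up in
	 * vfio_pci_release() and potentially a bus/slot reset. */
	close((int)(long)arg);
	return NULL;
}

int main(int argc, char **argv)
{
	char path[64];
	int container, group, i, ndevs = argc - 2;
	int devfds[MAX_DEVS];
	pthread_t threads[MAX_DEVS];
	struct timespec start, end;

	if (ndevs < 1 || ndevs > MAX_DEVS)
		return 1;

	container = open("/dev/vfio/vfio", O_RDWR);
	snprintf(path, sizeof(path), "/dev/vfio/%s", argv[1]);
	group = open(path, O_RDWR);
	if (container < 0 || group < 0 ||
	    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) ||
	    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
		perror("vfio setup");
		return 1;
	}

	/* Getting each device fd triggers vfio_pci_open()/enable */
	for (i = 0; i < ndevs; i++) {
		devfds[i] = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, argv[i + 2]);
		if (devfds[i] < 0) {
			perror(argv[i + 2]);
			return 1;
		}
	}

	/* Close every device fd from its own thread and time that phase */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ndevs; i++)
		pthread_create(&threads[i], NULL, close_fd,
			       (void *)(long)devfds[i]);
	for (i = 0; i < ndevs; i++)
		pthread_join(threads[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("parallel close of %d device fds took %.2fs\n", ndevs,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);

	close(group);
	close(container);
	return 0;
}

With the per-bus/slot reflck above, closes of functions on different
buses/slots should proceed in parallel, while functions sharing a
bus/slot still serialize on their common reflck, so the last one
released still gets the chance to perform the bus reset.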