On Wed, 31 Oct 2018 14:04:46 +0100
Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:

...

> With that in place I went back to my 4 modes of the simplified testcase.
> - Threaded FD closes were sped up to ~1.3 seconds
> - Forked FD closes were sped up to ~1.4 seconds
> - Sequential close and exiting the process still took ~18 seconds.
>
> It is clear why the sequential loop takes that time, as each of the
> close calls will clear the FD when returning to userspace.
>
> The process exit (which also matches sending a SIGKILL to the process,
> as libvirt does) will also take the full 18 seconds, as it seems to
> serialize the FD cleanup from task_work_run.
>
> At some point in the future we might make qemu smart enough to drop
> devices concurrently, as in my test, but until there is some kernel
> change allowing the waits to be de-serialized, userspace efforts are
> in vain for now.
>
> I wanted to ask all of you, but mostly Alex TBH, whether there is an
> idea how to allow the resets to be concurrent?
> Or, if that is impossible due to some part of the PCI spec, whether
> there would be a way to coalesce multiple resets into just one wait.

We can certainly allow concurrent resets, there's nothing in the PCI
spec preventing that.  This serialization came about because vfio-pci
initially used an atomic for reference counting of devices, but that
was racy as it doesn't really prevent the device from being re-opened
while we're doing the final release cleanup, or opened again before
we've completed initialization.  The global mutex was a simple
solution, without really considering that resets could be slow.

There was, however, a bit of a benefit to the global mutex for cases
where we needed to do a bus reset with multiple affected devices, i.e.
where we call vfio_pci_try_bus_reset().  Serializing everyone there
means that we don't have, for example, device A and device B, which
are both affected by the same bus reset, trying to perform the
operation at the same time and neither succeeding.

Based on your patch, I take it that you're mostly concerned about the
case of only having a single device on the bus, such that
pci_try_reset_function() takes the bus reset path, as I'd expect with
Tesla GPUs which don't have an audio function on the same bus.  That
patch of course allows a new open of the device while it's being
reset, which is not good.

There's an untested patch below that removes driver_lock and instead
replaces it with a per-device reflck mutex, which seems like a step in
the right direction.  I'll do some testing on it, but maybe you'll
have an opportunity first.  For the case of a single device per bus
where we can use pci_try_reset_function(), I think it should resolve
the issue.  The vfio_pci_try_bus_reset() path is updated as well, but
I'm afraid it just exposes us to the race I describe above.  We're
perhaps optimizing the parallelizing of the reset ahead of the
effectiveness of the reset, and I'm not sure yet whether that's the
best solution.  I'll keep thinking about it.  Thanks,

Alex

commit d763ff034e08f6f7146a2a5bb90aacb66dcb3c8c
Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
Date:   Wed Oct 31 13:22:33 2018 -0600

    vfio/pci: Parallelize device open and release

    In commit 61d792562b53 ("vfio-pci: Use mutex around open, release,
    and remove") a mutex was added to freeze the refcnt for a device so
    that we can handle errors and perform bus resets on final close.
    However, bus resets can be rather slow and a global mutex here is
    undesirable.
    Convert to a per-device mutex, acquiring each necessary mutex for
    all affected devices.

    Reported-by: Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx>
    Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 50cdedfca9fe..66274adb407e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -56,8 +56,6 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
		 "Disable using the PCI D3 low power state for idle, unused devices");

-static DEFINE_MUTEX(driver_lock);
-
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -393,14 +391,14 @@ static void vfio_pci_release(void *device_data)
 {
	struct vfio_pci_device *vdev = device_data;

-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck);

	if (!(--vdev->refcnt)) {
		vfio_spapr_pci_eeh_release(vdev->pdev);
		vfio_pci_disable(vdev);
	}

-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck);

	module_put(THIS_MODULE);
 }
@@ -413,7 +411,7 @@ static int vfio_pci_open(void *device_data)
	if (!try_module_get(THIS_MODULE))
		return -ENODEV;

-	mutex_lock(&driver_lock);
+	mutex_lock(&vdev->reflck);

	if (!vdev->refcnt) {
		ret = vfio_pci_enable(vdev);
@@ -424,7 +422,7 @@ static int vfio_pci_open(void *device_data)
	}
	vdev->refcnt++;
 error:
-	mutex_unlock(&driver_lock);
+	mutex_unlock(&vdev->reflck);
	if (ret)
		module_put(THIS_MODULE);
	return ret;
 }
@@ -1222,6 +1220,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	vdev->pdev = pdev;
	vdev->irq_type = VFIO_PCI_NUM_IRQS;
	mutex_init(&vdev->igate);
+	mutex_init(&vdev->reflck);
	spin_lock_init(&vdev->irqlock);
	mutex_init(&vdev->ioeventfds_lock);
	INIT_LIST_HEAD(&vdev->ioeventfds_list);
@@ -1322,14 +1321,16 @@ static struct pci_driver vfio_pci_driver = {

 struct vfio_devices {
	struct vfio_device **devices;
+	struct vfio_pci_device *base_vdev;
	int cur_index;
	int max_index;
 };

-static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
+static int vfio_pci_get_unused_devs(struct pci_dev *pdev, void *data)
 {
	struct vfio_devices *devs = data;
	struct vfio_device *device;
+	struct vfio_pci_device *vdev;

	if (devs->cur_index == devs->max_index)
		return -ENOSPC;
@@ -1343,16 +1344,33 @@ static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
		return -EBUSY;
	}

+	vdev = vfio_device_data(device);
+
+	/* reflck is already held on base_vdev */
+	if (vdev != devs->base_vdev && !mutex_trylock(&vdev->reflck)) {
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
+	/* Only collect unused devices */
+	if (vdev->refcnt) {
+		if (vdev != devs->base_vdev)
+			mutex_unlock(&vdev->reflck);
+		vfio_device_put(device);
+		return -EBUSY;
+	}
+
	devs->devices[devs->cur_index++] = device;
	return 0;
 }

 /*
- * Attempt to do a bus/slot reset if there are devices affected by a reset for
- * this device that are needs_reset and all of the affected devices are unused
- * (!refcnt).  Callers are required to hold driver_lock when calling this to
- * prevent device opens and concurrent bus reset attempts.  We prevent device
- * unbinds by acquiring and holding a reference to the vfio_device.
+ * Attempt to do a bus/slot reset if there are devices affected by a reset
+ * for this device that are "needs_reset" and all of the affected devices
+ * are unused (!refcnt).  Callers are required to hold vdev->reflck when
+ * calling this to prevent concurrent device opens.  reflck is acquired and
+ * released on all collateral devices.  A vfio_device reference is also
+ * acquired for each affected device to prevent unbinds.
  *
  * NB: vfio-core considers a group to be viable even if some devices are
  * bound to drivers like pci-stub or pcieport.  Here we require all devices
@@ -1361,9 +1379,9 @@ static int vfio_pci_get_devs(struct pci_dev *pdev, void *data)
  */
 static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 {
-	struct vfio_devices devs = { .cur_index = 0 };
+	struct vfio_devices devs = { .cur_index = 0, .base_vdev = vdev };
	int i = 0, ret = -EINVAL;
-	bool needs_reset = false, slot = false;
+	bool slot = false;
	struct vfio_pci_device *tmp;

	if (!pci_probe_reset_slot(vdev->pdev->slot))
@@ -1381,28 +1399,40 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
		return;

	if (vfio_pci_for_each_slot_or_bus(vdev->pdev,
-					  vfio_pci_get_devs, &devs, slot))
+					  vfio_pci_get_unused_devs,
+					  &devs, slot))
		goto put_devs;

+	/* Does at least one need a reset? */
	for (i = 0; i < devs.cur_index; i++) {
		tmp = vfio_device_data(devs.devices[i]);
-		if (tmp->needs_reset)
-			needs_reset = true;
-		if (tmp->refcnt)
-			goto put_devs;
+		if (tmp->needs_reset) {
+			ret = pci_reset_bus(vdev->pdev);
+			break;
+		}
	}

-	if (needs_reset)
-		ret = pci_reset_bus(vdev->pdev);
-
 put_devs:
	for (i = 0; i < devs.cur_index; i++) {
		tmp = vfio_device_data(devs.devices[i]);
-		if (!ret)
+
+		/*
+		 * If reset was successful, affected devices no longer need
+		 * a reset and we should return all the collateral devices
+		 * to low power.  If not successful, we either didn't reset
+		 * the bus or timed out waiting for it, so let's not touch
+		 * the power state.
+		 */
+		if (!ret) {
			tmp->needs_reset = false;
-		if (!tmp->refcnt && !disable_idle_d3)
-			pci_set_power_state(tmp->pdev, PCI_D3hot);
+
+			if (tmp != vdev && !disable_idle_d3)
+				pci_set_power_state(tmp->pdev, PCI_D3hot);
+		}
+
+		/* Release reflck only on collateral devices */
+		if (tmp != vdev)
+			mutex_unlock(&tmp->reflck);

		vfio_device_put(devs.devices[i]);
	}
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d3441a..666f6085bb72 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -104,6 +104,7 @@ struct vfio_pci_device {
	bool			needs_reset;
	bool			nointx;
	struct pci_saved_state	*pci_saved_state;
+	struct mutex		reflck;
	int			refcnt;
	int			ioeventfds_nr;
	struct eventfd_ctx	*err_trigger;
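
For what it's worth, here is a small userspace model (plain pthreads,
not the kernel code above) of the locking scheme the patch follows: the
device driving the reset holds its own reflck, trylocks the reflck of
every collateral device on the bus, and backs off if any of them is
busy or still open.  The struct and function names (model_dev,
try_collect_unused) and the calling convention are invented for this
sketch only.

/*
 * Userspace model of the per-device reflck scheme -- illustrative only.
 * Each device has a per-device lock (like vdev->reflck) and an open
 * count (like vdev->refcnt).
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
	pthread_mutex_t reflck;		/* per-device lock */
	int refcnt;			/* open count */
};

/*
 * Called with base->reflck already held.  Trylock every other device on
 * the "bus"; succeed only if all of them could be locked and none is in
 * use.  On failure, drop whatever collateral locks were taken so that a
 * concurrent attempt from another device can make progress instead.
 */
static bool try_collect_unused(struct model_dev *base, struct model_dev **bus,
			       size_t n, struct model_dev **locked,
			       size_t *nlocked)
{
	*nlocked = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_dev *dev = bus[i];

		/* pthread_mutex_trylock() returns non-zero when busy */
		if (dev != base && pthread_mutex_trylock(&dev->reflck))
			goto unwind;

		if (dev->refcnt) {		/* device still open, bail */
			if (dev != base)
				pthread_mutex_unlock(&dev->reflck);
			goto unwind;
		}

		if (dev != base)
			locked[(*nlocked)++] = dev;
	}
	return true;			/* caller may now reset the bus */

unwind:
	while (*nlocked)
		pthread_mutex_unlock(&locked[--(*nlocked)]->reflck);
	return false;
}

Backing off on a failed trylock is what keeps two devices on the same
bus from blocking on each other's reflck, but it also means concurrent
attempts can both give up, which is the race mentioned above.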