On Wed, 31 Oct 2018 15:37:06 -0600
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Wed, 31 Oct 2018 14:04:46 +0100
> Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:
> ...
> > With that in place I went back to my 4 modes of the simplified testcase:
> > - Threaded FD closes were sped up to ~1.3 seconds
> > - Forked FD closes were sped up to ~1.4 seconds
> > - Sequential closes and exiting the process still took ~18 seconds
> >
> > It is clear why the sequential loop takes that long, as each of the
> > close calls will clear the FD when returning to userspace.
> >
> > The process exit (which also matches libvirt sending a SIGKILL to the
> > process) takes the full 18 seconds as well, since it seems to serialize
> > the FD cleanup from task_work_run.
> >
> > At some point in the future we might make qemu smart enough to drop
> > devices concurrently as in my test, but until there is a kernel change
> > allowing the waits to be de-serialized, userspace efforts are in vain
> > for now.
> >
> > I wanted to ask all of you, but mostly Alex TBH, whether there is an
> > idea for how to allow the resets to be concurrent?
> > Or, if that is impossible due to some part of the PCI spec, whether
> > there would be a way to coalesce multiple resets into just one wait.
>
> We can certainly allow concurrent resets, there's nothing in the PCI
> spec preventing that.  This serialization came about because vfio-pci
> initially used an atomic for reference counting of devices, but that
> was racy as it doesn't really prevent the device from being re-opened
> while we're doing the final release cleanup, or opened again before
> we've completed initialization.  The global mutex was a simple solution
> without really considering that resets could be slow.  There was,
> however, a bit of a benefit to the global mutex for cases where we
> needed to do a bus reset with multiple affected devices, i.e. where we
> call vfio_pci_try_bus_reset().  Serializing everyone there means that
> we don't have, for example, device A and device B, which are both
> affected by the same bus reset, trying to perform the operation at the
> same time and neither succeeding.  Based on your patch, I take it that
> you're mostly concerned about the case of only having a single device
> on the bus, such that pci_try_reset_function() takes the bus reset
> path, as I'd expect with Tesla GPUs, which don't have an audio function
> on the same bus.  That patch of course allows a new open of the device
> while it's being reset, which is not good.
>
> There's an untested patch below that removes driver_lock and instead
> replaces it with a per-device reflck mutex, which seems like a step in
> the right direction.  I'll do some testing on it, but maybe you'll have
> an opportunity first.  For the case of a single device per bus where we
> can use pci_try_reset_function(), I think it should resolve the issue.
> The vfio_pci_try_bus_reset() path is updated as well, but I'm afraid it
> just exposes us to the race I describe above.  We're optimizing the
> parallelizing of the reset perhaps ahead of the effectiveness of the
> reset, and I'm not sure yet whether that's the best solution.  I'll
> keep thinking about it.  Thanks,

I modified your test program so that I could use a device with five
functions, all in the same group, none of which supports a function
level reset.  This is the scenario where a per-device mutex fails: no
individual function can ever acquire the mutexes of all the other
functions, so we never perform the bus reset when they're all released.
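As a rough illustration only, below is a toy userspace model of the two
schemes.  It is not the vfio-pci code and not the patch; every name in
it (struct vdev, reflck, driver_lock, release_per_device(),
release_global(), try_bus_reset()) is made up, and the locking sequence
is just a sketch of the behavior described above.  With a strictly
per-device lock, releases that are in flight together can never trylock
all of their siblings, so the bus reset is skipped and never retried;
with the single global lock, whichever release comes last sees the open
count hit zero and can do the reset while re-opens are still excluded.

/*
 * reflck-model.c: toy userspace model, NOT the vfio-pci code.
 * Build: cc -pthread -o reflck-model reflck-model.c
 */
#include <pthread.h>
#include <stdio.h>

#define NFUNCS 5	/* five functions, one bus, no FLR on any of them */

struct vdev {
	pthread_mutex_t reflck;		/* per-device lock (proposed scheme) */
};

static struct vdev dev[NFUNCS];
static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER; /* global lock */
static pthread_barrier_t barrier;
static int open_count = NFUNCS;

static void try_bus_reset(const char *scheme)
{
	printf("%s: bus reset performed\n", scheme);
}

/*
 * Per-device scheme: a release holds its own reflck and may only do the
 * bus reset if it can also trylock every sibling on the bus.
 */
static void *release_per_device(void *arg)
{
	int i = (int)(long)arg, j, stop;

	pthread_mutex_lock(&dev[i].reflck);

	/* All five releases are now in flight, each holding its own lock,
	 * just like the threaded FD-close test. */
	pthread_barrier_wait(&barrier);

	for (stop = 0; stop < NFUNCS; stop++) {
		if (stop == i)
			continue;
		if (pthread_mutex_trylock(&dev[stop].reflck))
			break;	/* sibling holds its own reflck: give up */
	}

	if (stop == NFUNCS)
		try_bus_reset("per-device");	/* never reached in this model */
	else
		printf("per-device: function %d hit a busy sibling, reset skipped\n", i);

	for (j = 0; j < stop; j++)	/* drop whatever sibling locks we took */
		if (j != i)
			pthread_mutex_unlock(&dev[j].reflck);
	pthread_mutex_unlock(&dev[i].reflck);
	return NULL;
}

/*
 * Global scheme: one driver_lock serializes open and release, so the
 * last release out sees open_count hit zero and resets the bus while
 * re-opens are still excluded by the same lock.
 */
static void *release_global(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&driver_lock);
	if (--open_count == 0)
		try_bus_reset("global");	/* last one out resets */
	pthread_mutex_unlock(&driver_lock);
	return NULL;
}

int main(void)
{
	pthread_t t[NFUNCS];
	long i;

	pthread_barrier_init(&barrier, NULL, NFUNCS);
	for (i = 0; i < NFUNCS; i++)
		pthread_mutex_init(&dev[i].reflck, NULL);

	for (i = 0; i < NFUNCS; i++)
		pthread_create(&t[i], NULL, release_per_device, (void *)i);
	for (i = 0; i < NFUNCS; i++)
		pthread_join(t[i], NULL);

	for (i = 0; i < NFUNCS; i++)
		pthread_create(&t[i], NULL, release_global, (void *)i);
	for (i = 0; i < NFUNCS; i++)
		pthread_join(t[i], NULL);

	return 0;
}

Running it prints five "reset skipped" lines for the per-device scheme
and a single "bus reset performed" line for the global scheme.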
In the existing implementation, the release of the last device on the
bus is able to perform the reset because the other devices are blocked
from being re-opened by that global mutex.  So the patch I proposed
works for your case, but it is a regression wherever devices depend on
each other in order to perform that bus reset.  I'm not sure how to
resolve that.  Thanks,

Alex