Hiho everybody!

TL;DR: We need some way to avoid holding the vfio_pci.c driver_lock while it goes into a 1 second sleep in pci_reset_secondary_bus.

## Background ##

After the discussions with Alex, Daniel and David at the KVM Forum I did some experiments and wanted to come back now with what I found. Since I expect the first fix has to go to the kernel and cross posts are generally hated, I'll start on this mailing list (and the KVM list is a nice intersection of the people we need anyway). Once a kernel change has happened we can start thinking about the related userspace components themselves.

## The Problem ##

I brought this up in the context of a libvirt issue/fix [1] a while ago:
- kvm guests with many GPUs via vfio
- shutdown/destroy of the guest takes quite long
- scales linearly, ~1 second per device
- time is consumed in the function pci_reset_secondary_bus
- platforms seem to have up to 32 GPUs now, which means ~32 seconds
- after ~30 seconds timeouts (e.g. in libvirt) hit and things get awkward
- we taught libvirt to understand some cases that need longer

But requiring that much time still feels wrong. What will happen when systems with even more such devices are used?

With that start we discussed options to improve the situation:
- kernel  - is there an in-kernel serialization?
- qemu    - could it drop vfio FDs smarter?
- libvirt - what could it do in advance of the SIGTERM?

I started to create a testcase for those to help our discussions.

[1]: https://www.redhat.com/archives/libvir-list/2018-August/msg00181.html

## Experiment ##

Based on Alex's vfio tests [2] I created a test [3] that will issue the vfio-FD close in various fashions, to allow us to explore the case without the burden of considering all the code active in qemu/libvirt.

The tool allows you to run in four modes:
 e - just exit, frees FDs (return from main)
 c - close device FDs in a loop before exit
 t - concurrently close device FDs before exit (threads; a rough sketch follows at the end of this section)
 p - open and close in individual process context (fork)

In my test I usually ran with 16 GPUs and, with some overhead, end up with approximately 18 seconds to release the process resources. That is true for any of the modes above. If anyone wants to recreate this, I think you can see the effect with anything >=4 GPUs.

With those modes I ran different ftraces. The most useful of these can be seen at [4]. In those traces we see the 16*1 second stall, and the related critical area looks like this:

 108681 1245.575473 | 95) $ 14791992 us | } /* __mutex_lock.isra.2 */
 108684 1245.575473 | 95) $ 14791992 us | } /* __mutex_lock_slowpath */
 108686 1245.575474 | 95) $ 14791994 us | } /* mutex_lock */
 108688 1245.575475 | 95)               | vfio_pci_disable [vfio_pci]() {
 108691 1245.575475 | 95)               |   pci_clear_master() {
 108692 1245.575476 | 95)               |     __pci_set_master() {

From here it isn't too far to see:

static void vfio_pci_release(void *device_data)
{
        struct vfio_pci_device *vdev = device_data;

        mutex_lock(&driver_lock);

        if (!(--vdev->refcnt)) {
                vfio_spapr_pci_eeh_release(vdev->pdev);
                vfio_pci_disable(vdev);
        }

        mutex_unlock(&driver_lock);

        module_put(THIS_MODULE);
}

That does not look too bad until you know that vfio_pci_disable might sleep 1 second in a stack trace like this:

 vfio-iommu-mass-3507 [042] .... 532.650135: pci_reset_secondary_bus <-pcibios_reset_secondary_bus
 vfio-iommu-mass-3507 [042] .... 532.650138: <stack trace>
 => 0xffffffffc09c5077
 => pci_reset_secondary_bus
 => pcibios_reset_secondary_bus
 => pci_parent_bus_reset
 => __pci_reset_function_locked
 => pci_try_reset_function
 => vfio_pci_disable
 => vfio_pci_release
 => vfio_device_fops_release
 => __fput
 => ____fput
 => task_work_run
 => do_exit
 => do_group_exit
 => SyS_exit_group
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

[2]: https://github.com/awilliam/tests
[3]: https://github.com/cpaelzer/tests/blob/vfio-test-slow-dev-release/vfio-iommu-mass-alloc-release.c
[4]: http://paste.ubuntu.com/p/Hbcvbdmgyb/
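For reference, the concurrent mode 't' mentioned above boils down to roughly the following per device FD. This is a minimal sketch only; the helper names here are made up and the container/IOMMU group setup that the real tool [3] has to do first is left out:

#include <pthread.h>
#include <unistd.h>

/*
 * Rough shape of mode 't': one thread per already-opened vfio device FD,
 * all closing at the same time.  Each close may sleep ~1 second in the
 * kernel during the secondary bus reset.
 */
static void *close_device_fd(void *arg)
{
        close((int)(long)arg);          /* release one vfio device FD */
        return NULL;
}

static void close_devices_concurrently(int *fds, int count)
{
        pthread_t threads[count];
        int i;

        for (i = 0; i < count; i++)
                pthread_create(&threads[i], NULL, close_device_fd,
                               (void *)(long)fds[i]);
        for (i = 0; i < count; i++)
                pthread_join(threads[i], NULL);
}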
## Insights ##

After finding the above I created a not-even-RFC-level patch [5] to prove that a change there would help. It essentially drops the lock before leaving vfio towards PCI code - ignoring all the issues this might imply :-)

With that in place I went back to my 4 modes of the simplified testcase:
- threaded FD closes were sped up to ~1.3 seconds
- forked FD closes were sped up to ~1.4 seconds
- sequential close and exiting the process still took ~18 seconds

It is clear why the sequential loop takes that time, as each of the close calls will clear the FD when returning to userspace. The process exit (which also matches sending a SIGKILL to the process, as libvirt does) will also take the full 18 seconds, as it seems to serialize the FD cleanup from task_work_run.

At some point in the future we might make qemu smarter so that it drops devices concurrently as in my test, but until we have some kernel change allowing to de-serialize the waits, userspace efforts are in vain for now.

I wanted to ask all of you, but mostly Alex TBH: is there an idea how to allow the resets to run concurrently? Or, if that is impossible due to some parts of the PCI spec, would there be a way to coalesce multiple resets into just one wait?

[5]: http://paste.ubuntu.com/p/sGzMTqfhSK/

P.S. resubmit trying to convince gmail to stick to the plain text only mode :-/

-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
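For completeness, the throw-away experiment [5] has roughly the following shape. This is reconstructed from the description above (the exact placement of the unlock in [5] may differ) and it deliberately ignores everything driver_lock is supposed to protect, so it is a sketch of the idea only, not a proposed fix:

/*
 * Sketch only: drop driver_lock before calling into the PCI reset path
 * so that closing the other devices does not have to wait out the
 * 1 second sleep behind the mutex.
 */
static void vfio_pci_release(void *device_data)
{
        struct vfio_pci_device *vdev = device_data;

        mutex_lock(&driver_lock);

        if (!(--vdev->refcnt)) {
                vfio_spapr_pci_eeh_release(vdev->pdev);
                mutex_unlock(&driver_lock);     /* races with open/disable ignored */
                vfio_pci_disable(vdev);
        } else {
                mutex_unlock(&driver_lock);
        }

        module_put(THIS_MODULE);
}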