Re: [PATCH vfio] vfio/pci: remove msi domain on msi disable

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/18/2023 7:17 AM, Jason Gunthorpe wrote:

On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
The new MSI dynamic allocation machinery is great for making the irq
management more flexible.  It includes caching information about the
MSI domain which gets reused on each new open of a VFIO fd.  However,
this causes an issue when the underlying hardware has flexible MSI-x
configurations, as a changed configuration doesn't get seen between
new opens, and is only refreshed between PCI unbind/bind cycles.

In our device we can change the per-VF MSI-x resource allocation
without the need for rebooting or function reset.  For example,

   1. Initial power up and kernel boot:
       # lspci -s 2e:00.1 -vv | grep MSI-X
               Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-

   2. Device VF configuration change happens with no reset

Is this an out of tree driver problem?

No, not an out-of-tree driver, this is the vfio pci core.


The intree way to alter the MSI configuration is via
sriov_set_msix_vec_count, and there is only one in-tree driver that
uses it right now.

If something is going wrong here it should be fixed in the
sriov_set_msix_vec_count() machinery, possibly in the pci core to
synchronize the msi_domain view of the world.

Jason

The sriov_set_msix_vec_count method assumes
 (a) the unbind/bind cycle on the VF, and
 (b) VF MSIx count change configured from the host
neither of which are the case in our situation.

In our case, the VF device's msix count value found in PCI config space is changed by device configuration management outside of the baremetal host and read by the QEMU instance when it starts up, and then read by the vfio PCI core when QEMU requests the first IRQ. The core code enables the msix range on first IRQ request, and disables it when the vfio fd is closed. It creates the msi_domain on the first call if it doesn't exist, but it does not remove the msi_domain when the irqs are disabled.

The IRQ request call trace looks like:
QEMU: vfio_msix_vector_do_use->ioctl()
  (driver.vfio_device_ops.ioctl = vfio_pci_core_ioctl)
  vfio_pci_core_ioctl
    vfio_pci_ioctl_set_irqs
      vfio_pci_set_irqs_ioctl
        vfio_pci_set_msi_trigger
          vfio_msi_enable
            pci_alloc_irq_vectors
              pci_alloc_irq_vectors_affinity
                __pci_enable_msix_range
                  pci_setup_msix_device_domain
                        return if msi_domain exists
                    pci_create_device_domain
                      msi_create_device_irq_domain
                        __msi_create_irq_domain - sets info->hwsize
		  msi_capability_init
		    msi_setup_msi_desc
		      msi_insert_msi_desc
                        msi_domain_insert_msi_desc
                          msi_insert_desc
			    fail if index >= hwsize

On close of the vfio fd, the trace is:
QEMU: close()
  driver.vfio_device_ops.close
    vfio_pci_core_close_device
      vfio_pci_core_disable
        vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
				VFIO_IRQ_SET_ACTION_TRIGGER,
				vdev->irq_type, 0, 0, NULL);
          vfio_pci_set_msi_trigger
            vfio_msi_disable
              pci_free_irq_vectors

The msix vectors are freed, but the msi_domain is not, and the msi_domain holds the MSIx count that it read when it was created. If the device's MSIx count is increased, the next QEMU session will see the new number in PCI config space and try to use that new larger number, but the msi_domain is still using the smaller hwsize and the QEMU IRQ setup fails in msi_insert_desc().

This patch adds a msi_remove_device_irq_domain() call when the irqs are disabled in order to force a new read on the next IRQ allocation cycle. This is limited to only the vfio use of the msi_domain.

I suppose we could add this to the trailing end of callbacks in our own driver, but this looks more like a generic vfio/msi issue than a driver specific thing.

The other possibility is to force the user to always do a bind cycle between QEMU sessions using the VF. This seems to be unnecessary overhead and was not necessary when using the v6.1 kernel. To the user, this looks like a regression - this is how it was reported to me.

sln



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux