> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Sent: Wednesday, June 23, 2021 7:59 AM
> [...]
> I only can assume that back then the main problem was vector exhaustion
> on the host and to avoid allocating memory for interrupt descriptors
> etc, right?
>
> The host vector exhaustion problem was that each MSI-X vector consumed a
> real CPU vector, which is a limited resource on x86. This is no longer
> the case today:
>
>  1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
>     the allocatable range, independent of the number of MSI-X vectors
>     it allocates, unless it is in multi-queue managed mode, where it
>     will actually reserve a vector (maybe two) per CPU.
>
>     But devices which are not using that mode just opportunistically
>     "reserve" vectors.
>
>     All entries are initialized with a special system vector which,
>     when raised, will emit a nastigram in dmesg.
>
>  2) request_irq() actually allocates a CPU vector from the
>     allocatable vector space, which can obviously still fail, which is
>     perfectly fine.
>
> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

Curious about the IRTE entry when IRQ remapping is enabled: is it also
allocated at request_irq()?

> Though for virtualization there is still another problem:
>
> Even if all possible MSI-X vectors for a passthrough PCI device would
> be allocated upfront, independent of the actual usage in the guest,
> then there is still the issue of request_irq() failing on the host
> once the guest decides to use any of those interrupts.
>
> It's a halfways reasonable argumentation, by some definition of
> reasonable, that this case would be a host system configuration problem
> and the admin who overcommitted is responsible for the consequence.
>
> Where the only reasonable consequence is to kill the guest right there,
> because there is no mechanism to actually tell it that the host ran out
> of resources.
> Not at all a pretty solution, but it is, contrary to the status quo,
> well defined. The most important aspect is that it is well defined for
> the case of success:
>
> If it succeeds, then there is no way that already requested interrupts
> can be lost or end up being redirected to the legacy PCI irq due to
> clearing the MSI-X enable bit, which is a gazillion times better than
> the "let's hope it works" based tinkerware we have now.

Fair enough.

> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
> out, even thinking about proliferating this with IMS is bonkers.
>
> IMS is meant to avoid the problem of MSI-X, which needs to disable
> MSI-X in order to expand the number of vectors. The use cases are going
> to be even more dynamic than the usual device drivers, so the lost
> interrupt issue will be much more likely to trigger.
>
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.

So the correct flow is like below:

    guest::enable_msix()
        trapped_by_host()
            pci_alloc_irq_vectors(); // for all possible vMSI-X entries
            pci_enable_msix();

    guest::unmask()
        trapped_by_host()
            request_irqs();

The first trap calls a new VFIO ioctl, e.g. VFIO_DEVICE_ALLOC_IRQS. The
second trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just does
request_irq() if the specified irqs have been allocated.
Then map IMS to this flow:

    guest::enable_msix()
        trapped_by_host()
            msi_domain_alloc_irqs(); // for all possible vMSI-X entries
            for_all_allocated_irqs(i)
                pci_update_msi_desc_id(i, default_pasid); // a new helper func

    guest::unmask(entry#0)
        trapped_by_host()
            request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry

    guest::set_msix_perm(entry#1, guest_sva_pasid)
        trapped_by_host()
            pci_update_msi_desc_id(1, host_sva_pasid);

    guest::unmask(entry#1)
        trapped_by_host()
            request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry

Does the above match your thoughts?

Thanks
Kevin