Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Sun, 08 Nov 2020 23:47:13 +0100

On Sun, Nov 08 2020 at 19:36, David Woodhouse wrote:
> On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
>> So this needs some thought.
>
> The problem here is that Intel implemented interrupt remapping in a way
> which is anathema to structured, ordered IRQ domains.
>
> When a guest writes an MSI message (addr/data) to the MSI table of a
> PCI device which has been assigned to that guest, it *doesn't* properly
> inherit the MSI composition from a parent irqdomain which knows about
> the (host-side) IOMMU.
>
> What actually happens is the hypervisor *traps* the writes to the
> device's MSI table, and translates them *then*.

That's what I showed in the ascii art :)

> In *precisely* the fashion which we're trying to avoid for IMS.

At least for the IMS variant where the storage is not in trappable
device memory.

> Now, you can imagine a world where it wasn't like this, where
> Remappable Format MSI messages don't exist, and where we let guests
> write native MSI messages to the device without trapping, and where the
> IOMMU then sees the incoming interrupt and has to map the APIC ID to a
> *virtual* CPU for that guest, based on the PCI source-id of the
> device.

That would not be convoluted enough and would make too much sense.

> In that world, IMS would work naturally. But that isn't how Intel
> designed interrupt remapping. They *designed* it to have to trap and
> translate as the message is written to the device.
>
> So it does look like we're going to need a hypercall interface to
> compose an MSI message on behalf of the guest, for IMS to use. In fact
> PCI devices assigned to a guest could use that too, and then we'd only
> need to trap-and-remap any attempt to write a Compatibility Format MSI
> to the device's MSI table, while letting Remappable Format messages get
> written directly.

Yes, if we have the HCALL domain then the message composed by the
hypervisor is valid for everything, not only IMS. That's why I left out
any specifics on the busdomain side. It does not matter which kind of
bus that is. The only mechanism the busdomain provides is to store the
precomposed message and eventually provide mask/unmask at that level.
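
To illustrate the "only trap Compatibility Format writes" idea, here is
a minimal sketch. It assumes the VT-d address layout, where bit 4 of the
MSI address selects Remappable Format. The helper names are hypothetical
and this is not code from the patch series:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86 MSI address window 0xFEExxxxx, VT-d format bit at bit 4. */
#define MSI_ADDR_BASE_MASK	0xfff00000u
#define MSI_ADDR_BASE		0xfee00000u
#define MSI_ADDR_IR_FORMAT	(1u << 4)

/* Hypothetical helper: true if the low address word is Remappable Format. */
static bool msi_addr_is_remappable(uint32_t addr_lo)
{
	return (addr_lo & MSI_ADDR_BASE_MASK) == MSI_ADDR_BASE &&
	       (addr_lo & MSI_ADDR_IR_FORMAT) != 0;
}

/*
 * Hypothetical policy: a message precomposed by the hypervisor (Remappable
 * Format) could be written directly; anything else needs trap-and-remap.
 */
static bool msi_write_needs_trap(uint32_t addr_lo)
{
	return !msi_addr_is_remappable(addr_lo);
}
```

Whether a hypervisor can actually let such writes through untrapped
depends on details this sketch ignores, e.g. validating that the handle
in the message belongs to that guest.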

> We'd also need a way for an OS running on bare metal to *know* that
> it's on bare metal and can just compose MSI messages for itself. Since
> we do expect bare metal to have an IOMMU, perhaps that is just a
> feature flag on the IOMMU?

There are still CPUs w/o IOMMU out there and new ones are shipped.

So you would basically mandate that IMS with memory storage can only
work on bare metal when the CPU has an IOMMU.

Jason said in [1]: "For x86 I think we could accept linking this to
IOMMU, if really necessary."

OTOH, what's the chance that a guest runs on something which

  1) Does not have X86_FEATURE_HYPERVISOR set in cpuid 1/ECX

and

  2) Cannot be identified as Xen domain

and

  3) Does not have a DMI vendor entry which identifies the
     virtualization solution (we don't use that today, but
     adding that table is trivial enough)

and

  4) Has such an IMS device passed through?

Possible, yes. Likely, no. Do we care?
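
The checks above boil down to something like the sketch below. The
struct and function names are made up for illustration, and the inputs
are passed in rather than probed, so this is not how the kernel wires it
up. The only hard fact used is that the hypervisor bit is bit 31 of
CPUID leaf 1 ECX:

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_ECX_HYPERVISOR	(1u << 31)

/* Hypothetical container for the three hints listed above. */
struct platform_hints {
	uint32_t cpuid_1_ecx;	/* ECX output of CPUID leaf 1 */
	bool xen_domain;	/* identified as a Xen domain */
	bool dmi_virt_vendor;	/* DMI vendor matches a known hypervisor */
};

/* True if any hint says we are running as a guest. */
static bool probably_virtualized(const struct platform_hints *h)
{
	return (h->cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR) ||
	       h->xen_domain || h->dmi_virt_vendor;
}

/*
 * Hypothetical policy: compose MSI messages natively only when nothing
 * indicates a hypervisor. A guest that hides all three hints and still
 * has an IMS device passed through is the unlikely case in question.
 */
static bool may_compose_msi_natively(const struct platform_hints *h)
{
	return !probably_virtualized(h);
}
```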

> That or Intel needs to fix the IOMMU to do proper virtualisation and
> actually translate "Compatibility Format" MSIs for a guest too.

Is that going to happen before I retire?

Thanks,

        tglx

[1] https://lore.kernel.org/r/20200822005125.GB1152540@xxxxxxxxxx