Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

Jason Gunthorpe <jgg@xxxxxxxxxx> · Sun, 8 Nov 2020 19:23:41 -0400

On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> 
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
> 
> 
>                      |
>                      |
>   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
>                      |
>   Alloc          "Compose"                   Store         Use
> 
>   Vectordomain   HCALLdomain                Busdomain
>                  |        ^
>                  |        |
>                  v        | 
>             Hypervisor    
>                Alloc + Compose

Yes, this will describes what I have been thinking

> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).

There are four cases of interest here:

 1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
    to the APIC. IMS works perfectly with pci_subdevice_msi_create_irq_domain()

 2) SRIOV VF assigned to the guest.

    The guest can cause any MemWr TLP to any addr/data pair
    and the iommu/platform/vmm is supposed to use the
    Bus/device/function to isolate & secure the interrupt address
    range.

    IMS can work in the guest if the guest knows the details of the
    address range and can make hypercalls to setup routing. So
    pci_subdevice_msi_create_irq_domain() works if the hypercalls
    exist and fails if they don't.

 3) SIOV sub device assigned to the guest.

    The difference between SIOV and SRIOV is the device must attach a
    PASID to every TLP triggered by the guest. Logically we'd expect
    when IMS is used in this situation the interrupt MemWr is tagged
    with bus/device/function/PASID to uniquly ID the guest and the same
    security protection scheme from #2 applies.

 4) SIOV sub device assigned to the guest, but with emulation.

    This SIOV device cannot tag interrupts with PASID so cannot do #2
    (or the platform cannot recieve a PASID tagged interrupt message).

    Since the interrupts are being delivered with TLPs pointing at the
    hypervisor the only solution is for the hypervisor to exclusively
    control the interrupt table. MSI table like emulation for IMS is
    needed and the hypervisor will use pci_subdevice_msi_create_irq_domain()
    to get the real interrupts.

    pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
    addr/data pairs which are actually an ABI between the guest and
    hypervisor carried in the hidden hypercall of the emulation.
    (ie it works like MSI works today)

IDXD is worring about case #4, I think, but I didn't follow in that
whole discussion about the IMS table layout if they PASID tag the IMS
MemWr or not?? Ashok can you clarify?

> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.

Right. Only PASID on the interrupt messages can resolve this securely.

> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.

Yes

> But there's a catch:
> 
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.

Yes

The flip side is today, the way pci_subdevice_msi_create_irq_domain()
works a VF using it on baremetal will succeed and if that same VF is
assigned to a guest then pci_subdevice_msi_create_irq_domain()
succeeds but the interrupt never comes - so the driver is broken.

Yes, if we add some ACPI/etc flag that is not going to magically fix
old kvm's, but at least *something* exists that works right and
generically.

If we follow Intel's path then we need special KVM level support for
*every* device, PCI cap mangling and so on. Forever. Sounds horrible
to me..

This feels like one of these things where no matter what we do
something is broken. Picking the least breakage is the challenge here.

> So this needs some thought.

I think your HAOLL diagram is the only sane architecture.

If go that way then case #4 will still work, in this case the HCALL
will return addr/data pairs that conform to what the emulation
expects. Or fail if the VMM can't do emulation for the device.

Jason