RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Mon, 9 Nov 2020 07:37:03 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Monday, November 9, 2020 7:24 AM
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                      |
> >                      |
> >   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
> >                      |
> >   Alloc          "Compose"                   Store         Use
> >
> >   Vectordomain   HCALLdomain                Busdomain
> >                  |        ^
> >                  |        |
> >                  v        |
> >             Hypervisor
> >                Alloc + Compose
> 
> Yes, this will describes what I have been thinking

Agree

> 
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> There are four cases of interest here:
> 
>  1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
>     to the APIC. IMS works perfectly with
> pci_subdevice_msi_create_irq_domain()
> 
>  2) SRIOV VF assigned to the guest.
> 
>     The guest can cause any MemWr TLP to any addr/data pair
>     and the iommu/platform/vmm is supposed to use the
>     Bus/device/function to isolate & secure the interrupt address
>     range.
> 
>     IMS can work in the guest if the guest knows the details of the
>     address range and can make hypercalls to setup routing. So
>     pci_subdevice_msi_create_irq_domain() works if the hypercalls
>     exist and fails if they don't.
> 
>  3) SIOV sub device assigned to the guest.
> 
>     The difference between SIOV and SRIOV is the device must attach a
>     PASID to every TLP triggered by the guest. Logically we'd expect
>     when IMS is used in this situation the interrupt MemWr is tagged
>     with bus/device/function/PASID to uniquly ID the guest and the same
>     security protection scheme from #2 applies.

Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEExxxxx
as interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
normally through DMA remapping page table. I don't know other IOMMU
vendors. But at least on Intel platform such device would not get the 
desired effect, since the IOMMU only guarantees interrupt isolation in 
BDF-level.

Does your device already implement such capability? We can bring this 
request back to the hardware team. 

> 
>  4) SIOV sub device assigned to the guest, but with emulation.
> 
>     This SIOV device cannot tag interrupts with PASID so cannot do #2
>     (or the platform cannot recieve a PASID tagged interrupt message).
> 
>     Since the interrupts are being delivered with TLPs pointing at the
>     hypervisor the only solution is for the hypervisor to exclusively
>     control the interrupt table. MSI table like emulation for IMS is
>     needed and the hypervisor will use
> pci_subdevice_msi_create_irq_domain()
>     to get the real interrupts.
> 
>     pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
>     addr/data pairs which are actually an ABI between the guest and
>     hypervisor carried in the hidden hypercall of the emulation.
>     (ie it works like MSI works today)
> 
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 
> > Is the IOMMU/Interrupt remapping unit able to catch such messages which
> > go outside the space to which the guest is allowed to signal to? If yes,
> > problem solved. If no, then IMS storage in guest memory can't ever work.
> 
> Right. Only PASID on the interrupt messages can resolve this securely.
> 
> > So in case that the HCALL domain is missing, the Vector domain needs
> > return an error code on domain creation. If the HCALL domain is there
> > then the domain creation works and in case of actual interrupt
> > allocation the hypercall either returns a valid composed message or an
> > appropriate error code.
> 
> Yes
> 
> > But there's a catch:
> >
> > This only works when the guest OS actually knows that it runs in a
> > VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> > solved because from the guest OS view that's the same as running on bare
> > metal. Obviously on bare metal the Vector domain can and must handle
> > this.
> 
> Yes
> 
> The flip side is today, the way pci_subdevice_msi_create_irq_domain()
> works a VF using it on baremetal will succeed and if that same VF is
> assigned to a guest then pci_subdevice_msi_create_irq_domain()
> succeeds but the interrupt never comes - so the driver is broken.

Yes, this is the main worry here. While all agree that using hypercall is 
the proper way to virtualize IMS, how to disable it when hypercall is
not available is a more urgent demand at current stage.

> 
> Yes, if we add some ACPI/etc flag that is not going to magically fix
> old kvm's, but at least *something* exists that works right and
> generically.

Agree. We can work together on this definition. 

btw in reality such ACPI extension doesn't exist yet, which likely will
take some time. In the meantime we already have pending usages 
like IDXD. Do you suggest holding these patches until we get ASWG 
to accept the extension, or accept using Intel IMS cap as a vendor
specific mitigation to move forward while the platform flag is being 
worked on? Anyway the IMS cap is already defined and can help fix 
some broken cases.

> 
> If we follow Intel's path then we need special KVM level support for
> *every* device, PCI cap mangling and so on. Forever. Sounds horrible
> to me..
> 
> This feels like one of these things where no matter what we do
> something is broken. Picking the least breakage is the challenge here.
> 
> > So this needs some thought.
> 
> I think your HAOLL diagram is the only sane architecture.
> 
> If go that way then case #4 will still work, in this case the HCALL
> will return addr/data pairs that conform to what the emulation
> expects. Or fail if the VMM can't do emulation for the device.
> 

Thanks
Kevin