On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:
>> >> Is the IOMMU/Interrupt remapping unit able to catch such messages
>> >> which go outside the space to which the guest is allowed to signal
>> >> to? If yes, problem solved. If no, then IMS storage in guest
>> >> memory can't ever work.
>
> The SIOV case is to take a single RID and split it to multiple
> VMs and also to the hypervisor. All these things concurrently use the
> same RID, and the IOMMU can't tell them apart.
>
> The hypervisor security domain owns TLPs with no PASID. Each PASID is
> assigned to a VM.
>
> For interrupts, today, they are all generated, with no PASID, to the
> same RID. There is no way for remapping to protect against a guest
> without also checking the PASID.
>
> The relevance of PASID is this:
>
>> Again, trap/emulate does not work for IMS when the IMS store is
>> software-managed guest memory and not part of the device. And that's
>> the whole reason why we are discussing this.
>
> With PASID-tagged interrupts and an IOMMU interrupt remapping
> capability that can trigger on PASID, the platform can provide
> the same level of security as SRIOV - the above is no problem.
>
> The device ensures that all DMAs and all interrupts programmed by the
> guest are PASID tagged, and the platform provides security by
> checking the PASID when delivering the interrupt.

Correct.

> Intel IOMMU doesn't work this way today, but it makes a lot of design
> sense.

Right.

> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.

Ok. Let me summarize the current state of supported scenarios:

 1) SRIOV works with any form of IMS storage because it does not
    require PASID and the VF devices have unique requester IDs, which
    allows the remap unit to sanity check the message.
 2) SIOV with IMS works when the hypervisor can manage the IMS store
    exclusively.

So #2 prevents a device which handles IMS storage in queue memory from
utilizing IMS for SIOV in a guest, because the hypervisor cannot manage
the IMS message store and the guest can write arbitrary crap to it,
which violates the isolation principle.

And here is the relevant part of the SIOV spec:

 "IMS is managed by host driver software and is not accessible directly
  from guest or user-mode drivers.

  Within the device, IMS storage is not accessible from the ADIs. ADIs
  can request interrupt generation only through the device's 'Interrupt
  Message Generation Logic', which allows an ADI to only generate
  interrupt messages that are associated with that specific ADI. These
  restrictions ensure that the host driver has complete control over
  which interrupt messages can be generated by each ADI.

  On Intel 64 architecture platforms, message signaled interrupts are
  issued as DWORD size untranslated memory writes without a PASID TLP
  Prefix, to address range 0xFEExxxxx. Since all memory requests
  generated by ADIs include a PASID TLP Prefix, it is not possible for
  an ADI to generate a DMA write that would be interpreted by the
  platform as an interrupt message."

That's the reductio ad absurdum for this sentence in the first
paragraph of the preceding chapter, which describes the concept of IMS:

 "IMS enables devices to store the interrupt messages for ADIs in a
  device-specific optimized manner without the scalability restrictions
  of the PCI Express defined MSI-X capability."

"Device-specific optimized manner" is either wishful thinking or
marketing-induced verbal diarrhoea. The current specification puts
massive restrictions on IMS storage which do _not_ allow optimizing it
in a device-specific manner, as demonstrated in this discussion. It
also precludes obvious use cases like passing a full device to a guest
and letting the guest manage SIOV subdevices for containers or nested
guests.
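To make the two mechanisms above concrete, here is a rough C sketch of
the checks involved: the platform decode rule from the spec quote (an
MSI is a DWORD write without a PASID TLP prefix to 0xFEExxxxx, so a
PASID-tagged ADI write can never be interpreted as an interrupt), and a
*hypothetical* PASID-aware remap entry check of the kind Jason
describes. All struct and function names are made up for illustration;
this is not how VT-d is implemented today.

```c
/*
 * Illustrative sketch only, NOT the Intel VT-d implementation.
 * All names are invented for this example.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSI_ADDR_BASE	0xFEE00000u	/* 0xFEExxxxx interrupt range */
#define MSI_ADDR_MASK	0xFFF00000u

struct tlp_write {
	uint16_t rid;		/* PCI requester ID */
	bool     has_pasid;	/* PASID TLP prefix present? */
	uint32_t pasid;
	uint64_t addr;
	uint32_t size;		/* bytes */
};

/*
 * Platform decode: a memory write is only treated as an interrupt
 * message if it is a DWORD write without a PASID prefix into the
 * 0xFEExxxxx range.  Every ADI request carries a PASID prefix, so
 * an ADI cannot forge an interrupt via plain DMA.
 */
static bool is_interrupt_message(const struct tlp_write *w)
{
	return !w->has_pasid && w->size == 4 &&
	       (w->addr & MSI_ADDR_MASK) == MSI_ADDR_BASE;
}

/* Hypothetical PASID-aware interrupt remap table entry. */
struct irte {
	bool     valid;
	uint16_t rid;		/* owning requester ID */
	bool     pasid_check;	/* also match PASID (SIOV case)? */
	uint32_t pasid;		/* the VM's PASID if pasid_check */
};

/*
 * SRIOV works today because each VF has a unique RID, so matching the
 * RID is sufficient.  SIOV subdevices share one RID, so the entry
 * would additionally have to match the PASID - the capability the
 * discussion above says current hardware lacks.
 */
static bool remap_allowed(const struct irte *e, uint16_t rid,
			  bool has_pasid, uint32_t pasid)
{
	if (!e->valid || e->rid != rid)
		return false;
	if (e->pasid_check)
		return has_pasid && pasid == e->pasid;
	return !has_pasid;	/* hypervisor-owned: no PASID prefix */
}
```

Under this model a PASID-tagged DWORD write from an ADI to 0xFEE00000
is simply not decoded as an interrupt, and a PASID-checking remap entry
would reject a message from the right RID but the wrong PASID.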
TBH, to me this is just another hastily cobbled together, half thought
out misfeature cast in silicon. The proposed software support follows
exactly the same principle.

So before we go anywhere with this, I want to see a proper way forward
to support _all_ sensible use cases and to fulfil the promise of
"device-specific optimized manner" at the conceptual and specification
level and also at the code level.

I'm not at all interested in rushing in support for a half-baked Intel
centric solution which other people have to clean up after the fact
(again).

IOW, it's time to go back to the drawing board.

Thanks,

        tglx