On Fri, Nov 06 2020 at 20:12, Jason Gunthorpe wrote:
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
> Earlier in this thread Kevin said hyper-v is already working this way,
> even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
> problem and it should not be solved by PCI capability flags.

I mostly agree, but want to add a few clarifications about the
terminology and the boundaries, because I think that is where a lot of
the confusion comes from.

Let me go back to the basic structure both at the hardware and at the
software level. The basic structure is:

  [CPU] -- [Bridge] -- Bus -- [Device]

This applies to all kinds of buses where the bridge directly translates
into the CPU's address space.

Now let's look at the boundaries:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |
                |

The boundary is in the middle of the bridge because the CPU side of the
bridge is obviously CPU and therefore architecture specific. The Bus
side of the bridge is architecture agnostic.

Now let's add an IOMMU:

  [CPU] -- [IOMMU] -- [Bridge] -- Bus -- [Device]

and in theory the boundary moves now to:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
               |

because with an IOMMU the bridge could become CPU and architecture
agnostic. In reality this is not the case as the bridge is still the
same thing.

Now let's look at MSI. As established above, the Bus and the Device are
CPU and architecture agnostic, and the Device merely uses a composed
message, which is stored at some place accessible to the device, to send
that message when it raises an interrupt.

So where is this message composed?

The basic case:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |
  Alloc +
  Compose                 Store  Use

The Bridge is irrelevant here. From the point of view of the interrupt
subsystem it is only involved in the transport.

The IOMMU case:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
  Alloc +       Alloc
                Compose              Store  Use

That's exactly reflected in hierarchical irq domains:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |
  Alloc +
  Compose                 Store  Use

  Vectordomain            Busdomain

and:

               |
               |
  [CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
               |
  Alloc +       Alloc
                Compose              Store  Use

  Vectordomain  Remapdomain          Busdomain

Now if we look at the virtualization scenario and device pass-through,
then the structure in the guest view is not any different from the
basic case. This works with PCI-MSI[X] and the IDXD IMS variant because
the hypervisor can trap the access to the storage and translate the
message:

                |
                |
  [CPU] -- [Bri | dge] -- Bus -- [Device]
                |
  Alloc +
  Compose                 Store  Use
                            |
                            | Trap
                            v
             Hypervisor translates and stores

But with an IMS storage location which is software controlled by the
guest side driver (the case Jason is interested in) the above cannot
work, because there is no storage access which the hypervisor could
trap.

That means the guest needs a way to ask the hypervisor for a proper
translation, i.e. a hypercall. Now where to do that? Looking at the
above remapping case it's pretty obvious:

               |
               |
  [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
               |
  Alloc         "Compose"           Store  Use

  Vectordomain  HCALLdomain         Busdomain
                   |  ^
                   |  |
                   v  |
                Hypervisor
                Alloc + Compose

Why? Because it reflects the boundaries and leaves the busdomain part
agnostic as it should be. And it works for _all_ variants of
Busdomains.

Now the question which I can't answer is whether this can work
correctly in terms of isolation. If the IMS storage is in guest memory
(queue storage) then the guest driver can obviously write random crap
into it which the device will happily send. (For MSI and IDXD style IMS
the hypervisor can still trap the store.)

Is the IOMMU/Interrupt remapping unit able to catch such messages which
go outside the space to which the guest is allowed to signal? If
yes, problem solved. If no, then IMS storage in guest memory can't ever
work.
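
To make the layering above concrete, here is a toy model in plain C.
It is an illustration only and not kernel code: every name in it
(struct domain, slot_alloc, hcall_compose, hypervisor_present, ...) and
the message encodings are made up. It assumes that allocation runs from
the CPU side down, that the composing level nearest to the bus domain
builds the message, and that the bus domain merely stores the result.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: all names and message encodings are hypothetical. */
struct msg {
	uint64_t addr;
	uint32_t data;
};

struct domain {
	struct domain *parent;
	int (*alloc)(struct domain *d);	/* NULL: nothing to reserve here */
	int (*compose)(struct domain *d, struct msg *m); /* NULL: no compose */
	int next;		/* next free resource at this level */
	int last;		/* resource handed out by the latest alloc */
	struct msg store;	/* message storage, used by the bus domain */
};

/* Used by the vector and remap levels: hand out the next free slot. */
int slot_alloc(struct domain *d)
{
	d->last = d->next++;
	return 0;
}

/* Vector domain: the message targets the allocated vector directly. */
int vector_compose(struct domain *d, struct msg *m)
{
	m->addr = 0xfee00000u;
	m->data = (uint32_t)d->last;
	return 0;
}

/* Remap domain: the message carries a remap table index instead. */
int remap_compose(struct domain *d, struct msg *m)
{
	m->addr = 0xfee00000u | 0x10u;	/* made-up "remapped" marker */
	m->data = (uint32_t)d->last;
	return 0;
}

/* Stand-in for the hypervisor, which does its own alloc + compose. */
int hypervisor_present = 1;
static int host_next_index = 100;

/* HCALL domain: "compose" means asking the hypervisor for a message. */
int hcall_compose(struct domain *d, struct msg *m)
{
	(void)d;
	if (!hypervisor_present)
		return -EINVAL;
	m->addr = 0xfee00000u | 0x10u;
	m->data = (uint32_t)host_next_index++;
	return 0;
}

/* Allocate from the root (CPU side) down to the requesting domain. */
int alloc_chain(struct domain *d)
{
	int ret;

	assert(d != NULL);
	if (d->parent) {
		ret = alloc_chain(d->parent);
		if (ret)
			return ret;
	}
	return d->alloc ? d->alloc(d) : 0;
}

/*
 * Bus domain entry point: allocate, let the nearest composing parent
 * build the message, then keep it in the bus domain's storage. The bus
 * domain never looks at what kind of parent it has.
 */
int bus_setup_irq(struct domain *bus)
{
	struct domain *p;
	int ret = alloc_chain(bus);

	if (ret)
		return ret;
	for (p = bus->parent; p; p = p->parent)
		if (p->compose)
			return p->compose(p, &bus->store);
	return -EINVAL;
}
```

With a vector domain as the only parent, bus_setup_irq() stores the
vector domain's message. Putting a remap or HCALL level in between
changes the stored message without touching the bus domain code, and
clearing hypervisor_present makes the HCALL variant return an error
instead of a message.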

Coming back to this:

> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.

The subdevice domain is a 'Busdomain' according to the structure
above. It does not and should never have any clue about the underlying
system. It's in the agnostic part and always works. It simply does not
care what's underneath. So it won't return -EOPNOTSUPP.

What it has to do is to transport the 'IMS in queue memory' requirement
to the underlying parent domain.

So in case the HCALL domain is missing, the Vector domain needs to
return an error code on domain creation. If the HCALL domain is there,
then the domain creation works, and on actual interrupt allocation the
hypercall either returns a valid composed message or an appropriate
error code.

But there's a catch:

This only works when the guest OS actually knows that it runs in a
VM. If the guest can't figure that out, e.g. via CPUID, this cannot be
solved because from the guest OS view that's the same as running on
bare metal. Obviously on bare metal the Vector domain can and must
handle this.

So this needs some thought.

Thanks,

        tglx