On 2023-04-26 12:57, Jason Gunthorpe wrote:
On Fri, Apr 21, 2023 at 02:58:01PM -0300, Jason Gunthorpe wrote:
which for practical purposes in this context means an ITS.
I haven't delved into it in super detail, but my impression was...
The ITS page only becomes relevant to the IOMMU layer if the actual
IRQ driver calls iommu_dma_prepare_msi()
Nicolin and I sat down and traced this through; this explanation is
almost right...
irq-gic-v4.c is some kind of submodule of irq-gic-v3-its.c, so it does
end up calling iommu_dma_prepare_msi(); however...
Ignore GICv4; that basically only makes a difference to what happens
after the CPU receives an interrupt.
qemu will set up the ACPI so that the VM thinks the ITS page is at
0x08080000. I think it maps some dummy CPU memory to this address.
iommufd will map the real ITS page at MSI_IOVA_BASE = 0x8000000 (!!)
and only into the IOMMU
qemu will set up some RMRR thing to make 0x8000000 1:1 at the VM's
IOMMU
When the DMA API is used, iommu_dma_prepare_msi() is called, which will
select an MSI page address that avoids the reserved region, so it is
some random value != 0x8000000, and maps the dummy CPU page to it.
The VM will then do an MSI-X programming cycle with the S1 IOVA of the
CPU page and the data. qemu traps this and throws away the address
from the VM. The kernel sets up the interrupt and assumes 0x8000000
is the right IOVA.
When VFIO is used, iommufd in the VM will force the MSI window to
0x8000000, and instead of putting a 1:1 mapping there we map the dummy
CPU page, and then everything is broken. Adding the reserved check is an
improvement.
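
As a rough illustration of the two cases above, here is a toy Python
model of the two translation stages. It is only a sketch: the dict-based
"page tables", the helper names, and the REAL_ITS_PA, DUMMY_PA and
DMA_API_MSI_IOVA values are made up for the example, and the assumption
that the dummy page is what sits behind 0x08080000 in S2 is just that, an
assumption. Only 0x08080000 and MSI_IOVA_BASE = 0x8000000 come from the
description above.

```python
# Toy model of nested translation for an MSI write; not kernel or qemu code.
MSI_IOVA_BASE = 0x8000000      # host-side MSI window (from the description)
GUEST_ITS_IPA = 0x08080000     # where the VM believes the ITS page lives
REAL_ITS_PA = 0x2F020000       # hypothetical physical ITS page
DUMMY_PA = 0x40000000          # hypothetical dummy memory behind GUEST_ITS_IPA
DMA_API_MSI_IOVA = 0xFFFF0000  # hypothetical IOVA outside the reserved region

# S2 (host side): the real ITS is reachable only at IPA == MSI_IOVA_BASE.
# Mapping GUEST_ITS_IPA to a dummy page is an assumption of this sketch.
S2 = {MSI_IOVA_BASE: REAL_ITS_PA, GUEST_ITS_IPA: DUMMY_PA}


def translate(s1, s2, iova):
    """Walk S1 then S2; return the physical address, or None on a fault."""
    ipa = s1.get(iova)
    return None if ipa is None else s2.get(ipa)


def device_msi_target(s1):
    # The guest's MSI-X address is discarded, so the device always ends up
    # writing to MSI_IOVA_BASE regardless of what the guest asked for.
    return translate(s1, S2, MSI_IOVA_BASE)


# DMA API in the VM: the reserved region stays 1:1 and the guest's MSI
# page is mapped at some other IOVA.
s1_dma_api = {MSI_IOVA_BASE: MSI_IOVA_BASE, DMA_API_MSI_IOVA: GUEST_ITS_IPA}

# VFIO/iommufd in the VM: the MSI window is forced onto MSI_IOVA_BASE,
# replacing the 1:1 mapping with the dummy CPU page.
s1_vfio = {MSI_IOVA_BASE: GUEST_ITS_IPA}

dma_api_ok = device_msi_target(s1_dma_api) == REAL_ITS_PA
vfio_ok = device_msi_target(s1_vfio) == REAL_ITS_PA
print(dma_api_ok, vfio_ok)
```

In this model the DMA API case only works because the 1:1 entry at
0x8000000 happens to survive in S1; the guest's own MSI IOVA is never
used by the device at all.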
The only way to properly fix this is to have qemu stop throwing away
the address during the MSI-X programming. This needs to be programmed
into the device instead.
I have no idea how best to get there with the ARM GIC setup. It feels
really hard.
Give QEMU a way to tell IOMMUFD to associate that 0x08080000 address
with a given device as an MSI target. IOMMUFD then ensures that the S2
mapping exists from that IPA to the device's real ITS (I vaguely
remember Eric had a patch to pre-populate an MSI cookie with specific
pages, which may have been heading along those lines). In the worst case
this might mean having to subdivide the per-SMMU copies of the S2 domain
into per-ITS copies as well, so we'd probably want to detect and compare
devices' ITS parents up-front.
QEMU will presumably also need a way to pass the VA down to IOMMUFD when
it sees the guest programming the MSI (possibly it could pass the IPA at
the same time so we don't need a distinct step to set up S2 beforehand?)
- once the underlying physical MSI configuration comes back from the PCI
layer, that VA just needs to be dropped in to replace the original
msi_msg address.
TBH at that point it may be easier to just not have a cookie in the S2
domain at all when nesting is enabled, and just let IOMMUFD make the ITS
mappings directly for itself.
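
To make the idea concrete, here is a toy Python model of the scheme
described above: S2 maps the guest's ITS IPA to the real ITS, and the
guest-programmed address replaces the address in the host's message.
This is a sketch under assumptions, not kernel, qemu, or IOMMUFD code:
the dict-based tables, program_msi(), the dict standing in for msi_msg,
REAL_ITS_PA, and the example IOVA and data values are all hypothetical.

```python
# Toy model of the proposal; every helper and value here is illustrative.
GUEST_ITS_IPA = 0x08080000  # IPA the VM uses for the ITS (from the thread)
REAL_ITS_PA = 0x2F020000    # hypothetical physical ITS page


def translate(s1, s2, iova):
    """Walk S1 then S2; return the physical address, or None on a fault."""
    ipa = s1.get(iova)
    return None if ipa is None else s2.get(ipa)


def program_msi(host_msg, guest_addr):
    """Keep the host's data but drop in the guest-chosen address."""
    return {"address": guest_addr, "data": host_msg["data"]}


# S2 now maps the guest's ITS IPA to the device's real ITS.
s2 = {GUEST_ITS_IPA: REAL_ITS_PA}

# Guest-side MSI IOVAs: an arbitrary DMA API choice (hypothetical value)
# and the window that iommufd in the VM forces.
cases = {"dma_api": 0xFFFF0000, "vfio": 0x8000000}

results = {}
for name, guest_iova in cases.items():
    s1 = {guest_iova: GUEST_ITS_IPA}                 # guest maps its ITS page
    host_msg = {"address": 0x8000000, "data": 0x20}  # hypothetical host message
    msg = program_msi(host_msg, guest_iova)
    results[name] = translate(s1, s2, msg["address"]) == REAL_ITS_PA

print(results)
```

Both cases reach the real ITS in this model because the device is
programmed with whatever address the guest actually mapped, so nothing
depends on a 1:1 entry at 0x8000000 any more.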
Thanks,
Robin.