Hi Nicolin,

On Fri, 10 Jan 2025 19:32:16 -0800
Nicolin Chen <nicolinc@xxxxxxxxxx> wrote:

> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage
> translation is enabled, IOVA will be still used to program the MSI
> address, though the mappings will be in two stages: IOVA (0xFFFF0000)
> ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for
> Intermediate Physical Address).
>
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by kernel. However, with virtual
> machine enabling a nested IOMMU translation (2-stage), a guest kernel
> directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
> mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space
> (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level
> IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few
> RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> and VMM to agree on the IPA.

Should this RMR be in a separate range from MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no?
e.g.
# cat /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
0x0000000008000000 0x00000000080fffff msi
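
To make the overlap question concrete, here is a minimal sketch (not
anything from the series) that parses reserved_regions output in the
"start end type" form shown above and reports which entries a candidate
range would collide with. The helper names and the second candidate
range are hypothetical, chosen only for illustration.

```python
def parse_reserved_regions(text):
    """Parse 'start end type' lines as printed by reserved_regions."""
    regions = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        start, end, kind = parts
        regions.append((int(start, 16), int(end, 16), kind))
    return regions


def overlapping(regions, start, end):
    """Return regions that intersect the inclusive range [start, end]."""
    return [r for r in regions if r[0] <= end and start <= r[1]]


# Output quoted above from the guest's iommu_group.
sample = "0x0000000008000000 0x00000000080fffff msi\n"
regions = parse_reserved_regions(sample)

# An RMR placed at MSI_IOVA_BASE (0x8000000) collides with the msi entry.
at_msi_base = overlapping(regions, 0x8000000, 0x80FFFFF)

# A hypothetical RMR placed just above the window does not.
above_window = overlapping(regions, 0x8100000, 0x81FFFFF)
```

With the single msi entry above, at_msi_base contains that entry and
above_window is empty, which is the case I am asking about.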