Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> Sent: Saturday, January 11, 2025 3:32 AM
> To: will@xxxxxxxxxx; robin.murphy@xxxxxxx; jgg@xxxxxxxxxx;
> kevin.tian@xxxxxxxxx; tglx@xxxxxxxxxxxxx; maz@xxxxxxxxxx;
> alex.williamson@xxxxxxxxxx
> Cc: joro@xxxxxxxxxx; shuah@xxxxxxxxxx; reinette.chatre@xxxxxxxxx;
> eric.auger@xxxxxxxxxx; yebin (H) <yebin10@xxxxxxxxxx>;
> apatel@xxxxxxxxxxxxxxxx; shivamurthy.shastri@xxxxxxxxxxxxx;
> bhelgaas@xxxxxxxxxx; anna-maria@xxxxxxxxxxxxx; yury.norov@xxxxxxxxx;
> nipun.gupta@xxxxxxx; iommu@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> kvm@xxxxxxxxxxxxxxx; linux-kselftest@xxxxxxxxxxxxxxx;
> patches@xxxxxxxxxxxxxxx; jean-philippe@xxxxxxxxxx; mdf@xxxxxxxxxx;
> mshavit@xxxxxxxxxx; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@xxxxxxxxxx>; smostafa@xxxxxxxxxx;
> ddutile@xxxxxxxxxx
> Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with
> nested SMMU
>
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called the
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a.
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000).
> When a 2-stage translation is enabled, an IOVA is still used to
> program the MSI address, though the mapping is in two stages:
>   IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
> (IPA stands for Intermediate Physical Address).
>
> If the device that generates the MSI is attached to an
> IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of
> the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO
> passthrough device), the IOVA is fixed to an MSI window reported by
> the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to
> MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely
> in charge of the IOMMU translation (1-stage translation) and the IOVA
> for the ITS page is fixed and known by the kernel. However, with a
> virtual machine enabling a nested IOMMU translation (2-stage), a
> guest kernel directly controls the stage-1 translation with an
> IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) onto its
> own IOVA space (e.g. 0xEEEE0000). The host kernel then has no way to
> know that guest-level IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in stage 1. The VMM could insert a few
>    RMRs (Reserved Memory Regions) in the guest's IORT. The guest
>    kernel would then fetch these RMR entries from the IORT and create
>    an IOMMU_RESV_DIRECT region per IOMMU group for a direct mapping.
>    Eventually, the mappings would look like:
>      IOVA (0x8000000) ===> IPA (0x8000000) ===> PA (0x20200000)
>    This requires an IOMMUFD ioctl for the kernel and the VMM to agree
>    on the IPA.
> 2. Forward the guest-level MSI IOVA captured by the VMM to the
>    host-level GIC driver, to program the correct MSI IOVA. Forward
>    the VMM-defined vITS page location (IPA) to the kernel for the
>    stage-2 mapping. Eventually:
>      IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
>    This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl
>    (for the IPA).
>
> Worth mentioning that when Eric Auger was working on the same topic
> with the VFIO iommu uAPI, he had approach (2) first, and then
> switched to approach (1), as suggested by Jean-Philippe to reduce
> complexity.
>
> Approach (1) basically feels like the existing VFIO passthrough that
> has a 1-stage mapping for the unmanaged domain, only shifting the MSI
> mapping from stage 1 (guest-has-no-iommu case) to stage 2
> (guest-has-iommu case). So it could reuse the existing
> IOMMU_RESV_SW_MSI piece, by sharing the same idea of "VMM leaving
> everything to the kernel".
>
> Approach (2) is an ideal solution, yet it requires additional effort
> for the kernel to be aware of the 1-stage gIOVA(s) and 2-stage IPAs
> for the vITS page(s), which demands close cooperation from the VMM.
> * It also brings some complicated use cases to the table where the
>   host or/and guest system(s) has/have multiple ITS pages.
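For reference while reviewing: the fixed MSI window described above is
what the IOMMU driver reports through its reserved-region callback. A
minimal sketch of that reporting, modeled on the ARM SMMU drivers'
MSI_IOVA_BASE reservation in mainline (error handling trimmed, so
illustrative rather than verbatim):

#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/list.h>

#define MSI_IOVA_BASE		0x8000000
#define MSI_IOVA_LENGTH		0x100000

static void arm_smmu_get_resv_regions(struct device *dev,
				      struct list_head *head)
{
	struct iommu_resv_region *region;
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

	/*
	 * Advertise the software-managed MSI window. For an
	 * IOMMU_DOMAIN_UNMANAGED attachment (e.g. VFIO), the MSI
	 * doorbell gets mapped at a fixed IOVA inside this window.
	 */
	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
					 prot, IOMMU_RESV_SW_MSI,
					 GFP_KERNEL);
	if (!region)
		return;

	list_add_tail(&region->list, head);
}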
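On the consumer side, the GIC ITS driver hands the doorbell's physical
address to the IOMMU layer at MSI allocation time via
iommu_dma_prepare_msi(), and iommu_dma_compose_msi_msg() later rewrites
the MSI message with the resulting IOVA. A rough sketch (the wrapper
below is hypothetical, for illustration; the two iommu_dma_* calls are
the real interfaces) of the plumbing that approach (2) would need to
extend with a guest-provided IOVA:

#include <linux/iommu.h>
#include <linux/msi.h>

/*
 * Hypothetical helper: on mainline, irq-gic-v3-its.c makes this call
 * with the PA of the ITS translater page. The IOMMU layer maps that
 * PA into the device's MSI window (the fixed IOMMU_RESV_SW_MSI window
 * for unmanaged domains) and remembers the mapping, so that
 * iommu_dma_compose_msi_msg() can substitute the IOVA for the PA in
 * the MSI message. For approach (2), the host would additionally need
 * the guest-level IOVA (via VFIO) and the vITS IPA (via IOMMUFD)
 * before this mapping can be set up for a nested domain.
 */
static int prepare_its_doorbell(struct msi_desc *desc,
				phys_addr_t doorbell_pa)
{
	return iommu_dma_prepare_msi(desc, doorbell_pa);
}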
I have done some basic sanity tests with this series and the Qemu
branches you provided on HiSilicon hardware. Basic device assignment
works fine. I will rebase my Qemu smmuv3-accel branch on top of this
and do some more tests.

One thing I am not clear about from the above text: do we still plan to
support approach (1) (using RMR in Qemu), or are you just mentioning it
here because it is still possible to make use of it? From previous
discussions, I think the argument was to adopt a more dedicated MSI
pass-through model, which I take to be approach (2) here. Could you
please confirm?

Thanks,
Shameer