RE: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU

Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> Sent: Saturday, January 11, 2025 3:32 AM
> To: will@xxxxxxxxxx; robin.murphy@xxxxxxx; jgg@xxxxxxxxxx;
> kevin.tian@xxxxxxxxx; tglx@xxxxxxxxxxxxx; maz@xxxxxxxxxx;
> alex.williamson@xxxxxxxxxx
> Cc: joro@xxxxxxxxxx; shuah@xxxxxxxxxx; reinette.chatre@xxxxxxxxx;
> eric.auger@xxxxxxxxxx; yebin (H) <yebin10@xxxxxxxxxx>;
> apatel@xxxxxxxxxxxxxxxx; shivamurthy.shastri@xxxxxxxxxxxxx;
> bhelgaas@xxxxxxxxxx; anna-maria@xxxxxxxxxxxxx; yury.norov@xxxxxxxxx;
> nipun.gupta@xxxxxxx; iommu@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> kvm@xxxxxxxxxxxxxxx; linux-kselftest@xxxxxxxxxxxxxxx;
> patches@xxxxxxxxxxxxxxx; jean-philippe@xxxxxxxxxx; mdf@xxxxxxxxxx;
> mshavit@xxxxxxxxxx; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@xxxxxxxxxx>; smostafa@xxxxxxxxxx;
> ddutile@xxxxxxxxxx
> Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with
> nested SMMU
> 
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called the
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a.
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS page:
>   IOVA (0xFFFF0000) ===> PA (0x20200000)
> When a 2-stage translation is enabled, an IOVA is still used to
> program the MSI address, though the mapping is now in two stages:
>   IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
> (IPA stands for Intermediate Physical Address).
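To make the two chains above concrete, here is a tiny standalone C sketch using the example addresses from the text. The single-entry `struct xlate` table and `lookup()` helper are simplified stand-ins for real multi-level IOMMU page tables, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Example addresses from the cover letter text. */
#define ITS_PA   0x20200000UL   /* physical GIC ITS page */
#define MSI_IOVA 0xFFFF0000UL   /* IOVA programmed into the MSI address */
#define VITS_IPA 0x80900000UL   /* intermediate physical address (2-stage) */

/* A single-entry "page table": one input address maps to one output. */
struct xlate {
    uint64_t in;
    uint64_t out;
};

/* Translate addr through one stage; fault (assert) on anything else. */
static uint64_t lookup(const struct xlate *t, uint64_t addr)
{
    assert(addr == t->in);
    return t->out;
}
```

With this, the 1-stage case is a single `lookup()` (IOVA to PA), while the 2-stage case composes two: stage 1 takes the IOVA to the IPA, and stage 2 takes the IPA to the physical ITS page.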
> 
> If the device that generates the MSI is attached to an
> IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of
> the IOVA space. If it is attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a
> VFIO passthrough device), the IOVA is fixed to an MSI window reported
> by the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to
> MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
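The two allocation policies above can be sketched as follows. MSI_IOVA_BASE matches the value quoted in the text; the domain enum and the `msi_iova()` helper are simplified stand-ins for illustration, not the kernel's real API:

```c
#include <assert.h>
#include <stdint.h>

#define MSI_IOVA_BASE 0x8000000UL   /* fixed SW-MSI window on ARM IOMMUs */

enum domain_type {
    DOMAIN_DMA,         /* IOMMU_DOMAIN_DMA: kernel-managed IOVA space */
    DOMAIN_UNMANAGED,   /* IOMMU_DOMAIN_UNMANAGED: e.g. VFIO passthrough */
};

struct domain {
    enum domain_type type;
    uint64_t iova_top;  /* next free IOVA, allocated counting down */
};

/* Pick the IOVA at which the MSI doorbell page will be mapped. */
static uint64_t msi_iova(struct domain *d, uint64_t size)
{
    if (d->type == DOMAIN_UNMANAGED)
        return MSI_IOVA_BASE;       /* fixed, reported via IOMMU_RESV_SW_MSI */
    d->iova_top -= size;            /* dynamic, from the top of the space */
    return d->iova_top;
}
```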
> 
> So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely
> in charge of the IOMMU translation (1-stage translation) and the IOVA
> for the ITS page is fixed and known by the kernel. However, with a
> virtual machine enabling a nested IOMMU translation (2-stage), a
> guest kernel directly controls the stage-1 translation with an
> IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) onto its
> own IOVA space (e.g. 0xEEEE0000). Then, the host kernel cannot know
> that guest-level IOVA to program the MSI address.
> 
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. The VMM could insert a
>    few RMRs (Reserved Memory Regions) in the guest's IORT. The guest
>    kernel would then fetch these RMR entries from the IORT and create
>    an IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
>    Eventually, the mappings would look like:
>      IOVA (0x8000000) ===> IPA (0x8000000) ===> PA (0x20200000)
>    This requires an IOMMUFD ioctl for the kernel and VMM to agree on
>    the IPA.
> 2. Forward the guest-level MSI IOVA captured by the VMM to the
>    host-level GIC driver, to program the correct MSI IOVA. Forward the
>    VMM-defined vITS page location (IPA) to the kernel for the stage-2
>    mapping. Eventually:
>      IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
>    This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl
>    (for the IPA).
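The two approaches can be contrasted as end-to-end translation chains, again with the example addresses from the list above. This is only an illustrative sketch (the `walk()` helper models a single-entry stage-1 plus stage-2 walk, not real page tables); the point is that approach 1 keeps the stage-1 mapping identity so the kernel already knows every address, while approach 2 needs the guest-chosen gIOVA and vITS IPA forwarded to the host:

```c
#include <assert.h>
#include <stdint.h>

/* Walk a 2-stage translation made of two single-entry mappings. */
static uint64_t walk(uint64_t addr, uint64_t s1_in, uint64_t s1_out,
                     uint64_t s2_in, uint64_t s2_out)
{
    uint64_t ipa = (addr == s1_in) ? s1_out : 0;    /* stage 1 */
    return (ipa == s2_in) ? s2_out : 0;             /* stage 2 */
}

/* Approach 1: identity stage-1 via RMR; the kernel knows all addresses. */
static uint64_t approach1(void)
{
    return walk(0x8000000UL,                    /* fixed MSI IOVA    */
                0x8000000UL, 0x8000000UL,       /* identity stage 1  */
                0x8000000UL, 0x20200000UL);     /* stage 2 to ITS PA */
}

/* Approach 2: guest picks the gIOVA; VMM forwards gIOVA and vITS IPA. */
static uint64_t approach2(void)
{
    return walk(0xFFFF0000UL,                   /* guest-chosen gIOVA */
                0xFFFF0000UL, 0x80900000UL,     /* stage 1 to vITS IPA */
                0x80900000UL, 0x20200000UL);    /* stage 2 to ITS PA  */
}
```

Both chains end at the same physical ITS page; they differ only in which party must supply the intermediate addresses.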
> 
> It is worth mentioning that when Eric Auger was working on the same
> topic with the VFIO iommu uAPI, he had approach (2) first, and then
> switched to approach (1), suggested by Jean-Philippe to reduce
> complexity.
> 
> Approach (1) basically feels like the existing VFIO passthrough that
> has a 1-stage mapping for the unmanaged domain, only shifting the MSI
> mapping from stage 1 (guest-has-no-iommu case) to stage 2
> (guest-has-iommu case). So, it could reuse the existing
> IOMMU_RESV_SW_MSI piece, sharing the same idea of "VMM leaving
> everything to the kernel".
> 
> Approach (2) is an ideal solution, yet it requires additional effort
> for the kernel to be aware of the stage-1 gIOVA(s) and the stage-2
> IPAs for the vITS page(s), which demands close cooperation from the
> VMM.
>  * It also brings some complicated use cases to the table, where the
>    host and/or guest system(s) has/have multiple ITS pages.

I have done some basic sanity tests with this series and the QEMU branches you
provided on HiSilicon hardware. Basic device assignment works fine. I will
rebase my QEMU smmuv3-accel branch on top of this and do some more tests.

One point of confusion I have about the text above: do we still plan to support
approach (1) (using RMRs in QEMU), or are you just mentioning it here because
it is still possible to make use of it? I think from previous discussions the
argument was to adopt a more dedicated MSI pass-through model, which I believe
is approach (2) here. Could you please confirm?

Thanks,
Shameer







