Re: RMRR device on non-Intel platform

Robin Murphy <robin.murphy@xxxxxxx> · Fri, 21 Apr 2023 18:22:37 +0100




On 2023-04-21 13:45, Jason Gunthorpe wrote:
On Fri, Apr 21, 2023 at 01:29:46PM +0100, Robin Murphy wrote:

Can you clarify why something other than IOMMU_RESV_SW_MSI would be
needed?

We need iommufd to setup a 1:1 map for the reserved space.

So, of the reserved spaces we have these:

	/* Memory regions which must be mapped 1:1 at all times */
	IOMMU_RESV_DIRECT,

            Block iommufd

	/*
	 * Memory regions which are advertised to be 1:1 but are
	 * commonly considered relaxable in some conditions,
	 * for instance in device assignment use case (USB, Graphics)
	 */
	IOMMU_RESV_DIRECT_RELAXABLE,

            iommufd ignores this one

	/* Arbitrary "never map this or give it to a device" address ranges */
	IOMMU_RESV_RESERVED,

	   iommufd prevents using this IOVA range

	/* Hardware MSI region (untranslated) */
	IOMMU_RESV_MSI,

	   iommufd treats this the same as IOMMU_RESV_RESERVED

	/* Software-managed MSI translation window */
	IOMMU_RESV_SW_MSI,

	   iommufd treats this the same as IOMMU_RESV_RESERVED, also
	   it passes the start to iommu_get_msi_cookie() which
	   eventually maps something, but not 1:1.

I don't think it is a compatible change for IOMMU_RESV_SW_MSI to also
mean 1:1 map?

Bleh, my bad, the nested MSI stuff is right on the limit of the amount 
of horribleness I can hold in my head at once even at the best of times, 
let alone when my head is still half-full of PMUs.

I think a slightly more considered and slightly less wrong version of 
that idea is to mark it as IOMMU_RESV_MSI, and special-case 
direct-mapping those on Arm (I believe it would technically be benign to 
do on x86 too, but might annoy people with its pointlessness). However...

On baremetal we have no idea what the platform put under that
hardcoded address?

On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
the VM pretends it doesn't have an ITS page?  (did I get that right?)

No, that can't be right - PCIe devices have to support MSI or MSI-X, and 
many of them won't support INTx at all, so if the guest wants to use 
interrupts in general it must surely need to believe it has some kind of 
MSI controller, which for practical purposes in this context means an 
ITS. That was the next thing I started wondering after the above - if 
the aim is to direct-map the host's SW_MSI region to effectively pass 
through the S2 MSI cookie, but you have the same Linux SMMU driver in 
the guest, isn't that guest driver still going to add a conflicting 
SW_MSI region for the same IOVA and confuse things?

Ideally for nesting, the VMM would just tell us the IPA of where it's 
going to claim the given device's associated MSI doorbell is, we map 
that to the real underlying address at S2, then the guest can use its S1 
cookie as normal if it wants to, and the host doesn't have to rewrite 
addresses either way. The set_dev_data thing starts to look tempting for 
this too... Given that the nesting usage model inherently constrains the 
VMM's options for emulating the IOMMU, would it be unreasonable to make 
our lives a lot easier with some similar constraints around interrupts, 
and just not attempt to support the full gamut of "emulate any kind of 
IRQ with any other kind of IRQ" irqfd hilarity?

MSI regions already represent "safe" direct mappings, either as an inherent
property of the hardware, or with an actual mapping maintained by software.
Also RELAXABLE is meant to imply that it is only needed until a driver takes
over the device, which at face value doesn't make much sense for interrupts.

I used "relxable" to suggest it is safe for userspace.

I know, but the subtlety is the reason *why* it's safe for userspace. 
Namely that a VFIO driver has bound and reset (or at least taken control 
of) the device, and thus it is assumed to no longer be doing whatever 
the boot firmware left it doing, therefore the reserved region is 
assumed to no longer be relevant, and from then on the requirement to 
preserve it can be relaxed.

We'll still need to set this when the default domain type is identity too -
see the diff I posted (the other parts below I merely implied).

Right, I missed that!

I suggest like this to avoid the double loop:

@@ -1037,9 +1037,6 @@ static int iommu_create_device_direct_mappings(struct iom>
         unsigned long pg_size;
         int ret = 0;
  
-       if (!iommu_is_dma_domain(domain))
-               return 0;
-
         BUG_ON(!domain->pgsize_bitmap);
  
         pg_size = 1UL << __ffs(domain->pgsize_bitmap);

But then you realise that you also need to juggle this around since 
identity domains aren't required to have a valid pgsize_bitmap either, 
give up on the idea and go straight to writing a dedicated loop as the 
clearer and tidier option because hey this is hardly a fast path anyway. 
At least, you do if you're me :)

Cheers,
Robin.

@@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
                 dma_addr_t start, end, addr;
                 size_t map_size = 0;
  
-               start = ALIGN(entry->start, pg_size);
-               end   = ALIGN(entry->start + entry->length, pg_size);
-
                 if (entry->type != IOMMU_RESV_DIRECT &&
                     entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
                         continue;
  
+               if (entry->type == IOMMU_RESV_DIRECT)
+                       dev->iommu->requires_direct = 1;
+
+               if (!iommu_is_dma_domain(domain))
+                       continue;
+
+               start = ALIGN(entry->start, pg_size);
+               end   = ALIGN(entry->start + entry->length, pg_size);
                 for (addr = start; addr <= end; addr += pg_size) {
                         phys_addr_t phys_addr;

Jason