On Thu, Oct 24, 2024 at 02:05:53PM +0100, Will Deacon wrote:
> My recollection is hazy, but I seem to remember VFIO using the largest
> page sizes in the IOMMU 'pgsize_bitmap' for map() requests but then
> using the smallest page size for unmap() requests, so you'd end up
> cracking block mappings when tearing down a VM with assigned devices.
>
> Is this what you're referring to when you say?

Sounds like it, but I'm really hazy on the long ago history here.

> > Long ago VFIO could trigger a path like this, today I know of no user of
> > this functionality.
>
> If so, please can you provide a reference to the patch that moved VFIO
> off that problematic behaviour?

Looking more deeply, I'm not really sure VFIO ever required this.

vfio commit 166fd7d94afd ("vfio: hugepage support for vfio_iommu_type1")
is the thing that added the huge page support, and it called map like:

+		ret = iommu_map(iommu->domain, iova,
+				(phys_addr_t)pfn << PAGE_SHIFT,
+				npage << PAGE_SHIFT, prot);

But then the unmap path still looked like:

+		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
+					     unmapped >> PAGE_SHIFT,
+					     dma->prot, false);

So at that time it was relying on the "unmap smaller gives the larger
size" trick that we see Intel and AMD implementing today. No
requirement for split, but the ARM split code could be triggered.

Next, the introduction of VFIO_TYPE1v2_IOMMU eliminated the ability for
userspace to request splitting a mapping. Userspace can only unmap what
userspace maps:

commit 1ef3e2bc0422 ("vfio/iommu_type1: Multi-IOMMU domain support")

    To do this, our DMA tracking needs to change.  We currently try to
    coalesce user mappings into as few tracking entries as possible.  The
    problem then becomes that we lose granularity of user mappings.  We've
    never guaranteed that a user is able to unmap at a finer granularity
    than the original mapping, but we must honor the granularity of the
    original mapping.  This coalescing code is therefore removed, allowing
    only unmaps covering complete maps.  The change in accounting is
    fairly small here, a typical QEMU VM will start out with roughly a
    dozen entries, so it's arguable if this coalescing was ever needed.

That blocked any requirement for splitting driven by the uAPI. Note
that uAPI-based splitting never worked right anyhow, and AFAICT AMD
didn't implement split at the time.

Finally, commit 6fe1010d6d9c ("vfio/type1: DMA unmap chunking") changed
the unmap loop to this:

-		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+		/*
+		 * To optimize for fewer iommu_unmap() calls, each of which
+		 * may require hardware cache flushing, try to find the
+		 * largest contiguous physical memory chunk to unmap.
+		 */
+		for (len = PAGE_SIZE;
+		     !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
+			next = iommu_iova_to_phys(domain->domain, iova + len);
+			if (next != phys + len)
+				break;
+		}
+
+		unmapped = iommu_unmap(domain->domain, iova, len);

fgsp=true is only possible on AMD, and this loop effectively guarantees
that the iommu driver will never see a partial huge page unmap, as the
size is discovered via iommu_iova_to_phys() before calling unmap.

At that point only the AMD driver will ever see a page size that is
smaller than what is in the radix tree.

Jason
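
P.S. For anyone who wants to see the "unmap smaller gives the larger
size" contract in isolation, here is a minimal userspace toy model of
it. None of this is kernel code: the 2MB block size, toy_unmap() and
the caller loop are all made up for illustration. The only point it
shows is that a PAGE_SIZE unmap request landing inside a block mapping
tears down and reports the whole block, and that a caller which
advances by the returned size (like the old vfio_iommu_type1 loop
quoted above) still walks the range correctly.

/*
 * Illustrative userspace model only, not kernel code: one "IOMMU"
 * domain containing a single 2MB block mapping at IOVA 0.
 */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE	4096UL
#define BLOCK_SIZE	(2UL * 1024 * 1024)

static size_t block_len = BLOCK_SIZE;	/* remaining mapped length */

/*
 * Mimics the driver behaviour discussed above: a request smaller than
 * the block still unmaps the whole block and returns the larger size.
 */
static size_t toy_unmap(unsigned long iova, size_t size)
{
	(void)size;	/* the requested size is a lower bound only */
	if (iova >= BLOCK_SIZE || !block_len)
		return 0;
	block_len = 0;
	return BLOCK_SIZE;
}

int main(void)
{
	unsigned long iova = 0, end = BLOCK_SIZE;
	size_t unmapped;

	/* Old-style caller: ask for PAGE_SIZE, advance by what came back */
	while (iova < end) {
		unmapped = toy_unmap(iova, PAGE_SIZE);
		if (!unmapped)
			break;
		printf("asked to unmap %lu bytes, driver unmapped %zu\n",
		       PAGE_SIZE, unmapped);
		iova += unmapped;
	}
	return 0;
}

The later chunking loop avoids leaning on this contract altogether by
probing contiguity with iommu_iova_to_phys() first, so the size passed
to unmap already matches what is mapped.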