On Thu, Oct 24, 2024 at 02:05:53PM +0100, Will Deacon wrote:
> My recollection is hazy, but I seem to remember VFIO using the largest
> page sizes in the IOMMU 'pgsize_bitmap' for map() requests but then
> using the smallest page size for unmap() requests, so you'd end up
> cracking block mappings when tearing down a VM with assigned devices.
>
> Is this what you're referring to when you say?

Sounds like it, but I'm really hazy on the long ago history here.

> > Long ago VFIO could trigger a path like this, today I know of no user of
> > this functionality.
>
> If so, please can you provide a reference to the patch that moved VFIO
> off that problematic behaviour?

Looking more deeply, I'm not really sure VFIO ever required this.

vfio commit 166fd7d94afd ("vfio: hugepage support for vfio_iommu_type1")
is the thing that added the huge page support, and it called map like:

+		ret = iommu_map(iommu->domain, iova,
+				(phys_addr_t)pfn << PAGE_SHIFT,
+				npage << PAGE_SHIFT, prot);

But then the unmap path still looked like:

+		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
+					     unmapped >> PAGE_SHIFT,
+					     dma->prot, false);

So at that time it was relying on the "unmap smaller gives the larger
size" trick that we see Intel and AMD implementing today. No
requirement for split, but the ARM split code could be triggered.

Next, the introduction of VFIO_TYPE1v2_IOMMU eliminated the ability for
userspace to request splitting a mapping. Userspace can only unmap what
userspace maps:

commit 1ef3e2bc0422 ("vfio/iommu_type1: Multi-IOMMU domain support")

    To do this, our DMA tracking needs to change.  We currently try to
    coalesce user mappings into as few tracking entries as possible.  The
    problem then becomes that we lose granularity of user mappings.  We've
    never guaranteed that a user is able to unmap at a finer granularity
    than the original mapping, but we must honor the granularity of the
    original mapping.  This coalescing code is therefore removed, allowing
    only unmaps covering complete maps.  The change in accounting is
    fairly small here, a typical QEMU VM will start out with roughly a
    dozen entries, so it's arguable if this coalescing was ever needed.

That blocked any requirement for splitting driven by the uAPI. Note
that uAPI-based splitting never worked right anyhow, and AFAICT AMD
didn't implement split at the time.

Finally, commit 6fe1010d6d9c ("vfio/type1: DMA unmap chunking") changed
the unmap loop to this:

-		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+		/*
+		 * To optimize for fewer iommu_unmap() calls, each of which
+		 * may require hardware cache flushing, try to find the
+		 * largest contiguous physical memory chunk to unmap.
+		 */
+		for (len = PAGE_SIZE;
+		     !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
+			next = iommu_iova_to_phys(domain->domain, iova + len);
+			if (next != phys + len)
+				break;
+		}
+
+		unmapped = iommu_unmap(domain->domain, iova, len);

fgsp=true is only possible on AMD, and this loop effectively guarantees
that the iommu driver will never see a partial huge page unmap, as the
size is discovered via iommu_iova_to_phys() before calling unmap.

At that point only the AMD driver will ever see a page size that is
smaller than what is in the radix tree.

Jason
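
P.S. For anyone who wants to see the "unmap smaller gives the larger
size" contract in isolation, here is a minimal userspace toy model of
it. None of this is kernel code: the 2MB block size, toy_unmap() and
the caller loop are all made up for illustration. The only point it
shows is that a PAGE_SIZE unmap request landing inside a block mapping
tears down and reports the whole block, and that a caller which
advances by the returned size (like the old vfio_iommu_type1 loop
quoted above) still walks the range correctly.

/*
 * Illustrative userspace model only, not kernel code: one "IOMMU"
 * domain containing a single 2MB block mapping at IOVA 0.
 */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE	4096UL
#define BLOCK_SIZE	(2UL * 1024 * 1024)

static size_t block_len = BLOCK_SIZE;	/* remaining mapped length */

/*
 * Mimics the driver behaviour discussed above: a request smaller than
 * the block still unmaps the whole block and returns the larger size.
 */
static size_t toy_unmap(unsigned long iova, size_t size)
{
	(void)size;	/* the requested size is a lower bound only */
	if (iova >= BLOCK_SIZE || !block_len)
		return 0;
	block_len = 0;
	return BLOCK_SIZE;
}

int main(void)
{
	unsigned long iova = 0, end = BLOCK_SIZE;
	size_t unmapped;

	/* Old-style caller: ask for PAGE_SIZE, advance by what came back */
	while (iova < end) {
		unmapped = toy_unmap(iova, PAGE_SIZE);
		if (!unmapped)
			break;
		printf("asked to unmap %lu bytes, driver unmapped %zu\n",
		       PAGE_SIZE, unmapped);
		iova += unmapped;
	}
	return 0;
}

The later chunking loop avoids leaning on this contract altogether by
probing contiguity with iommu_iova_to_phys() first, so the size passed
to unmap already matches what is mapped.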