Re: [PATCH 00/19] mm: Support huge pfnmaps

On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> Overview
> ========
> 
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th (the
> latest), plus the dax 1g fix [1].  Note that this series should also apply
> without the dax 1g fix series, but in that case mprotect() will trigger
> similar errors on PUD mappings.
> 
> This series implements huge pfnmap support for mm in general.  Huge pfnmaps
> allow e.g. VM_PFNMAP vmas to map at either PMD or PUD level, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars, which can grow as
> large as 8GB or even bigger.

FWIW, I've started to hear people talk about needing this in the VFIO
context with VMs.

vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
set up the IOMMU, but KVM is not able to do so as reliably. There is a
notable performance gap with two dimensional paging between 4k and 1G
entries in the KVM table. The platforms are being architected with the
assumption that 1G TLB entries will be used throughout the hypervisor
environment.

> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  

There is definitely interest here in extending ARM to support the 1G
size too; what is missing?

> The other trick is how to make gup-fast work for such huge mappings even
> though there's no direct way of knowing whether a mapping is a normal page
> or MMIO.  This series chose to keep the pte_special solution, reusing the
> same idea of setting a special bit on pfnmap PMDs/PUDs so that gup-fast
> can identify them and fail properly.

Makes sense.
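
To sketch the idea for anyone following along (a toy, self-contained model
rather than the series' actual code; the bit position and helper names below
are made up): a software "special" bit on the huge entry lets the lockless
walker bail out to the slow path, just as pte_special() already does for 4k
PTEs.

/*
 * Toy model, not kernel code: a software "special" bit marks a pfnmap
 * PMD so the lockless gup-fast walk can detect it and fall back to the
 * slow path.  Bit position and names are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>

#define PMD_SPECIAL_BIT	(1ULL << 9)	/* stand-in for a real sw bit */

typedef struct { uint64_t val; } pmd_t;	/* simplified stand-in type */

static bool pmd_special(pmd_t pmd)
{
	return pmd.val & PMD_SPECIAL_BIT;
}

/* gup-fast style check: no struct page behind a special PMD, so bail. */
static int gup_fast_pmd_leaf(pmd_t pmd)
{
	if (pmd_special(pmd))
		return 0;	/* force fallback to the slow gup path */
	return 1;		/* normal huge page: go pin its pages */
}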

> More architectures / More page sizes
> ------------------------------------
> 
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> 
> For example, if arm64 starts to support THP_PUD one day, huge pfnmaps at
> 1G will be enabled automatically.

Oh, that sounds like a bigger step...
 
> VFIO is so far the only consumer of huge pfnmaps once this series is
> applied.  Besides the generic remap_pfn_range() optimization above, a
> device driver can also try to optimize its mmap() towards better VA
> alignment for PMD/PUD sizes.  IIUC this would normally require userspace
> changes, as the driver doesn't normally decide the VA at which a bar is
> mapped.  But I don't think I know all the drivers well enough to have the
> full picture.

How does alignment work? In most cases I'm aware of, userspace does
not use MAP_FIXED, so the expectation would be for the kernel to
automatically select a highly aligned address. I suppose your cases are
working because qemu uses MAP_FIXED and naturally aligns the BAR
addresses?
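
For reference, here is roughly the userspace trick I have in mind (a minimal
sketch, not what qemu literally does; the function name and the 1G constant
are only for illustration): reserve an oversized PROT_NONE window, then place
the BAR mapping at a 1G-aligned address inside it with MAP_FIXED.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PUD_SIZE	(1UL << 30)	/* 1G alignment target */

/*
 * Map fd's BAR (offset 0) at a 1G-aligned VA by carving it out of an
 * oversized PROT_NONE reservation.  MAP_FIXED is safe here because it
 * only replaces pages we reserved ourselves.
 */
static void *map_bar_pud_aligned(int fd, size_t bar_size)
{
	size_t reserve = bar_size + PUD_SIZE;
	void *window = mmap(NULL, reserve, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (window == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t aligned = ((uintptr_t)window + PUD_SIZE - 1) & ~(PUD_SIZE - 1);

	return mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, fd, 0);
}

The unused tails of the reservation stay PROT_NONE; a real implementation
would munmap() them afterwards.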

> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages are mapped with PUDs

Oh :(

Jason