On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
>   special bits for huge pfn mappings when inject" [JasonG]
> - In gup-fast, reorder _special check v.s. _devmap check, so as to make
>   pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
>   used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on mm-unstable, commit b659edec079c of Aug 26th,
> with the patch "vma remove the unneeded avc bound with non-CoWed folio"
> reverted, as it was reported broken [0].
>
> This series implements huge pfnmap support for mm in general.  Huge
> pfnmaps allow e.g. VM_PFNMAP vmas to map at either the PMD or PUD level,
> similar to what we do with dax / thp / hugetlb so far, to benefit from
> TLB hits.  Now we extend that idea to PFN mappings, e.g. PCI MMIO bars,
> which can grow as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
> patch (from Alex Williamson) will be the first user of huge pfnmaps, so
> as to enable the vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support compared to many
> other types of mappings, because of PFNMAP's specialty of having no
> vmemmap backing it, so most of the kernel routines on huge mappings
> should simply already fail for them, like GUP or the old-school
> follow_page() (which was recently rewritten into the folio_walk* APIs
> by David).
>
> One trick here is that the generic paths are still immature on PUDs
> here and there, as DAX is so far the only user.  This patchset will add
> the 2nd user.  Hugetlb can be a 3rd user if the hugetlb unification
> work goes smoothly, but that is to be discussed later.
>
> The other trick is how to keep gup-fast working for such huge mappings
> even if there's no direct way to know whether it's a normal page or an
> MMIO mapping.  This series chose to keep the pte_special solution,
> reusing the same idea of setting a special bit on pfnmap PMDs/PUDs so
> that gup-fast will be able to identify them and fail properly.
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka
> follow_pte(), will need to retire soon due to the fact that it only
> works with ptes.  A new set of simple APIs is introduced (the
> follow_pfnmap* API) to do whatever follow_pte() can already do, plus
> process huge pfnmaps.  Half of this series is about that, and about
> converting all existing pfnmap walkers to use the new API properly.
> Hopefully the new API also looks better, avoiding the exposure of e.g.
> pgtable lock details to the callers, so that it can be used in an even
> more straightforward way.
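
To make sure I read the new API right: usage would look roughly like the
sketch below, correct?  (This is a minimal, non-authoritative sketch; the
struct field names and the return convention are my guesses from the
description above, not the actual patch.  The one hard rule per the
changelog is that the outputs are only valid between start() and end(),
and the PFN must not be used after end() is invoked.)

    #include <linux/mm.h>

    /*
     * Hypothetical caller of the follow_pfnmap*() API: look up the
     * PFN currently mapped at @addr inside a pfnmap vma.
     */
    static int lookup_pfn(struct vm_area_struct *vma, unsigned long addr,
                          unsigned long *pfn)
    {
            struct follow_pfnmap_args args = {
                    .vma = vma,
                    .address = addr,
            };
            int ret;

            /* Assumed: returns 0 on success, negative on failure */
            ret = follow_pfnmap_start(&args);
            if (ret)
                    return ret;

            /* Outputs (args.pfn, ...) are stable only until end() */
            *pfn = args.pfn;

            /* After end(), the mapping may be torn down at any time */
            follow_pfnmap_end(&args);
            return 0;
    }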
> Here, three more options will be introduced and involved in huge pfnmap:
>
> - ARCH_SUPPORTS_HUGE_PFNMAP
>
>   Arch developers will need to select this option when huge pfnmap is
>   supported in the arch's Kconfig.  After this patchset is applied, both
>   x86_64 and arm64 will enable it by default.
>
> - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
>   These options are for driver developers to identify whether the
>   current arch / config supports huge pfnmaps, to decide whether the
>   huge pfnmap APIs can be used to inject them.  One can refer to the
>   last vfio-pci patch from Alex on how to use them properly in a device
>   driver.
>
> So after the whole set is applied, if one enables some dynamic debug
> lines in the vfio-pci core files, we should observe things like:
>
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it says that vfio-pci faults in PMDs properly for
> a few BAR 0 offsets.
>
> Patch Layout
> ============
>
> Patch 1:    Introduce the new options mentioned above for huge PFNMAPs
> Patch 2:    A tiny cleanup
> Patch 3-8:  Preparation patches for huge pfnmap (including introducing
>             the special bit for pmd/pud)
> Patch 9-16: Introduce the follow_pfnmap*() API, use it everywhere, and
>             then drop the follow_pte() API
> Patch 17:   Add huge pfnmap support for x86_64
> Patch 18:   Add huge pfnmap support for arm64
> Patch 19:   Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
> to be a plan to support 1G on arm64 later, on top of this series [2].
>
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should silently benefit all drivers that
> use remap_pfn_range() in their mmap() handlers, with a better TLB hit
> rate and overall faster MMIO accesses, similar to what the processor
> gains from hugepages.

Hi Peter,

I am curious if there is any work needed for unmap_mapping_range?  If a
driver remap_pfn_range()ed at 1G granularity, can the driver unmap at
PAGE_SIZE granularity?

For example, when a PFN in the 1G mapping is poisoned, it would be great
if the mapping could be split into 2M mappings + 4K mappings, so that
only the single poisoned PFN is lost.  (Pretty much like the past
proposal* to use HGM** to improve hugetlb's memory failure handling.)

Probably these questions can be answered after reading your code, which
I plan to do, but I just want to ask in case you have an easy answer for
me.

* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@xxxxxxxxxx/
** https://lwn.net/Articles/912017

> More driver support
> -------------------
>
> VFIO is so far the only consumer of huge pfnmaps after this series is
> applied.  Besides the above remap_pfn_range() generic optimization, a
> device driver can also try to optimize its mmap() for better VA
> alignment on either PMD/PUD sizes.  This may, iiuc, normally require
> userspace changes, as the driver doesn't normally decide the VA used to
> map a bar.  But I don't think I know all the drivers to know the full
> picture.
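
For the driver side, my rough mental model of how a driver would consume
the new options (loosely modeled on what the cover letter says the
vfio-pci patch does, not copied from it) is the sketch below.  bar_pfn()
is a made-up helper, and the alignment checks plus the exact
vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() call details are my
assumptions, so treat it as illustration only:

    #include <linux/mm.h>
    #include <linux/pfn_t.h>

    /* Sketch of a huge_fault handler gated on the new config options */
    static vm_fault_t bar_huge_fault(struct vm_fault *vmf, unsigned int order)
    {
            /* bar_pfn(): hypothetical helper resolving the BAR pfn */
            unsigned long pfn = bar_pfn(vmf);
            bool write = vmf->flags & FAULT_FLAG_WRITE;

            switch (order) {
            case 0:
                    return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
    #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
            case PMD_ORDER:
                    if (!IS_ALIGNED(vmf->address, PMD_SIZE))
                            break;
                    return vmf_insert_pfn_pmd(vmf,
                                    __pfn_to_pfn_t(pfn, PFN_DEV), write);
    #endif
    #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
            case PUD_ORDER:
                    if (!IS_ALIGNED(vmf->address, PUD_SIZE))
                            break;
                    return vmf_insert_pfn_pud(vmf,
                                    __pfn_to_pfn_t(pfn, PFN_DEV), write);
    #endif
            }
            /* Let the core mm retry the fault with a smaller order */
            return VM_FAULT_FALLBACK;
    }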
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e QEMU with a 128MB BAR 0, with some prefault tests, plus
>   mprotect() and fork() tests on the mapped bar
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment, to
>     make sure all pages are mapped with PUDs
>   - Main BAR (8GB) starts to use PUD mappings
>   - Sub BAR (??MBs?) starts to use PMD mappings
>   - Performance-wise, slight improvement compared to the old PTE
>     mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure the driver loads fine with PMD
>     mappings
>
> Credits all go to Alex for help testing the GPU/NIC use cases above.
>
> Comments welcome, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@xxxxxxxxxx
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@xxxxxxxxxx
>
> Alex Williamson (1):
>   vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
>   mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
>   mm: Drop is_huge_zero_pud()
>   mm: Mark special bits for huge pfn mappings when inject
>   mm: Allow THP orders for PFNMAPs
>   mm/gup: Detect huge pfnmap entries in gup-fast
>   mm/pagewalk: Check pfnmap for folio_walk_start()
>   mm/fork: Accept huge pfnmap entries
>   mm: Always define pxx_pgprot()
>   mm: New follow_pfnmap API
>   KVM: Use follow_pfnmap API
>   s390/pci_mmio: Use follow_pfnmap API
>   mm/x86/pat: Use the new follow_pfnmap API
>   vfio: Use the new follow_pfnmap API
>   acrn: Use the new follow_pfnmap API
>   mm/access_process_vm: Use the new follow_pfnmap API
>   mm: Remove follow_pte()
>   mm/x86: Support large pfn mappings
>   mm/arm64: Support large pfn mappings
>
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/include/asm/pgtable.h    |  30 +++++
>  arch/powerpc/include/asm/pgtable.h  |   1 +
>  arch/s390/include/asm/pgtable.h     |   1 +
>  arch/s390/pci/pci_mmio.c            |  22 ++--
>  arch/sparc/include/asm/pgtable_64.h |   1 +
>  arch/x86/Kconfig                    |   1 +
>  arch/x86/include/asm/pgtable.h      |  80 +++++++-----
>  arch/x86/mm/pat/memtype.c           |  17 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
>  drivers/vfio/vfio_iommu_type1.c     |  16 +--
>  drivers/virt/acrn/mm.c              |  16 +--
>  include/linux/huge_mm.h             |  16 +--
>  include/linux/mm.h                  |  57 ++++++++-
>  include/linux/pgtable.h             |  12 ++
>  mm/Kconfig                          |  13 ++
>  mm/gup.c                            |   6 +
>  mm/huge_memory.c                    |  50 +++++---
>  mm/memory.c                         | 183 ++++++++++++++++++++--------
>  mm/pagewalk.c                       |   4 +-
>  virt/kvm/kvm_main.c                 |  19 ++-
>  21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>