The patch titled
     Subject: mm: thp: batch-collapse PMD with set_ptes()
has been added to the -mm mm-unstable branch.  Its filename is
     mm-thp-batch-collapse-pmd-with-set_ptes.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-thp-batch-collapse-pmd-with-set_ptes.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Ryan Roberts <ryan.roberts@xxxxxxx>
Subject: mm: thp: batch-collapse PMD with set_ptes()
Date: Mon, 18 Dec 2023 10:50:45 +0000

Patch series "Transparent Contiguous PTEs for User Mappings", v4.

This is a series to opportunistically and transparently use contpte
mappings (i.e. set the contiguous bit in PTEs) for user memory when those
mappings meet the requirements.  It is part of a wider effort to improve
performance by allocating and mapping variable-sized blocks of memory
(folios).  One aim is for the 4K kernel to approach the performance of
the 16K kernel, but without breaking compatibility and without the
associated increase in memory.  Another aim is to benefit the 16K and 64K
kernels by enabling 2M THP, since this is the contpte size for those
kernels.  We have good performance data that demonstrates both aims are
being met (see below).
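To illustrate the idea, here is a simplified sketch (not code from this
series): on arm64 with 4K pages, a naturally aligned block of CONT_PTES
(16) PTEs that maps physically contiguous memory with identical
attributes may carry the contiguous hint in each entry, set via
pte_mkcont(), which lets the TLB cover the whole 64K block with a single
entry.  CONT_PTES, pte_mkcont(), mk_pte() and set_pte_at() are the real
helpers; the loop and the local variables are illustrative only:

	/*
	 * Illustrative sketch only: mark one naturally aligned block of
	 * CONT_PTES entries as contiguous.  This series arranges for the
	 * equivalent to happen transparently inside the arm64 pte
	 * helpers, whenever a mapping meets the requirements.
	 */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = pte_mkcont(mk_pte(page + i, vma->vm_page_prot));

		set_pte_at(mm, addr + i * PAGE_SIZE, ptep + i, pte);
	}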
Of course this is only one half of the change.  We require the mapped
physical memory to be the correct size and alignment for this to actually
be useful (i.e. 64K for 4K pages, or 2M for 16K/64K pages).  Fortunately
folios are solving this problem for us.  Filesystems that support it
(XFS, AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD
size today, and more filesystems are coming.  And the other half of my
work, to enable "multi-size THP" (large folios) for anonymous memory,
makes contpte-sized folios prevalent for anonymous memory too [4].

Note that the first 3 patches are for core-mm and provide the refactoring
that makes some crucial optimizations possible, which are then
implemented in patches 15 and 16.  The remaining patches are
arm64-specific.

Testing
=======

I've tested this series together with multi-size THP [4] on both Ampere
Altra (bare metal) and Apple M2 (VM):

- mm selftests (including new tests written for multi-size THP); no
  regressions
- Speedometer JavaScript benchmark in Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues

Performance
===========

High Level Use Cases
~~~~~~~~~~~~~~~~~~~~

First, some high level use cases (kernel compilation and Speedometer
JavaScript benchmarks).  These are running on Ampere Altra (I've seen
similar improvements on Android/Pixel 6).

baseline:                  mm-unstable (including mTHP, but switched off)
mTHP:                      enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:            + this series
mTHP + contpte + exefolio: + proof-of-concept patch to always read
                           executable memory from file into a 64K folio,
                           to enable contpte-mapping the text

Kernel Compilation with -j8 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.6% |    -38.0% |     -0.4% |
| mTHP + contpte            |     -5.4% |    -37.7% |     -1.3% |
| mTHP + contpte + exefolio |     -7.4% |    -39.5% |     -3.5% |

Kernel Compilation with -j80 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.9% |    -36.1% |     -0.2% |
| mTHP + contpte            |     -5.8% |    -36.0% |     -1.2% |
| mTHP + contpte + exefolio |     -6.8% |    -37.0% |     -3.1% |

Speedometer (positive is faster):

| kernel                    | runs_per_min |
|:--------------------------|--------------|
| baseline                  |         0.0% |
| mTHP                      |         1.5% |
| mTHP + contpte            |         3.7% |
| mTHP + contpte + exefolio |         4.9% |

Micro Benchmarks
~~~~~~~~~~~~~~~~

Additionally for this version, I've done a significant amount of
microbenchmarking (and fixes!) to ensure that the performance of fork(),
madvise(DONTNEED) and munmap() does not regress.  Thanks to David for
sharing his benchmarks.

baseline:    mm-unstable (including mTHP, but switched off)
contpte-dis: + this series, with ARM64_CONTPTE disabled at compile-time
             (to show the impact of the core-mm changes)
contpte-ena: + ARM64_CONTPTE enabled at compile-time (to show the impact
             of the arm64-specific changes)

I'm showing the collated results summary here.  See the individual patch
commit logs for commentary:

| Apple M2 VM   |       fork        |     dontneed      |      munmap       |
| order-0       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.1% |    0.0% |    7.5% |    0.0% |    3.8% |
| contpte-dis   |   -1.0% |    2.0% |   -9.6% |    3.1% |   -1.9% |    0.2% |
| contpte-ena   |    2.6% |    1.7% |  -10.2% |    1.6% |    1.9% |    0.7% |

| Apple M2 VM   |       fork        |     dontneed      |      munmap       |
| order-9       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.2% |    0.0% |    7.9% |    0.0% |    6.4% |
| contpte-dis   |   -0.1% |    1.1% |   -4.9% |    8.1% |   -4.7% |    0.8% |
| contpte-ena   |  -25.4% |    1.9% |   -9.9% |    0.9% |   -6.0% |    1.4% |

| Ampere Altra  |       fork        |     dontneed      |      munmap       |
| order-0       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.0% |    0.0% |    0.1% |    0.0% |    0.9% |
| contpte-dis   |   -0.1% |    1.2% |   -0.2% |    0.1% |   -0.2% |    0.6% |
| contpte-ena   |    1.8% |    0.7% |    1.3% |    0.0% |    2.0% |    0.4% |

| Ampere Altra  |       fork        |     dontneed      |      munmap       |
| order-9       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    0.1% |    0.0% |    0.0% |    0.0% |    0.1% |
| contpte-dis   |   -0.1% |    0.1% |   -0.1% |    0.0% |   -3.2% |    0.2% |
| contpte-ena   |   -6.7% |    0.1% |   14.1% |    0.0% |   -0.6% |    0.2% |

Misc
~~~~

John Hubbard at Nvidia has reported dramatic 10x performance improvements
for some workloads at [5], when using a 64K base page kernel.

[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@xxxxxxx/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@xxxxxxx/
[3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@xxxxxxx/
[4] https://lore.kernel.org/linux-arm-kernel/20231204102027.57185-1-ryan.roberts@xxxxxxx/
[5] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@xxxxxxxxxx/

This patch (of 16):

Refactor __split_huge_pmd_locked() so that a present PMD can be collapsed
to PTEs in a single batch using set_ptes().

It also provides a future opportunity to batch-add the folio to the rmap
using David's new batched rmap APIs.

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to have to fold the contpte
entries.  Instead, since the PTEs are set as a batch, the contpte blocks
can be initially set up pre-folded (once the arm64 contpte support is
added in the next few patches).  This leads to a noticeable performance
improvement during split.
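In essence, the change has the following shape (a condensed sketch that
elides the young/dirty/soft-dirty/uffd-wp flag handling and the rmap
updates of the real function): the present-PMD path goes from one
set_pte_at() call per subpage to a single batched call:

	/* Before: HPAGE_PMD_NR individual stores, one per subpage. */
	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
		set_pte_at(mm, addr, pte + i, entry);
	}

	/*
	 * After: build one template entry from the first subpage;
	 * set_ptes() writes HPAGE_PMD_NR consecutive entries, advancing
	 * the pfn for each entry itself.
	 */
	entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
	set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);

Because the whole range now reaches the arch layer as one call, an arm64
set_ptes() implementation can write contpte blocks pre-folded instead of
folding them after the fact.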
Link: https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@xxxxxxx
Link: https://lkml.kernel.org/r/20231218105100.172635-2-ryan.roberts@xxxxxxx
Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Cc: Alexander Potapenko <glider@xxxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Andrey Konovalov <andreyknvl@xxxxxxxxx>
Cc: Andrey Ryabinin <ryabinin.a.a@xxxxxxxxx>
Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
Cc: Ard Biesheuvel <ardb@xxxxxxxxxx>
Cc: Barry Song <21cnbao@xxxxxxxxx>
Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
Cc: James Morse <james.morse@xxxxxxx>
Cc: John Hubbard <jhubbard@xxxxxxxxxx>
Cc: Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
Cc: Marc Zyngier <maz@xxxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Oliver Upton <oliver.upton@xxxxxxxxx>
Cc: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
Cc: Vincenzo Frascino <vincenzo.frascino@xxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Yang Shi <shy828301@xxxxxxxxx>
Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
Cc: Zenghui Yu <yuzenghui@xxxxxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/huge_memory.c |   59 +++++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 25 deletions(-)

--- a/mm/huge_memory.c~mm-thp-batch-collapse-pmd-with-set_ptes
+++ a/mm/huge_memory.c
@@ -2535,15 +2535,16 @@ static void __split_huge_pmd_locked(stru
 	pte = pte_offset_map(&_pmd, haddr);
 	VM_BUG_ON(!pte);
-	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
-		pte_t entry;
-		/*
-		 * Note that NUMA hinting access restrictions are not
-		 * transferred to avoid any possibility of altering
-		 * permissions across VMAs.
-		 */
-		if (freeze || pmd_migration) {
+
+	/*
+	 * Note that NUMA hinting access restrictions are not transferred to
+	 * avoid any possibility of altering permissions across VMAs.
+	 */
+	if (freeze || pmd_migration) {
+		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+			pte_t entry;
 			swp_entry_t swp_entry;
+
 			if (write)
 				swp_entry = make_writable_migration_entry(
 							page_to_pfn(page + i));
@@ -2562,28 +2563,36 @@
 				entry = pte_swp_mksoft_dirty(entry);
 			if (uffd_wp)
 				entry = pte_swp_mkuffd_wp(entry);
-		} else {
-			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-			if (write)
-				entry = pte_mkwrite(entry, vma);
+
+			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+			set_pte_at(mm, addr, pte + i, entry);
+		}
+	} else {
+		pte_t entry;
+
+		entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+		if (write)
+			entry = pte_mkwrite(entry, vma);
+		if (!young)
+			entry = pte_mkold(entry);
+		/* NOTE: this may set soft-dirty too on some archs */
+		if (dirty)
+			entry = pte_mkdirty(entry);
+		if (soft_dirty)
+			entry = pte_mksoft_dirty(entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+
+		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 			if (anon_exclusive)
 				SetPageAnonExclusive(page + i);
-			if (!young)
-				entry = pte_mkold(entry);
-			/* NOTE: this may set soft-dirty too on some archs */
-			if (dirty)
-				entry = pte_mkdirty(entry);
-			if (soft_dirty)
-				entry = pte_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_mkuffd_wp(entry);
 			page_add_anon_rmap(page + i, vma, addr, RMAP_NONE);
+			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 		}
-		VM_BUG_ON(!pte_none(ptep_get(pte)));
-		set_pte_at(mm, addr, pte, entry);
-		pte++;
+
+		set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
 	}
-	pte_unmap(pte - 1);
+	pte_unmap(pte);
 
 	if (!pmd_migration)
 		page_remove_rmap(page, vma, true);
_

Patches currently in -mm which might be from ryan.roberts@xxxxxxx are

mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch
mm-non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap.patch
mm-thp-introduce-multi-size-thp-sysfs-interface.patch
mm-thp-introduce-multi-size-thp-sysfs-interface-fix.patch
mm-thp-support-allocation-of-anonymous-multi-size-thp.patch
mm-thp-support-allocation-of-anonymous-multi-size-thp-fix.patch
selftests-mm-kugepaged-restore-thp-settings-at-exit.patch
selftests-mm-factor-out-thp-settings-management.patch
selftests-mm-support-multi-size-thp-interface-in-thp_settings.patch
selftests-mm-khugepaged-enlighten-for-multi-size-thp.patch
selftests-mm-cow-generalize-do_run_with_thp-helper.patch
selftests-mm-cow-add-tests-for-anonymous-multi-size-thp.patch
mm-thp-batch-collapse-pmd-with-set_ptes.patch
mm-batch-copy-pte-ranges-during-fork.patch
mm-batch-clear-pte-ranges-during-zap_pte_range.patch
arm64-mm-set_pte-new-layer-to-manage-contig-bit.patch
arm64-mm-set_ptes-set_pte_at-new-layer-to-manage-contig-bit.patch
arm64-mm-pte_clear-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_get_and_clear-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_test_and_clear_young-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_clear_flush_young-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_set_wrprotect-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_set_access_flags-new-layer-to-manage-contig-bit.patch
arm64-mm-ptep_get-new-layer-to-manage-contig-bit.patch
arm64-mm-split-__flush_tlb_range-to-elide-trailing-dsb.patch
arm64-mm-wire-up-pte_cont-for-user-mappings.patch
arm64-mm-implement-new-helpers-to-optimize-fork.patch
arm64-mm-implement-clear_ptes-to-optimize-exit-munmap-dontneed.patch
selftests-mm-log-run_vmtestssh-results-in-tap-format.patch