On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) is also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; the long term aim is for this to default
> to enabled, but there are some risks around internal fragmentation that
> need to be better understood first.
>
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
>
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted-in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> arch_wants_pte_order() is limited by the new cmdline parameter,
> `flexthp_unhinted_max`. This allows for a performance boost without
> requiring any explicit opt-in from the workload while allowing the
> sysadmin to tune between performance and internal fragmentation.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
> set of ptes map physically contiguous, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> ---
>  .../admin-guide/kernel-parameters.txt |  10 +
>  mm/Kconfig                            |  10 +
>  mm/memory.c                           | 187 ++++++++++++++++--
>  3 files changed, 190 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1457995fd41..405d624e2191 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1497,6 +1497,16 @@
>                         See Documentation/admin-guide/sysctl/net.rst for
>                         fb_tunnels_only_for_init_ns
>
> +       flexthp_unhinted_max=
> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> +                       folio size that will be allocated for an anonymous vma
> +                       that has neither explicitly opted in nor out of using
> +                       transparent hugepages. The size must be a power-of-2 in
> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> +                       performance by reducing page faults, while a smaller
> +                       size reduces internal fragmentation. Default: max(64K,
> +                       PAGE_SIZE). Format: size[KMG].
> +

Let's split this parameter into a separate patch. And I'm going to ask
many questions about it (I can live with a sysctl parameter but this
boot parameter is unacceptable to me).
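
To make the sysctl alternative concrete, here is a minimal sketch of the
direction I have in mind (the knob name, the "vm" location and the
order-based unit are placeholders for illustration, not a proposal for
the final interface; untested):

        /* Upper bound would really be derived from the PMD order at init. */
        static int flexthp_unhinted_max_order_min;      /* order-0 */
        static int flexthp_unhinted_max_order_max = 8;

        static struct ctl_table flexthp_sysctls[] = {
                {
                        .procname     = "flexthp_unhinted_max_order",
                        .data         = &flexthp_unhinted_max_order,
                        .maxlen       = sizeof(int),
                        .mode         = 0644,
                        .proc_handler = proc_dointvec_minmax,
                        .extra1       = &flexthp_unhinted_max_order_min,
                        .extra2       = &flexthp_unhinted_max_order_max,
                },
                { }
        };

        static int __init flexthp_sysctl_init(void)
        {
                register_sysctl("vm", flexthp_sysctls);
                return 0;
        }
        late_initcall(flexthp_sysctl_init);

A sysctl also has the advantage that the value can be changed at
runtime rather than being fixed at boot.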
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..e8bc729efb9d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +       int i;
> +
> +       if (nr_pages == 1)
> +               return vmf_pte_changed(vmf);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
> +#ifdef CONFIG_FLEXIBLE_THP
> +static int flexthp_unhinted_max_order =
> +       ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
> +
> +static int __init parse_flexthp_unhinted_max(char *s)
> +{
> +       unsigned long long size = memparse(s, NULL);
> +
> +       if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
> +               pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
> +                       s, PAGE_SIZE, PMD_SIZE);
> +               return 1;
> +       }
> +
> +       flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
> +
> +       /* THP machinery requires at least 3 struct pages for meta data. */
> +       if (flexthp_unhinted_max_order == 1)
> +               flexthp_unhinted_max_order--;
> +
> +       return 1;
> +}
> +
> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +       int order;
> +
> +       /*
> +        * If THP is explicitly disabled for either the vma, the process or the
> +        * system, then this is very likely intended to limit internal
> +        * fragmentation; in this case, don't attempt to allocate a large
> +        * anonymous folio.
> +        *
> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> +        * size preferred by the arch. Or if the arch requested a very small
> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> +        * which still meets the arch's requirements but means we still take
> +        * advantage of SW optimizations (e.g. fewer page faults).
> +        *
> +        * Finally if thp is enabled but the vma isn't eligible, take the
> +        * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
> +        * parameter. This allows a sysadmin to tune performance vs internal
> +        * fragmentation.
> +        */
> +
> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> +           !hugepage_flags_enabled())
> +               order = 0;
> +       else {
> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +                       order = min(order, flexthp_unhinted_max_order);
> +       }
> +
> +       return order;
> +}
> +
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> +       int i;
> +       gfp_t gfp;
> +       pte_t *pte;
> +       unsigned long addr;
> +       struct vm_area_struct *vma = vmf->vma;
> +       int prefer = anon_folio_order(vma);
> +       int orders[] = {
> +               prefer,
> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> +
> +       *folio = NULL;
> +
> +       if (vmf_orig_pte_uffd_wp(vmf))
> +               goto fallback;
> +
> +       for (i = 0; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               if (addr >= vma->vm_start &&
> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> +                       break;
> +       }
> +
> +       if (!orders[i])
> +               goto fallback;
> +
> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +       if (!pte)
> +               return -EAGAIN;

It would be a bug if this happens. So probably -EINVAL?
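
I.e. something along these lines (illustrative only, untested):

        pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
        if (!pte) {
                /* Per the above: this shouldn't happen here, so warn and bail. */
                VM_WARN_ON_ONCE(1);
                return -EINVAL;
        }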
> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               vmf->pte = pte + pte_index(addr);
> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> +                       break;
> +       }
> +
> +       vmf->pte = NULL;
> +       pte_unmap(pte);
> +
> +       gfp = vma_thp_gfp_mask(vma);
> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> +               if (*folio) {
> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> +                       return 0;
> +               }
> +       }
> +
> +fallback:
> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       return *folio ? 0 : -ENOMEM;
> +}
> +#else
> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)

Drop "inline" (it doesn't do anything in .c). The rest looks good to me.
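
And just to double-check I'm reading the contract correctly, a caller
would consume this roughly as below (paraphrasing for illustration, not
the actual do_anonymous_page() hunk from this patch):

        struct folio *folio;
        int ret;

        ret = alloc_anon_folio(vmf, &folio);
        if (ret == -EAGAIN)
                return 0;       /* pte_offset_map() failed; let the fault be retried */
        if (ret)
                goto oom;       /* -ENOMEM from the order-0 fallback */
        /* Success: folio is the preferred order or one of the fallback orders. */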