On 17/07/2023 14:56, David Hildenbrand wrote: > On 17.07.23 15:20, Ryan Roberts wrote: >> On 17/07/2023 14:06, David Hildenbrand wrote: >>> On 14.07.23 19:17, Yu Zhao wrote: >>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: >>>>> >>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >>>>> allocated in large folios of a determined order. All pages of the large >>>>> folio are pte-mapped during the same page fault, significantly reducing >>>>> the number of page faults. The number of per-page operations (e.g. ref >>>>> counting, rmap management lru list management) are also significantly >>>>> reduced since those ops now become per-folio. >>>>> >>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >>>>> defaults to disabled for now; The long term aim is for this to defaut to >>>>> enabled, but there are some risks around internal fragmentation that >>>>> need to be better understood first. >>>>> >>>>> When enabled, the folio order is determined as such: For a vma, process >>>>> or system that has explicitly disabled THP, we continue to allocate >>>>> order-0. THP is most likely disabled to avoid any possible internal >>>>> fragmentation so we honour that request. >>>>> >>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas >>>>> that have not explicitly opted-in to use transparent hugepages (e.g. >>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then >>>>> arch_wants_pte_order() is limited by the new cmdline parameter, >>>>> `flexthp_unhinted_max`. This allows for a performance boost without >>>>> requiring any explicit opt-in from the workload while allowing the >>>>> sysadmin to tune between performance and internal fragmentation. >>>>> >>>>> arch_wants_pte_order() can be overridden by the architecture if desired. >>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>>>> set of ptes map physically contigious, naturally aligned memory, so this >>>>> mechanism allows the architecture to optimize as required. >>>>> >>>>> If the preferred order can't be used (e.g. because the folio would >>>>> breach the bounds of the vma, or because ptes in the region are already >>>>> mapped) then we fall back to a suitable lower order; first >>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>>>> >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> >>>>> --- >>>>> .../admin-guide/kernel-parameters.txt | 10 + >>>>> mm/Kconfig | 10 + >>>>> mm/memory.c | 187 ++++++++++++++++-- >>>>> 3 files changed, 190 insertions(+), 17 deletions(-) >>>>> >>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>>>> b/Documentation/admin-guide/kernel-parameters.txt >>>>> index a1457995fd41..405d624e2191 100644 >>>>> --- a/Documentation/admin-guide/kernel-parameters.txt >>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>>>> @@ -1497,6 +1497,16 @@ >>>>> See Documentation/admin-guide/sysctl/net.rst for >>>>> fb_tunnels_only_for_init_ns >>>>> >>>>> + flexthp_unhinted_max= >>>>> + [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The >>>>> maximum >>>>> + folio size that will be allocated for an anonymous vma >>>>> + that has neither explicitly opted in nor out of using >>>>> + transparent hugepages. The size must be a >>>>> power-of-2 in >>>>> + the range [PAGE_SIZE, PMD_SIZE). A larger size >>>>> improves >>>>> + performance by reducing page faults, while a smaller >>>>> + size reduces internal fragmentation. Default: max(64K, >>>>> + PAGE_SIZE). Format: size[KMG]. >>>>> + >>>> >>>> Let's split this parameter into a separate patch. >>>> >>> >>> Just a general comment after stumbling over patch #2, let's not start splitting >>> patches into things that don't make any sense on their own; that just makes >>> review a lot harder. >> >> ACK >> >>> >>> For this case here, I'd suggest first adding the general infrastructure and then >>> adding tunables we want to have on top. >> >> OK, so 1 patch for the main infrastructure, then a patch to disable for >> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max >> via a sysctl? > > MADV_NOHUGEPAGE handling for me falls under the category "required for > correctness to not break existing workloads" and has to be there initially. > > Anything that is rather a performance tunable (e.g., a sysctl to optimize) can > be added on top and discussed separately.> > At least IMHO :) > >> >>> >>> I agree that toggling that at runtime (for example via sysfs as raised by me >>> previously) would be nicer. >> >> OK, I clearly misunderstood, I thought you were requesting a boot parameter. > > Oh, sorry about that. I wanted to actually express > "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at > runtime as well. > >> What's the ABI compat guarrantee for sysctls? I assumed that for a boot >> parameter it would be easier to remove in future if we wanted, but for sysctl, >> its there forever? > > sysctl are hard/impossible to remove, yes. So we better make sure what we add > has clear semantics. > > If we ever want some real auto-tunable mode (and can actually implement it > without harming performance; and I am skeptical), we might want to allow for > setting such a parameter to "auto", for example. > >> >> Also, how do you feel about the naming and behavior of the parameter? > > Very good question. "flexthp_unhinted_max" naming is a bit suboptimal. > > For example, I'm not so sure if we should expose the feature to user space as > "flexthp" at all. I think we should find a clearer feature name to begin with. > > ... maybe we can initially get away with dropping that parameter and default to > something reasonably small (i.e., 64k as you have above)? That would certainly get my vote. But it was you who was arguing for a tunable previously ;-). I propose we use the following as the "unhinted ceiling" for now, then we can add a tunable if/when we find a use case that doesn't work optimally with this value: static int flexthp_unhinted_max_order = ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT; (Using PAGE_SIZE when its gt 64K to cover the ppc case that looks like it can support 256K pages. Open coding the max because max() can't be used outside a function). > > /sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp. Yes, that should work with the patch as it is today. >