On 8/11/2023 10:33 PM, Zi Yan wrote: > On 11 Aug 2023, at 1:34, Yin, Fengwei wrote: > >> On 8/11/2023 9:04 AM, Zi Yan wrote: >>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote: >>> >>>> On 8/11/2023 3:46 AM, Zi Yan wrote: >>>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote: >>>>> >>>>>> On 10/08/2023 18:01, Yu Zhao wrote: >>>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: >>>>>>>> >>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>>>>>>> allocated in large folios of a determined order. All pages of the large >>>>>>>> folio are pte-mapped during the same page fault, significantly reducing >>>>>>>> the number of page faults. The number of per-page operations (e.g. ref >>>>>>>> counting, rmap management lru list management) are also significantly >>>>>>>> reduced since those ops now become per-folio. >>>>>>>> >>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>>>>>>> which defaults to disabled for now; The long term aim is for this to >>>>>>>> defaut to enabled, but there are some risks around internal >>>>>>>> fragmentation that need to be better understood first. >>>>>>>> >>>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing >>>>>>>> (PMD-order) THP and single (S) page allocation according to this policy, >>>>>>>> where fallback (>) is performed for various reasons, such as the >>>>>>>> proposed folio order not fitting within the bounds of the VMA, etc: >>>>>>>> >>>>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>>>>> ----------------|-----------|-------------|---------------|------------- >>>>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>>>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>>>>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>>>>> >>>>>>>> This approach ensures that we don't violate existing hints to only >>>>>>>> allocate single pages - this is required for QEMU's VM live migration >>>>>>>> implementation to work correctly - while allowing us to use LAF >>>>>>>> independently of THP (when sysfs=never). This makes wide scale >>>>>>>> performance characterization simpler, while avoiding exposing any new >>>>>>>> ABI to user space. >>>>>>>> >>>>>>>> When using LAF for allocation, the folio order is determined as follows: >>>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have >>>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where >>>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>>>>>>> is bigger). This allows for a performance boost without requiring any >>>>>>>> explicit opt-in from the workload while limitting internal >>>>>>>> fragmentation. >>>>>>>> >>>>>>>> If the preferred order can't be used (e.g. because the folio would >>>>>>>> breach the bounds of the vma, or because ptes in the region are already >>>>>>>> mapped) then we fall back to a suitable lower order; first >>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>>>>>>> >>>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired. >>>>>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>>>>>>> set of ptes map physically contigious, naturally aligned memory, so this >>>>>>>> mechanism allows the architecture to optimize as required. >>>>>>>> >>>>>>>> Here we add the default implementation of arch_wants_pte_order(), used >>>>>>>> when the architecture does not define it, which returns -1, implying >>>>>>>> that the HW has no preference. In this case, mm will choose it's own >>>>>>>> default order. >>>>>>>> >>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> >>>>>>>> --- >>>>>>>> include/linux/pgtable.h | 13 ++++ >>>>>>>> mm/Kconfig | 10 +++ >>>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-) >>>>>>>> >>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>>>>> index 222a33b9600d..4b488cc66ddc 100644 >>>>>>>> --- a/include/linux/pgtable.h >>>>>>>> +++ b/include/linux/pgtable.h >>>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>>>>>>> } >>>>>>>> #endif >>>>>>>> >>>>>>>> +#ifndef arch_wants_pte_order >>>>>>>> +/* >>>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference >>>>>>>> + * and mm will choose it's own default order. >>>>>>>> + */ >>>>>>>> +static inline int arch_wants_pte_order(void) >>>>>>>> +{ >>>>>>>> + return -1; >>>>>>>> +} >>>>>>>> +#endif >>>>>>>> + >>>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>>>>>>> unsigned long address, >>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>>>>> index 721dc88423c7..a1e28b8ddc24 100644 >>>>>>>> --- a/mm/Kconfig >>>>>>>> +++ b/mm/Kconfig >>>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>>>>>>> >>>>>>>> source "mm/damon/Kconfig" >>>>>>>> >>>>>>>> +config LARGE_ANON_FOLIO >>>>>>>> + bool "Allocate large folios for anonymous memory" >>>>>>>> + depends on TRANSPARENT_HUGEPAGE >>>>>>>> + default n >>>>>>>> + help >>>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where >>>>>>>> + possible, even for pte-mapped memory. This reduces the number of page >>>>>>>> + faults, as well as other per-page overheads to improve performance for >>>>>>>> + many workloads. >>>>>>>> + >>>>>>>> endmenu >>>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>>> index d003076b218d..bbc7d4ce84f7 100644 >>>>>>>> --- a/mm/memory.c >>>>>>>> +++ b/mm/memory.c >>>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>>>> return ret; >>>>>>>> } >>>>>>>> >>>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>>>>>>> +{ >>>>>>>> + int i; >>>>>>>> + >>>>>>>> + if (nr_pages == 1) >>>>>>>> + return vmf_pte_changed(vmf); >>>>>>>> + >>>>>>>> + for (i = 0; i < nr_pages; i++) { >>>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>>>>>>> + return true; >>>>>>>> + } >>>>>>>> + >>>>>>>> + return false; >>>>>>>> +} >>>>>>>> + >>>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>>>>>>> + >>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma) >>>>>>>> +{ >>>>>>>> + int order; >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size >>>>>>>> + * preferred by the arch. Or if the arch requested a very small size or >>>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>>>>>>> + * meets the arch's requirements but means we still take advantage of SW >>>>>>>> + * optimizations (e.g. fewer page faults). >>>>>>>> + * >>>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>>>>>>> + * that have not explicitly opted-in take benefit while capping the >>>>>>>> + * potential for internal fragmentation. >>>>>>>> + */ >>>>>>>> + >>>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); >>>>>>>> + >>>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >>>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); >>>>>>>> + >>>>>>>> + return order; >>>>>>>> +} >>>>>>> >>>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED. >>>>>>> 1. It's not used, since no archs at the moment implement >>>>>>> arch_wants_pte_order() that returns >64KB. >>>>>>> 2. As far as I know, there is no plan for any arch to do so. >>>>>> >>>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB >>>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB >>>>>> blocks without the proper THP hinting is a bad plan. >>>>>> >>>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and >>>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say. >>>>>> Its a negotiation between HW and SW for the LAF order, embodied in this policy. >>>>>> >>>>>>> 3. Again, it seems to me the rationale behind >>>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all. >>>>>>> >>>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please? >>>>>>> >>>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]: >>>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the >>>>>>> obvious reason. >>>>>>> >>>>>>> I thought we were on the same page, i.e., the "obvious reason" is that >>>>>>> h/w might prefer 0. But here you are not respecting 0. But then why >>>>>>> -1? >>>>>> >>>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the >>>>>> performance wins don't come solely from the HW. Batching up page faults is a big >>>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW >>>>>> preference of order-0 is possible to express through this API. But that doesn't >>>>>> mean that we don't listen to SW's preferences either. >>>>>> >>>>>> I would really rather leave it in; As I've mentioned in the past, we have a >>>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and >>>>>> this is the mechanism that means we don't dole out those 2MB blocks unless >>>>>> explicitly opted-in. >>>>>> >>>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait >>>>>> until I'm back to conclude on this, if you still take issue with the justification. >>>>> >>>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be >>>>> the default order even if LAF is enabled. But that does not make sense to me, since >>>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros) >>>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation >>>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes >>>>> most of LAF code not used. >>>> For the device with limited memory size and it still wants LAF enabled for some specific >>>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise >>>> to enable LAF for specific memory ranges. >>> >>> Do you have a use case? Or it is just a possible scenario? >> It's a possible scenario. Per my experience, it's valid use case for embedded >> system or low end android phone. >> >>> >>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB >>> base pages, 2MB folios (LAF in this config) would be desirable since THP is >>> 32MB/512MB and much harder to get. >>> >>>> >>>> So my understanding is it's possible case. But it's another configuration thing and not >>>> necessary to be finalized now. >>> >>> Basically, we are deciding whether LAF should use order-0 by default once it is >>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED, >>> your argument is that code change is needed to test the impact of LAF with >>> different orders. That seems to imply we actually need an extra knob (maybe sysctl) >>> to control the max LAF order. And with that extra knob, we can solve this default >>> order problem, since we can set it to 0 for devices want to opt in LAF and set >>> it N (like 64KB) for other devices want to opt out LAF. >> From performance tuning perspective, it's necessary to have knobs to configure and >> check the attribute of LAF. But we must be careful to add the knobs as they need >> be maintained for ever. > > If we do not want to maintain such a knob (since it may take some time to finalize) > and tweaking LAF order is important for us to explore different LAF configurations > (Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is > better in his use cases), we probably just put the LAF order knob in debugfs > like Ryan suggested before to move forward. Works for me. > > >>> >>> So maybe we need the extra knob for both testing purpose and serving different >>> device configuration purpose. >>> >>>>> >>>>> Also arch_wants_pte_order() might need a better name like >>>>> arch_wants_large_folio_order(). Since current name sounds like the specified order >>>>> is wanted by HW in a general setting, but it is not. It is an order HW wants >>>>> when LAF is enabled. That might cause some confusion. >>>>> >>>>>>> >>>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@xxxxxxxxxxxxxx/ >>>>> >>>>> >>>>> -- >>>>> Best Regards, >>>>> Yan, Zi >>> >>> >>> -- >>> Best Regards, >>> Yan, Zi > > > -- > Best Regards, > Yan, Zi