On 26/03/2024 16:27, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> With the introduction of contpte mapping support for arm64, that
>> architecture's implementation of ptep_get_lockless() has become very
>> complex due to the need to gather access and dirty bits from across all
>> of the ptes in the contpte block. This requires careful implementation
>> to ensure the returned value is consistent (because its not possible to
>> read all ptes atomically), but even in the common case when there is no
>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>> cost if the core-mm is iterating over a range, and performing a
>> ptep_get_lockless() on each pte.
>>
>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>> make any guarantees about access and dirty bits. Therefore it can simply
>> read the single target pte.
>>
>> At the same time, convert all call sites that previously used
>> ptep_get_lockless() but don't care about access and dirty state.
>>
>
> I'd probably split that part off.

I thought the general guidance was to introduce new APIs in the same patch
they are first used in? If I split this off, I'll have one patch for a new
(unused) API, then another for the first users.

>
>> We may want to do something similar for ptep_get() (i.e.
>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>> problem because the PTL serializes it with any modifications, but does
>> suffer the same O(n^2) cost.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>  include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>  kernel/events/core.c    |  2 +-
>>  mm/hugetlb.c            |  2 +-
>>  mm/khugepaged.c         |  2 +-
>>  mm/memory.c             |  2 +-
>>  mm/swap_state.c         |  2 +-
>>  mm/swapfile.c           |  2 +-
>>  7 files changed, 40 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a36cf4e124b0..9dd40fdbd825 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>  #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>  #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>
>> -/*
>> - * We require that the PTE can be read atomically.
>> - */
>>  #ifndef ptep_get_lockless
>> +/**
>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>> + *                     dirty bits are guaranteed to accurately reflect the state
>> + *                     of the pte at the time of the call.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * If young and dirty information is not required, use
>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using
>> + * ptep_get(), on the assumption that it is atomic.
>> + *
>> + * Context: Any.
>> + */
>
> I think we usually say "Any context.". But I would just do it like idr.h:
>
> "Any context. It is safe to call this function without locking in your code."
>
> ... but is this true? We really want to say "without page table lock". Because
> there must be some way to prevent against concurrent page table freeing. For
> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.

How about:

"
Context: Any context that guarantees the page table can't be freed
concurrently. The page table lock is not required.
"
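
To illustrate the kind of caller contract that wording is trying to capture,
here's a rough sketch (the function name and comments are mine, purely
illustrative - this is not the actual GUP-fast code):

static pte_t read_pte_lockless(pte_t *ptep)
{
        unsigned long flags;
        pte_t pte;

        /*
         * Disabling IRQs blocks the TLB shootdown IPI (or acts as an
         * RCU read-side critical section on configs that free page
         * tables via RCU), so the page table containing @ptep can't
         * be freed while we read from it. No page table lock is held.
         */
        local_irq_save(flags);
        pte = ptep_get_lockless_norecency(ptep);
        local_irq_restore(flags);

        return pte;
}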
" > >> static inline pte_t ptep_get_lockless(pte_t *ptep) >> { >> return ptep_get(ptep); >> } >> #endif >> >> +#ifndef ptep_get_lockless_norecency >> +/** >> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock. >> + * Young and dirty bits may not be accurate. >> + * @ptep: Page table pointer for pte to get. >> + * >> + * Prefer this over ptep_get_lockless() when young and dirty information is not >> + * required since it can be faster on some architectures. >> + * >> + * May be overridden by the architecture; otherwise, implemented using the more >> + * precise ptep_get_lockless(). >> + * >> + * Context: Any. > > Same comment. > >> + */ >> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep) >> +{ >> + return ptep_get_lockless(ptep); >> +} >> +#endif > > [...] > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 68283e54c899..41dc44eb8454 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct >> vm_area_struct *vma, >> } >> >> if (pte) { >> - pte_t pteval = ptep_get_lockless(pte); >> + pte_t pteval = ptep_get_lockless_norecency(pte); >> >> BUG_ON(pte_present(pteval) && !pte_huge(pteval)); >> } >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 2771fc043b3b..1a6c9ed8237a 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct >> *mm, >> } >> } >> >> - vmf.orig_pte = ptep_get_lockless(pte); >> + vmf.orig_pte = ptep_get_lockless_norecency(pte); >> if (!is_swap_pte(vmf.orig_pte)) >> continue; > > > Hm, I think you mentioned that we want to be careful with vmf.orig_pte. Yeah good point. So I guess this should move to patch 3 (which may be dropped - tbd)?