On 27/03/2024 09:28, David Hildenbrand wrote:
> On 26.03.24 17:39, Ryan Roberts wrote:
>> On 26/03/2024 16:27, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> With the introduction of contpte mapping support for arm64, that
>>>> architecture's implementation of ptep_get_lockless() has become very
>>>> complex due to the need to gather access and dirty bits from across all
>>>> of the ptes in the contpte block. This requires careful implementation
>>>> to ensure the returned value is consistent (because it's not possible to
>>>> read all ptes atomically), but even in the common case when there is no
>>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>>> cost if the core-mm is iterating over a range, and performing a
>>>> ptep_get_lockless() on each pte.
>>>>
>>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>>> make any guarantees about access and dirty bits. Therefore it can simply
>>>> read the single target pte.
>>>>
>>>> At the same time, convert all call sites that previously used
>>>> ptep_get_lockless() but don't care about access and dirty state.
>>>>
>>>
>>> I'd probably split that part off.
>>
>> I thought the general guidance was to introduce new APIs in the same patch
>> they are first used in? If I split this off, I'll have one patch for a new
>> (unused) API, then another for the first users.
>
> I don't know what exact guidance there is, but I tend to leave "non-trivial
> changes" to separate patches.
>
> Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that
> we can perform them here.
>
> At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

Got it.

>
>>
>>>
>>>> We may want to do something similar for ptep_get() (i.e.
>>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>>> problem because the PTL serializes it with any modifications, but does
>>>> suffer the same O(n^2) cost.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>> ---
>>>>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>>   kernel/events/core.c    |  2 +-
>>>>   mm/hugetlb.c            |  2 +-
>>>>   mm/khugepaged.c         |  2 +-
>>>>   mm/memory.c             |  2 +-
>>>>   mm/swap_state.c         |  2 +-
>>>>   mm/swapfile.c           |  2 +-
>>>>   7 files changed, 40 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a36cf4e124b0..9dd40fdbd825 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>>
>>>> -/*
>>>> - * We require that the PTE can be read atomically.
>>>> - */
>>>>   #ifndef ptep_get_lockless
>>>> +/**
>>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>>>> + *                     dirty bits are guaranteed to accurately reflect the state
>>>> + *                     of the pte at the time of the call.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * If young and dirty information is not required, use
>>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using
>>>> + * ptep_get(), on the assumption that it is atomic.
>>>> + *
>>>> + * Context: Any.
>>>> + */
>>>
>>> I think we usually say "Any context.". But I would just do it like idr.h:
>>>
>>> "Any context. It is safe to call this function without locking in your code."
>>>
>>> ... but is this true? We really want to say "without page table lock".
>>> Because there must be some way to protect against concurrent page table
>>> freeing. For example, GUP-fast disables IRQs, whereby page table freeing
>>> code frees using RCU.
>>
>> How about:
>>
>> "
>> Context: Any context that guarrantees the page table can't be freed
>
> s/guarrantees/guarantees/
>
>> concurrently. The page table lock is not required.
>> "
>>
>
> Sounds good.

Great!
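To make that wording concrete for myself (purely an illustrative sketch based
on your GUP-fast example - not code from this patch), the caller provides the
"can't be freed concurrently" guarantee by something like:

        unsigned long flags;
        pte_t pte;

        /*
         * Disabling IRQs holds off the RCU (or IPI-broadcast) page table
         * freeing path, so the table backing ptep can't be freed while we
         * perform the lockless read; no page table lock is taken.
         */
        local_irq_save(flags);
        pte = ptep_get_lockless(ptep);
        local_irq_restore(flags);

IRQs-off is just the GUP-fast flavour of the guarantee; anything else that
holds off the freeing path would satisfy the proposed Context wording equally.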
>
>>>
>>>>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>>   {
>>>>           return ptep_get(ptep);
>>>>   }
>>>>   #endif
>>>>
>>>> +#ifndef ptep_get_lockless_norecency
>>>> +/**
>>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>>>> + *                               Young and dirty bits may not be accurate.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>>>> + * required since it can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using the more
>>>> + * precise ptep_get_lockless().
>>>> + *
>>>> + * Context: Any.
>>>
>>> Same comment.
>>>
>>>> + */
>>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>>> +{
>>>> +        return ptep_get_lockless(ptep);
>>>> +}
>>>> +#endif
>>>
>>> [...]
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 68283e54c899..41dc44eb8454 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>>>           }
>>>>
>>>>           if (pte) {
>>>> -                pte_t pteval = ptep_get_lockless(pte);
>>>> +                pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>
>>>>                   BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>           }
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>>>>                   }
>>>>           }
>>>>
>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>           if (!is_swap_pte(vmf.orig_pte))
>>>>                   continue;
>>>
>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>
>> Yeah, good point. So I guess this should move to patch 3 (which may be
>> dropped - tbd)?
>
> Yes. Or a separate one where you explain in detail why do_swap_page() can
> handle it just fine.

Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
because we only leak vmf.orig_pte to the rest of the world if it's a swap
entry. And if it's a swap entry, then ptep_get_lockless_norecency() is
equivalent to ptep_get_lockless() - the pte is not present, so there are no
access or dirty bits. So I think this can stay here?
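To spell that out with a condensed sketch of the __collapse_huge_page_swapin()
flow above (my paraphrase of the hunk, eliding the surrounding setup - not new
code):

        vmf.orig_pte = ptep_get_lockless_norecency(pte);
        if (!is_swap_pte(vmf.orig_pte))
                continue;       /* the norecency value is dropped here */

        /*
         * Only reached for swap ptes: a non-present pte has no young/dirty
         * bits, so ptep_get_lockless_norecency() returned exactly the value
         * ptep_get_lockless() would have, before we hand off to
         * do_swap_page().
         */
        ret = do_swap_page(&vmf);

i.e. the only orig_pte that is acted on beyond the is_swap_pte() check is one
for which the two helpers are guaranteed to agree.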