>>>>>>>> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> -
>>>>>>>> -		if (!pageout && pte_young(ptent)) {
>>>>>>>> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> -							tlb->fullmm);
>>>>>>>> -			ptent = pte_mkold(ptent);
>>>>>>>> -			set_pte_at(mm, addr, pte, ptent);
>>>>>>>> -			tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> +		if (!pageout) {
>>>>>>>> +			for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>> +				if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>> +					tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>
>>>>> IIRC, some architectures (e.g. PPC) don't update the TLB with set_pte_at() and
>>>>> tlb_remove_tlb_entry(). So, didn't we consider remapping the PTE as old after
>>>>> clearing it?
>>>>
>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>>>> there is a good reason to do the original clear-mkold-set for some arches?
>>>
>>> IIRC, some architectures (e.g. PPC) don't update the TLB with
>>> ptep_test_and_clear_young()
>>> and tlb_remove_tlb_entry().

Afraid I'm still struggling with this comment. Do you mean to say that powerpc
invalidates the TLB entry as part of the call to ptep_test_and_clear_young()?
So tlb_remove_tlb_entry() would be redundant here, and likely cause performance
degradation on that architecture?

IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
entry; that's what ptep_clear_flush_young() is for. But I do see that for some
cases of 32-bit ppc, there appears to be a flush:

#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
					      unsigned long addr, pte_t *ptep)
{
	unsigned long old;
	old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
	if (old & _PAGE_HASHPTE)
		flush_hash_entry(mm, ptep, addr);	<<<<<<<<
	return (old & _PAGE_ACCESSED) != 0;
}
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
	__ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)

Is that what you are describing? Does anyone know why flush_hash_entry() is
called? I'd say that's a bug in ppc and not a reason to avoid using
ptep_test_and_clear_young() in the common code!

Thanks,
Ryan

>>
>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
>> address please" - albeit it's deferred and batched. I'll look into this.
>>
>>>
>>> In my new patch[1], I use refresh_full_ptes() and
>>> tlb_remove_tlb_entries() to batch-update the
>>> access and dirty bits.
>>
>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
>> perform well on arm64 when using contpte mappings; it will cause the contpte
>> mapping to be unfolded by the first clear that touches the contpte block, then
>> refolded by the last set to touch the block. That's expensive.
>> ptep_test_and_clear_young() doesn't suffer that problem.
>
> Thanks for explaining. I got it.
>
> I think that other architectures will benefit from the per-pte clear-modify-set
> approach. IMO, refresh_full_ptes() can be overridden by arm64.
>
> Thanks,
> Lance
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@xxxxxxxxx
>>>
>>> Thanks,
>>> Lance
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>
>>>>>
>>>>>>>> +			}
>>>>>>>
>>>>>>> This looks so smart. If it is not pageout, we have already advanced pte
>>>>>>> and addr here, so nr is 0 and we don't need to advance them again in
>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>
>>>>>>> Otherwise, nr won't be 0, so we will advance addr and
>>>>>>> pte by nr.
>>>>>>
>>>>>> Indeed.
>>>>>> I'm hoping that Lance is able to follow a similar pattern for
>>>>>> madvise_free_pte_range().
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>  	}
>>>>>>>>
>>>>>>>>  /*
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> Overall, LGTM,
>>>>>>>
>>>>>>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>
>>
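
For reference, a minimal sketch of the batched "mark old" loop discussed above,
as something madvise_free_pte_range() could reuse. The helper name
madvise_pte_batch_mkold() is invented for illustration and is not part of the
posted patch; ptep_test_and_clear_young() and tlb_remove_tlb_entry() are the
existing kernel primitives. The point is that each PTE keeps its mapping, so
arm64 contpte blocks stay folded, and a TLB invalidation is only queued when
the access bit was actually set:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlb.h>

/*
 * Hypothetical helper (name invented for illustration): age 'nr'
 * consecutive present PTEs without the per-PTE clear/mkold/set cycle.
 * Each PTE keeps its mapping; only the access bit is cleared, and a
 * deferred TLB invalidation is queued only for PTEs that were young.
 */
static inline void madvise_pte_batch_mkold(struct mmu_gather *tlb,
					   struct vm_area_struct *vma,
					   pte_t *pte, unsigned long addr,
					   unsigned int nr)
{
	for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
		if (ptep_test_and_clear_young(vma, addr, pte))
			tlb_remove_tlb_entry(tlb, pte, addr);
	}
}

A caller would then advance its own pte/addr cursors by nr, just as the outer
loop in the quoted diff does with pte += nr, addr += nr * PAGE_SIZE.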