On Wed, Oct 25, 2023 at 09:39:19AM +0800, Yin, Fengwei wrote:
>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 0bd18de9fd97..2979d796ba9d 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >>  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>  					 unsigned long address, pte_t *ptep)
> >>  {
> >> -	int young = ptep_test_and_clear_young(vma, address, ptep);
> >> -
> >> -	if (young) {
> >> -		/*
> >> -		 * We can elide the trailing DSB here since the worst that can
> >> -		 * happen is that a CPU continues to use the young entry in its
> >> -		 * TLB and we mistakenly reclaim the associated page. The
> >> -		 * window for such an event is bounded by the next
> >> -		 * context-switch, which provides a DSB to complete the TLB
> >> -		 * invalidation.
> >> -		 */
> >> -		flush_tlb_page_nosync(vma, address);
> >> -	}
> >> -
> >> -	return young;
> >> +	/*
> >> +	 * This comment is borrowed from x86, but applies equally to ARM64:
> >> +	 *
> >> +	 * Clearing the accessed bit without a TLB flush doesn't cause
> >> +	 * data corruption. [ It could cause incorrect page aging and
> >> +	 * the (mistaken) reclaim of hot pages, but the chance of that
> >> +	 * should be relatively low. ]
> >> +	 *
> >> +	 * So as a performance optimization don't flush the TLB when
> >> +	 * clearing the accessed bit, it will eventually be flushed by
> >> +	 * a context switch or a VM operation anyway. [ In the rare
> >> +	 * event of it not getting flushed for a long time the delay
> >> +	 * shouldn't really matter because there's no real memory
> >> +	 * pressure for swapout to react to. ]
> >> +	 */
> >> +	return ptep_test_and_clear_young(vma, address, ptep);
> >>  }
>
> From https://lore.kernel.org/lkml/20181029105515.GD14127@xxxxxxx/:
>
> > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > the TLB on context switch. That means our window for keeping the stale
> > entries around is potentially much bigger and might not be a great idea.

I completely agree.

> My understanding is that arm64 doesn't invalidate the TLB during context
> switch. The flush_tlb_page_nosync() here + DSB during context switch
> make sure the TLB is invalidated during context switch.
> So we can't remove flush_tlb_page_nosync() here? Or something was changed
> for arm64 (I have zero knowledge of the TLB on arm64, so some obvious
> thing may be missed)? Thanks.

As you point out, we already elide the DSB here, but I don't think we
should remove the TLB invalidation entirely, because then we lose the
guarantee that the update ever becomes visible to the page-table walker.

I'm surprised that the TLBI is showing up as a performance issue without
the DSB present. Is it because we're walking over a large VA range and
invalidating on a per-page basis? If so, we'd be better off batching them
up and doing the invalidation at the end (which will be upgraded to a
full-mm invalidation if the range is large enough).
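Something like the (completely untested) sketch below, where
clear_young_range() is a made-up helper just to show the shape of the
batching. It assumes the [start, end) range sits within a single PTE
table, and the final flush_tlb_range() is the existing API, which arm64
upgrades to a full-mm invalidation once the range gets big enough:

static int clear_young_range(struct vm_area_struct *vma, pmd_t *pmd,
			     unsigned long start, unsigned long end)
{
	pte_t *start_pte, *pte;
	unsigned long addr;
	spinlock_t *ptl;
	int young = 0;

	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
	if (!pte)
		return 0;

	for (addr = start; addr < end; addr += PAGE_SIZE, pte++) {
		if (!pte_present(ptep_get(pte)))
			continue;

		/* Clear the access flag without a per-page TLBI... */
		young |= ptep_test_and_clear_young(vma, addr, pte);
	}

	pte_unmap_unlock(start_pte, ptl);

	/*
	 * ...and issue a single invalidation for the whole range once
	 * the walk is done. On arm64 this gets upgraded to a full-mm
	 * flush when the range spans more than MAX_TLBI_OPS pages.
	 */
	if (young)
		flush_tlb_range(vma, start, end);

	return young;
}

Will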