From: Barry Song <v-songbaohua@xxxxxxxx>

Whether it is done through hardware or software, TLB flushing is
usually extremely expensive. Since a page can be mapped by lots of
processes at the same time, in folio_referenced_one() each process
with pte_young triggers a TLB broadcast, which multiplies the overhead
of the TLB flush by the number of such mappings.

Some platforms have tried to remove the overhead of the TLB flush by
implementing their own ptep_clear_flush_young(), in which the flush is
dropped (x86, s390, powerpc, riscv) or deferred (arm64). This approach
obviously breaks the semantics of the API, since it is named "flush".
Dropping the flush in a function named "flush" isn't cool.

On ARM64, flush_tlb_page_nosync() is used in ptep_clear_flush_young()
as a cheaper replacement for the more expensive sync TLB broadcast
with dsb. But the cost of this nosync alternative has probably been
underestimated.

Profiling was done by running a program under high memory pressure on
a ROCK 3A board (RK3568, 64-bit quad-core Cortex-A55) with 4GB of
memory, using zRAM as the swap device.
In the program, 8 processes access one shared memory mapping as below:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
    #define MB (1024 * 1024)
            volatile unsigned char *p = mmap(NULL, 4096UL * MB,
                            PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            /* touch every page so that the whole mapping is populated */
            memset((void *)p, 0x11, 4096UL * MB);

            /*
             * simulate memory mapped by multiple processes, like libs,
             * .text sections, shmem; three forks give 8 processes
             */
            fork(); fork(); fork();

            while (1) {
                    int i;
                    /* randomly get an offset then access 1024 pages */
                    unsigned long offset = (rand() % MB);

                    if (offset + 1024 > MB)
                            offset = MB - 1024;
                    for (i = 0; i < 1024; i++)
                            (void)p[(offset + i) * 4096];
                    usleep(1000);
            }
    }

After removing "inline" from flush_tlb_page_nosync() as below,

<static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>static noinline void flush_tlb_page_nosync(struct vm_area_struct *vma,

the perf result for kswapd is quite surprising:

  19.63%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
  10.69%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_page_nosync
   6.73%  kswapd0  [kernel.kallsyms]  [k] folio_referenced_one
   5.92%  kswapd0  [kernel.kallsyms]  [k] zram_bvec_rw.constprop.0.isra.0
   4.55%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
   3.66%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
   2.87%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
   2.72%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
   2.03%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
   1.86%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
   1.86%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_pages
   1.78%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_unlock
   1.23%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
   1.15%  kswapd0  [kernel.kallsyms]  [k] PageHuge
   1.02%  kswapd0  [kernel.kallsyms]  [k] check_pte

If flush_tlb_page_nosync() is inlined, its overhead is counted against
its callers instead, which is why the profiling removes the inline.
The 10.69% overhead demonstrates that, on ARM64, we still need to move
to ptep_clear_young_notify() even after switching to the nosync tlbi.

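To illustrate why the cost scales with the number of mappers, here is a
minimal userspace sketch, not kernel code. It models one reference scan
over every mapping of a single page and counts how many TLB
invalidations a "clear and flush" policy would issue compared with a
"clear only" policy. The names toy_pte and toy_scan_references() are
hypothetical and exist only for this illustration; the model assumes
one invalidation per young PTE and ignores batching, deferral and the
mmu notifier side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: one "PTE" per process mapping the page. */
struct toy_pte {
	bool young;	/* stands in for the hardware access (A) bit */
};

/*
 * Model one reference scan over all mappings of a single page.
 *
 * Every young bit is cleared. When flush_on_young is true, each young
 * PTE also costs one modelled TLB invalidation (the "clear and flush"
 * policy); when false, none is issued (the "clear only" policy).
 *
 * Returns the number of mappings found referenced and stores the
 * number of modelled invalidations in *nr_flushes.
 */
static size_t toy_scan_references(struct toy_pte *ptes, size_t nr_mappers,
				  bool flush_on_young, size_t *nr_flushes)
{
	size_t referenced = 0;

	assert(ptes && nr_flushes);
	*nr_flushes = 0;

	for (size_t i = 0; i < nr_mappers; i++) {
		if (!ptes[i].young)
			continue;
		ptes[i].young = false;
		referenced++;
		if (flush_on_young)
			(*nr_flushes)++;
	}
	return referenced;
}
```

With 8 mappers whose PTEs are all young, as in the test program above,
one scan with flush_on_young set reports 8 references and 8 modelled
invalidations for that single page. The same scan with flush_on_young
clear reports the same 8 references and no invalidations, so the
reference information seen by the caller is unchanged in this model.
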
In addition to the commits that removed the flush on platforms such as
riscv, x86 and powerpc, Yu Zhao also listed other evidence in support
of moving to ptep_clear_young_notify() in vmscan during the discussion
of MGLRU:

* The fundamental hardware limitation in terms of TLB scalability [1]
* Alexander's benchmark [2]
* The TLB doesn't cache a stale pte young most of the time, so
  flushing the TLB just for the sake of the A-bit isn't necessary [3]

This patch solves the problem at the source, vmscan, so platforms
which haven't dropped the flush can probably benefit directly. ARM64,
with its lightweight tlbi, can in turn finally remove the overhead of
the nosync TLB flush.

Last but not least, MGLRU has no flush in look_around after clearing
pte young, so this patch also makes vmscan generally consistent with
the MGLRU approach.

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
[2] https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[3] https://lore.kernel.org/lkml/CAOUHufbOwPSbBwd7TG0QFt4YJvBp93Q9nUJEDvMpUA6PqjYMUQ@xxxxxxxxxxxxxx/

Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Alex Van Brunt <avanbrunt@xxxxxxxxxx>
Cc: Shaohua Li <shli@xxxxxxxxxx>
Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
---
 -v1 differences with rfc
 * refine commit log
 * investigate arm64's flush_tlb_page_nosync under memory pressure

 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
+			if (ptep_clear_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
 				 * Don't treat a reference through
-- 
2.25.1