On Fri, Dec 15, 2023 at 15:23 Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> Regarding the change itself, it'd cause a slight regression to other
> use cases (details below).
>
> > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >         unsigned long pfn;
> >         struct folio *folio;
> >         pte_t ptent = ptep_get(pte + i);
> > +       bool is_pte_young;
> >
> >         total++;
> >         walk->mm_stats[MM_LEAF_TOTAL]++;
> >
> > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >         if (pfn == -1)
> >                 continue;
> >
> > -       if (!pte_young(ptent)) {
> > -               walk->mm_stats[MM_LEAF_OLD]++;
>
> Most overhead from page table scanning normally comes from
> get_pfn_folio() because it almost always causes a cache miss. This is
> like a pointer dereference, whereas scanning PTEs is like streaming an
> array (bad vs good cache performance).
>
> pte_young() is here to avoid an unnecessary cache miss from
> get_pfn_folio(). Also see the first comment in get_pfn_folio(). It
> should be easy to verify the regression -- FlameGraph from the
> memcached benchmark in the original commit message should do it.
>
> Would a tracepoint here work for you?
>
> > +       is_pte_young = !!pte_young(ptent);
> > +       folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> > +       if (!folio) {
> > +               if (!is_pte_young)
> > +                       walk->mm_stats[MM_LEAF_OLD]++;
> >                 continue;
> >         }
> >
> > -       folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > -       if (!folio)
> > +       if (!folio_test_clear_young(folio) && !is_pte_young) {
> > +               walk->mm_stats[MM_LEAF_OLD]++;
> >                 continue;
> > +       }
> >
> > -       if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > +       if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
> >                 VM_WARN_ON_ONCE(true);
> >
> >         young++;

Thanks for replying.

To avoid both of the following:

1. the conflict between page_idle/bitmap and the MGLRU scan, and
2. the performance regression in the MGLRU page-table scan when
   get_pfn_folio() is called for every PTE,

we have a new idea:

1. Add a new API under /sys/kernel/mm/page_idle that only marks the
   idle flag, without walking the rmap or clearing the PTE young bits.
2. Use an MGLRU proactive scan to clear the page idle flags.

The workflow:

        t1                              t2
        mark pages idle                 MGLRU scans and checks the idle flags

This makes it easy to tell whether a page was accessed between t1 and t2.
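To make item 1 concrete, here is a rough sketch of what the mark-only
write handler could look like. It is modeled on the existing
page_idle_bitmap_write() in mm/page_idle.c, with the only difference
being that it never calls page_idle_clear_pte_refs(), so there is no
rmap walk and the PTE young bits are left untouched. The handler name
and the sysfs file it would back ("mark_idle") are placeholders, not an
existing interface:

/*
 * Sketch only: a mark-idle-only variant of page_idle_bitmap_write().
 * For each pfn whose bit is set in the user-supplied bitmap, set the
 * idle flag on the folio; unlike the existing "bitmap" file, never
 * call page_idle_clear_pte_refs(), i.e. no rmap walk and no clearing
 * of the PTE young bits.
 */
static ssize_t page_idle_mark_write(struct file *file, struct kobject *kobj,
                                    struct bin_attribute *attr, char *buf,
                                    loff_t pos, size_t count)
{
        u64 *in = (u64 *)buf;
        struct folio *folio;
        unsigned long pfn, end_pfn;
        int bit;

        if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
                return -EINVAL;

        pfn = pos * BITS_PER_BYTE;
        if (pfn >= max_pfn)
                return -ENXIO;

        end_pfn = pfn + count * BITS_PER_BYTE;
        if (end_pfn > max_pfn)
                end_pfn = max_pfn;

        for (; pfn < end_pfn; pfn++) {
                bit = pfn % BITMAP_CHUNK_BITS;
                if ((*in >> bit) & 1) {
                        folio = page_idle_get_folio(pfn);
                        if (folio) {
                                /* set the idle flag only; leave the PTEs alone */
                                folio_set_idle(folio);
                                folio_put(folio);
                        }
                }
                if (bit == BITMAP_CHUNK_BITS - 1)
                        in++;
                cond_resched();
        }

        return (char *)in - buf;
}

At t2, the proactive scan could presumably be driven through the
existing lru_gen debugfs interface (e.g. writing "+ memcg_id node_id
max_gen" to /sys/kernel/debug/lru_gen, per
Documentation/admin-guide/mm/multigen_lru.rst), and the idle flags that
survive the scan read back through the existing
/sys/kernel/mm/page_idle/bitmap.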
On the MGLRU side, the code changes would look like these: we clear the
idle flag in get_pfn_folio(), while walk_pte_range() keeps its original
design.

static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
                                   struct pglist_data *pgdat, bool can_swap,
                                   bool clear_idle)
{
        struct folio *folio;

        /* try to avoid unnecessary memory loads */
        if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
                return NULL;

        folio = pfn_folio(pfn);
+
+       if (clear_idle && folio_test_idle(folio))
+               folio_clear_idle(folio);
+
        if (folio_nid(folio) != pgdat->node_id)
                return NULL;

        if (folio_memcg_rcu(folio) != memcg)
                return NULL;

        /* file VMAs can contain anon pages from COW */
        if (!folio_is_file_lru(folio) && !can_swap)
                return NULL;

        return folio;
}

static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
                           struct mm_walk *args)
{
        ...
        for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
                unsigned long pfn;
                struct folio *folio;
                pte_t ptent = ptep_get(pte + i);

                total++;
                walk->mm_stats[MM_LEAF_TOTAL]++;

                pfn = get_pte_pfn(ptent, args->vma, addr);
                if (pfn == -1)
                        continue;

                if (!pte_young(ptent)) {
                        walk->mm_stats[MM_LEAF_OLD]++;
                        continue;
                }

-               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
+               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, true);
                if (!folio)
                        continue;
        ...
        }

Is this a good idea?

--
2.43.0