On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@xxxxxxxxxxxx> Regarding the change itself, it'd cause a slight regression to other use cases (details below). > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > unsigned long pfn; > struct folio *folio; > pte_t ptent = ptep_get(pte + i); > + bool is_pte_young; > > total++; > walk->mm_stats[MM_LEAF_TOTAL]++; > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > if (pfn == -1) > continue; > > - if (!pte_young(ptent)) { > - walk->mm_stats[MM_LEAF_OLD]++; Most overhead from page table scanning normally comes from get_pfn_folio() because it almost always causes a cache miss. This is like a pointer dereference, whereas scanning PTEs is like streaming an array (bad vs good cache performance). pte_young() is here to avoid an unnecessary cache miss from get_pfn_folio(). Also see the first comment in get_pfn_folio(). It should be easy to verify the regression -- FlameGraph from the memcached benchmark in the original commit message should do it. Would a tracepoint here work for you? > + is_pte_young = !!pte_young(ptent); > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); > + if (!folio) { > + if (!is_pte_young) > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > } > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); > - if (!folio) > + if (!folio_test_clear_young(folio) && !is_pte_young) { > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > + } > > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) > VM_WARN_ON_ONCE(true); > > young++;