On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@xxxxxxxxxxxx> It's never intended for MGLRU to support page_idle/bitmap or PG_idle/young because: 1. page_idle/bitmap isn't a capable interface at all -- yes, Google proposed the idea [1], but we don't really use it anymore because of its poor scalability. 2. PG_idle/young, being a boolean value, has poor granularity. If anyone must use page_idle/bitmap for some specific reason, I'd recommend exporting generation numbers instead. [1] https://lore.kernel.org/cover.1426706637.git.vdavydov@xxxxxxxxxxxxx/