On Fri, Mar 4, 2022 at 5:18 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Fri, Mar 04, 2022 at 05:55:58PM +0000, Ivan Teterevkov wrote:
> > Hi folks,
> >
> > I want to check whether there's a regression in the madvise(MADV_COLD)
> > behaviour with shared memory, or whether my understanding of how it works
> > is inaccurate.
> >
> > The MADV_COLD advice was introduced in Linux 5.4 and allows users to mark
> > selected memory ranges as more "inactive" than others, overruling the
> > default LRU accounting, which helps to preserve the working set of an
> > application. With more recent kernels, e.g. at least 5.17.0-rc6 and
> > 5.10.42, MADV_COLD has stopped working as expected. Please take a look
> > at a short program that demonstrates it:
> >
> > /*
> >  * madvise(MADV_COLD) demo.
> >  */
> > #include <assert.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <unistd.h>
> >
> > /* Requires kernel 5.4 or newer. */
> > #ifndef MADV_COLD
> > #define MADV_COLD 20
> > #endif
> >
> > #define GIB(x) ((size_t)(x) << 30)
> >
> > int main(void)
> > {
> >     char *shmem, *zeroes;
> >     int page_size = getpagesize();
> >     size_t i;
> >
> >     /* Allocate 8 GiB of shared memory. */
> >     shmem = mmap(/* addr   */ NULL,
> >                  /* length */ GIB(8),
> >                  /* prot   */ PROT_READ | PROT_WRITE,
> >                  /* flags  */ MAP_SHARED | MAP_ANONYMOUS,
> >                  /* fd     */ -1,
> >                  /* offset */ 0);
> >     assert(shmem != MAP_FAILED);
> >
> >     /* Allocate a zero page for future use. */
> >     zeroes = calloc(1, page_size);
> >     assert(zeroes != NULL);
> >
> >     /* Put a 1 GiB blob at the beginning of the shared memory range. */
> >     memset(shmem, 0xaa, GIB(1));
> >
> >     /* Read the memory adjacent to the blob. */
> >     for (i = GIB(1); i < GIB(8); i += page_size) {
> >         int res = memcmp(shmem + i, zeroes, page_size);
> >         assert(res == 0);
> >
> >         /* Cool down the zero page and make it "less active" than the
> >          * blob. Under memory pressure, it'll likely become a reclaim
> >          * target and thus will help to preserve the blob in memory.
> >          */
> >         res = madvise(shmem + i, page_size, MADV_COLD);
> >         assert(res == 0);
> >     }
> >
> >     /* Let the user check smaps. */
> >     printf("done\n");
> >     pause();
> >
> >     free(zeroes);
> >     munmap(shmem, GIB(8));
> >
> >     return 0;
> > }
> >
> > How to run this program:
> >
> > 1. Create a "test" cgroup with a memory limit of 3 GiB.
> >
> > 1.1. cgroup v1:
> >
> > # mkdir /sys/fs/cgroup/memory/test
> > # echo 3G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> >
> > 1.2. cgroup v2:
> >
> > # mkdir /sys/fs/cgroup/test
> > # echo 3G > /sys/fs/cgroup/test/memory.max
> >
> > 2. Enable at least a 1 GiB swap device.
> >
> > 3. Run the program in the "test" cgroup:
> >
> > # cgexec -g memory:test ./a.out
> >
> > 4. Wait until it has finished, i.e. has printed "done".
> >
> > 5. Check the shared memory VMA stats.
> >
> > 5.1. In 5.17.0-rc6 and 5.10.42:
> >
> > # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
> > 7f8ed4648000-7f90d4648000 rw-s 00000000 00:01 2055   /dev/zero (deleted)
> > Size:            8388608 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > Rss:             3119556 kB
> > Pss:             3119556 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:   3119556 kB
> > Private_Dirty:         0 kB
> > Referenced:            0 kB
> > Anonymous:             0 kB
> > LazyFree:              0 kB
> > AnonHugePages:         0 kB
> > ShmemPmdMapped:        0 kB
> > FilePmdMapped:         0 kB
> > Shared_Hugetlb:        0 kB
> > Private_Hugetlb:       0 kB
> > Swap:            1048576 kB
> > SwapPss:               0 kB
> > Locked:                0 kB
> > THPeligible:           0
> > VmFlags: rd wr sh mr mw me ms sd
> > 5.2. In 5.4.109:
> >
> > # cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
> > 7fca5f78b000-7fcc5f78b000 rw-s 00000000 00:01 173051   /dev/zero (deleted)
> > Size:            8388608 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > Rss:             3121504 kB
> > Pss:             3121504 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:   2072928 kB
> > Private_Dirty:   1048576 kB
> > Referenced:            0 kB
> > Anonymous:             0 kB
> > LazyFree:              0 kB
> > AnonHugePages:         0 kB
> > ShmemPmdMapped:        0 kB
> > FilePmdMapped:         0 kB
> > Shared_Hugetlb:        0 kB
> > Private_Hugetlb:       0 kB
> > Swap:                  0 kB
> > SwapPss:               0 kB
> > Locked:                0 kB
> > THPeligible:           0
> > VmFlags: rd wr sh mr mw me ms
> >
> > There's a noticeable difference in the "Swap" reports: the older kernel
> > doesn't swap the blob, but the newer ones do.
> >
> > According to ftrace, the newer kernels still call deactivate_page() in
> > madvise_cold():
> >
> > # trace-cmd record -p function_graph -g madvise_cold
> > # trace-cmd report | less
> > a.out-4877 [000] 1485.266106: funcgraph_entry: | madvise_cold() {
> > a.out-4877 [000] 1485.266115: funcgraph_entry: |   walk_page_range() {
> > a.out-4877 [000] 1485.266116: funcgraph_entry: |     __walk_page_range() {
> > a.out-4877 [000] 1485.266117: funcgraph_entry: |       madvise_cold_or_pageout_pte_range() {
> > a.out-4877 [000] 1485.266118: funcgraph_entry: 0.179 us |         deactivate_page();
> >
> > (The irrelevant bits are removed for brevity.)
> >
> > It makes me think there may be a regression in MADV_COLD. Please let me
> > know what you reckon.
>
> Since deactivate_page() is called, I guess that's not a regression(?)
> from [1].
>
> Then, my random guess is that the "Swap" difference you mention might be
> related to "workingset detection for anon pages", since the kernel changed
> the balancing policy between the file and anonymous LRUs; that change was
> merged into v5.8. It would be helpful to see if you could try it on v5.7
> and v5.8.
>
> [1] 12e967fd8e4e6, mm: do not allow MADV_PAGEOUT for CoW page

Yes, I've noticed this for a while. With commit b518154e59a ("mm/vmscan:
protect the workingset on anonymous LRU"), anon/shmem pages start on the
inactive LRU, and in that case deactivate_page() is a NOP. Before 5.9,
anon/shmem pages started on the active LRU, so deactivate_page() moved the
zero pages in the test to the inactive LRU and therefore protected the
"blob".
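For reference, the no-op is visible right at the entry of deactivate_page()
in mm/swap.c (an abridged v5.9+ excerpt, comments mine, not a complete
function):

void deactivate_page(struct page *page)
{
	/*
	 * Since b518154e59a, shmem pages enter the LRU without PG_active,
	 * so the PageActive() test below fails for the zero pages in the
	 * demo and MADV_COLD leaves their LRU position untouched.
	 */
	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
		/* ... queue the page for lru_deactivate_fn() ... */
	}
}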
This should fix the problem for your test case:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..7fd99f037ca7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -563,7 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -677,7 +673,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page)) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);

I'll leave it to Minchan to decide whether this is worth fixing, together
with this one:

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..2f142f09c8e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -529,10 +529,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	if (PageUnevictable(page))
 		return;
 
-	/* Some processes are using the page */
-	if (page_mapped(page))
-		return;
-
 	del_page_from_lru_list(page, lruvec);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
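As an aside, for anyone re-running the repro against a patched kernel:
instead of grepping smaps by hand in step 5, a small helper like the one
below can print just the Swap counter of the demo's 8 GiB mapping. This is
an untested sketch, not part of the patches; it keys off the 8388608 kB
Size line, which is unique to the demo's mapping:

/* smaps_swap.c - print the Swap: field of the demo's 8 GiB VMA. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	int in_vma = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* The demo's mapping is the only 8388608 kB (8 GiB) VMA. */
		if (strstr(line, "8388608 kB"))
			in_vma = 1;
		/* "Swap:" follows "Size:" within the same smaps block. */
		if (in_vma && !strncmp(line, "Swap:", 5)) {
			fputs(line, stdout); /* expect 0 kB once fixed */
			break;
		}
	}
	fclose(f);
	return 0;
}

Usage: ./smaps_swap $(pidof a.out)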