From: Liu Jingqi <jingqi.liu@xxxxxxxxx>

Introduce the MPOL_MF_SW_YOUNG flag for move_pages(). When it is set,
pages that are already in DRAM will get PG_referenced set.

Background:
The user space migration daemon frequently scans the page table and
read-clears the accessed bits to detect hot/cold pages, then migrates
hot pages from the PMEM node to the DRAM node. While doing so, it also
tells the kernel that these pages are the hot page set. This maintains
a persistent view of hot/cold pages between the kernel and the user
space daemon.

The more concrete steps are

1) do multiple scans of the page table, count accessed bits
2) highest accessed count => hot pages
3) call move_pages(hot pages, DRAM nodes, MPOL_MF_SW_YOUNG)

(1) regularly clears PTE young, which makes the kernel lose access to
    PTE young information

(2) for anonymous pages, the user space daemon defines which is hot and
    which is cold

(3) conveys the user space view of hot/cold pages to the kernel through
    PG_referenced

In the long run, most hot pages could already be in DRAM.
move_pages(MPOL_MF_SW_YOUNG) sets PG_referenced for those hot pages
that are already in DRAM, but not for newly migrated hot pages. The
latter are expected to be put at the end of the LRU, so they have long
enough time in the LRU to gather accessed/PG_referenced bits and prove
to the kernel that they are really hot.

The daemon may select only DRAM/2 pages as hot, for 2 purposes:
- avoid thrashing, e.g. some warm pages get promoted and then demoted
  soon after
- make sure enough DRAM LRU pages look "cold" to the kernel, so that
  vmscan won't run into trouble busy scanning the LRU lists

Signed-off-by: Liu Jingqi <jingqi.liu@xxxxxxxxx>
Signed-off-by: Fengguang Wu <fengguang.wu@xxxxxxxxx>
---
 mm/migrate.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- linux.orig/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
+++ linux/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
@@ -55,6 +55,8 @@
 
 #include "internal.h"
 
+#define MPOL_MF_SW_YOUNG (1<<7)
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -1484,12 +1486,13 @@ static int do_move_pages_to_node(struct
  * the target node
  */
 static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
-		int node, struct list_head *pagelist, bool migrate_all)
+		int node, struct list_head *pagelist, int flags)
 {
 	struct vm_area_struct *vma;
 	struct page *page;
 	unsigned int follflags;
 	int err;
+	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
 
 	down_read(&mm->mmap_sem);
 	err = -EFAULT;
@@ -1519,6 +1522,8 @@ static int add_page_for_migration(struct
 
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
+			if (flags & MPOL_MF_SW_YOUNG)
+				SetPageReferenced(page);
 			isolate_huge_page(page, pagelist);
 			err = 0;
 		}
@@ -1531,6 +1536,8 @@ static int add_page_for_migration(struct
 			goto out_putpage;
 
 		err = 0;
+		if (flags & MPOL_MF_SW_YOUNG)
+			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
@@ -1606,7 +1613,7 @@ static int do_pages_move(struct mm_struc
 		 * report them via status
 		 */
 		err = add_page_for_migration(mm, addr, current_node,
-				&pagelist, flags & MPOL_MF_MOVE_ALL);
+				&pagelist, flags);
 		if (!err)
 			continue;
 
@@ -1725,7 +1732,7 @@ static int kernel_move_pages(pid_t pid,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_SW_YOUNG))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
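
For illustration only, the user space side of steps 2) and 3) in the
changelog might look roughly like the sketch below. It is not the actual
daemon: select_hot_pages() and promote_hot_pages() are hypothetical
helper names, the fixed min_count threshold is a simplification of
"highest accessed count", and the accessed counts are assumed to have
been gathered elsewhere by the page table scan. MPOL_MF_SW_YOUNG is
redefined locally with the value proposed in this patch, since no uapi
header exports it. The raw syscall is used so that libnuma is not
required.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same value as in <linux/mempolicy.h>; repeated to avoid a libnuma dependency. */
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)
#endif
/* Proposed by this patch only; an unpatched kernel rejects it with EINVAL. */
#define MPOL_MF_SW_YOUNG (1 << 7)

/*
 * Hypothetical helper for step 2): copy into hot[] the addresses of pages
 * whose accessed count is at least min_count, stopping after max_hot
 * entries (e.g. half of the DRAM pages). Returns the number selected.
 */
size_t select_hot_pages(void *const *pages, const unsigned *counts, size_t n,
			unsigned min_count, void **hot, size_t max_hot)
{
	size_t nr_hot = 0;

	assert(pages && counts && hot);
	for (size_t i = 0; i < n && nr_hot < max_hot; i++)
		if (counts[i] >= min_count)
			hot[nr_hot++] = pages[i];
	return nr_hot;
}

/*
 * Hypothetical helper for step 3): ask the kernel to move the selected
 * pages to dram_node and pass the hot hint. nodes[] and status[] must
 * each have room for nr_hot entries. Returns the raw syscall result.
 */
long promote_hot_pages(pid_t pid, void **hot, size_t nr_hot, int dram_node,
		       int *nodes, int *status)
{
	for (size_t i = 0; i < nr_hot; i++)
		nodes[i] = dram_node;

	return syscall(SYS_move_pages, pid, (unsigned long)nr_hot, hot,
		       nodes, status, MPOL_MF_MOVE | MPOL_MF_SW_YOUNG);
}
```

A caller would check the return value and errno of promote_hot_pages(),
then inspect status[] per page as with any other move_pages() call.
Capping max_hot at roughly DRAM/2 pages corresponds to the thrashing
and vmscan considerations described in the changelog.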