From: Huang Ying <ying.huang@xxxxxxxxx> The original assumption of auto NUMA balancing is that the memory privately or mainly accessed by the CPUs of a NUMA node (called private memory) should fit the memory size of the NUMA node. So if a page is identified to be private memory, it will be migrated to the target node immediately. Eventually all private memory will be migrated. But this assumption isn't true in memory tiering system. In a typical memory tiering system, there are CPUs, fast memory, and slow memory in each physical NUMA node. The CPUs and the fast memory will be put in one logical node (called fast memory node), while the slow memory will be put in another (faked) logical node (called slow memory node). To take full advantage of the system resources, it's common that the size of the private memory of the workload is larger than the memory size of the fast memory node. To resolve the issue, we will try to migrate only the hot pages in the private memory to the fast memory node. A private page that was accessed at least twice in the current and the last scanning periods will be identified as the hot page and migrated. Otherwise, the page isn't considered hot enough to be migrated. To record whether a page is accessed in the last scanning period, the Accessed bit of the PTE/PMD is used. When the page tables are scanned for autonuma, if the pte_protnone(pte) is true, the page isn't accessed in the last scan period, and the Accessed bit will be cleared, otherwise the Accessed bit will be kept. When NUMA page fault occurs, if the Accessed bit is set, the page has been accessed at least twice in the current and the last scanning period and will be migrated. The Accessed bit of PTE/PMD is used by page reclaiming too. So the conflict is possible. Considering the following situation, a) the page is moved from active list to inactive list with Accessed bit cleared b) the page is accessed, so Accessed bit is set c) the page table is scanned by autonuma, PTE is set to PROTNONE+Accessed c) the page isn't accessed d) the page table is scanned by autonuma again, Accessed is cleared e) the inactive list is scanned for reclaiming, the page is reclaimed wrongly because Accessed bit is cleared by autonuma Although the page is reclaimed wrongly, it hasn't been accessed for one numa balancing scanning period at least. So the page isn't so hot too. That is, this shouldn't be a severe issue. The patch improves the score of pmbench memory accessing benchmark with 80:20 read/write ratio and normal access address distribution by 3.1% on a 2 socket Intel server with Optance DC Persistent Memory. In the test, the number of the promoted pages for autonuma reduces 7.2% because the pages fail to pass the twice access checking. Problems: - how to adjust scanning period upon hot page identification requirement. E.g. if the count of page promotion is much larger than free memory, we need to scan faster to identify really hot pages. But his will trigger too many page faults too. Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxx> Cc: Rik van Riel <riel@xxxxxxxxxx> Cc: Mel Gorman <mgorman@xxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Ingo Molnar <mingo@xxxxxxxxxx> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> Cc: Dan Williams <dan.j.williams@xxxxxxxxx> Cc: Fengguang Wu <fengguang.wu@xxxxxxxxx> Cc: linux-kernel@xxxxxxxxxxxxxxx Cc: linux-mm@xxxxxxxxx --- mm/huge_memory.c | 13 ++++++++++++- mm/memory.c | 28 +++++++++++++++------------- mm/mprotect.c | 11 ++++++++++- 3 files changed, 37 insertions(+), 15 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 61e241ce20fa..7634fb22931b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1528,6 +1528,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) if (unlikely(!pmd_same(pmd, *vmf->pmd))) goto out_unlock; + /* Only migrate if accessed twice */ + if (!pmd_young(*vmf->pmd)) + goto out_unlock; + /* * If there are potential migrations, wait for completion and retry * without disrupting NUMA hinting information. Do not relock and @@ -1948,8 +1952,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, if (is_huge_zero_pmd(*pmd)) goto unlock; - if (pmd_protnone(*pmd)) + if (pmd_protnone(*pmd)) { + /* + * PMD young bit is used to record whether the + * page is accessed in last scan period + */ + if (pmd_young(*pmd)) + set_pmd_at(mm, addr, pmd, pmd_mkold(*pmd)); goto unlock; + } page = pmd_page(*pmd); /* diff --git a/mm/memory.c b/mm/memory.c index ddf20bd0c317..e5da50eca36f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3711,10 +3711,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) */ vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd); spin_lock(vmf->ptl); - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - goto out; - } + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) + goto unmap_out; /* * Make it present again, Depending on how arch implementes non @@ -3728,17 +3726,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); update_mmu_cache(vma, vmf->address, vmf->pte); + /* Only migrate if accessed twice */ + if (!pte_young(old_pte)) + goto unmap_out; + page = vm_normal_page(vma, vmf->address, pte); - if (!page) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + if (!page) + goto unmap_out; /* TODO: handle PTE-mapped THP */ - if (PageCompound(page)) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } + if (PageCompound(page)) + goto unmap_out; /* * Avoid grouping on RO pages in general. RO pages shouldn't hurt as @@ -3776,10 +3774,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) } else flags |= TNF_MIGRATE_FAIL; -out: if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, 1, flags); return 0; + +unmap_out: + pte_unmap_unlock(vmf->pte, vmf->ptl); +out: + return 0; } static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) diff --git a/mm/mprotect.c b/mm/mprotect.c index 0636f2e5e05b..fd8d8e813717 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -83,8 +83,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, int nid; /* Avoid TLB flush if possible */ - if (pte_protnone(oldpte)) + if (pte_protnone(oldpte)) { + /* + * PTE young bit is used to record + * whether the page is accessed in + * last scan period + */ + if (pte_young(oldpte)) + set_pte_at(vma->vm_mm, addr, pte, + pte_mkold(oldpte)); continue; + } page = vm_normal_page(vma, addr, oldpte); if (!page || PageKsm(page)) -- 2.23.0