On Tue, Mar 1, 2022 at 12:54 AM Huang Ying <ying.huang@xxxxxxxxx> wrote:
>
> If the NUMA balancing isn't used to optimize the page placement among
> sockets but only among memory types, the hot pages in the fast memory
> node couldn't be migrated (promoted) to anywhere. So it's unnecessary
> to scan the pages in the fast memory node via changing their PTE/PMD
> mapping to be PROT_NONE. So that the page faults could be avoided
> too.
>
> In the test, if only the memory tiering NUMA balancing mode is enabled, the
> number of the NUMA balancing hint faults for the DRAM node is reduced to
> almost 0 with the patch. While the benchmark score doesn't change
> visibly.

Reviewed-by: Yang Shi <shy828301@xxxxxxxxx>

>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Tested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Yang Shi <shy828301@xxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: Wei Xu <weixugc@xxxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: zhongjiang-ali <zhongjiang-ali@xxxxxxxxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: linux-mm@xxxxxxxxx
> ---
>  mm/huge_memory.c | 30 +++++++++++++++++++++---------
>  mm/mprotect.c    | 13 ++++++++++++-
>  2 files changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..9ce126cb0cfd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -34,6 +34,7 @@
>  #include <linux/oom.h>
>  #include <linux/numa.h>
>  #include <linux/page_owner.h>
> +#include <linux/sched/sysctl.h>
>
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1766,17 +1767,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>         }
>  #endif
>
> -       /*
> -        * Avoid trapping faults against the zero page. The read-only
> -        * data is likely to be read-cached on the local CPU and
> -        * local/remote hits to the zero page are not interesting.
> -        */
> -       if (prot_numa && is_huge_zero_pmd(*pmd))
> -               goto unlock;
> +       if (prot_numa) {
> +               struct page *page;
> +               /*
> +                * Avoid trapping faults against the zero page. The read-only
> +                * data is likely to be read-cached on the local CPU and
> +                * local/remote hits to the zero page are not interesting.
> +                */
> +               if (is_huge_zero_pmd(*pmd))
> +                       goto unlock;
>
> -       if (prot_numa && pmd_protnone(*pmd))
> -               goto unlock;
> +               if (pmd_protnone(*pmd))
> +                       goto unlock;
>
> +               page = pmd_page(*pmd);
> +               /*
> +                * Skip scanning top tier node if normal numa
> +                * balancing is disabled
> +                */
> +               if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +                   node_is_toptier(page_to_nid(page)))
> +                       goto unlock;
> +       }
>         /*
>          * In case prot_numa, we are under mmap_read_lock(mm). It's critical
>          * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0138dfcdb1d8..2fe03e695c81 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -29,6 +29,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm_inline.h>
>  #include <linux/pgtable.h>
> +#include <linux/sched/sysctl.h>
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>
> @@ -83,6 +84,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                  */
>                 if (prot_numa) {
>                         struct page *page;
> +                       int nid;
>
>                         /* Avoid TLB flush if possible */
>                         if (pte_protnone(oldpte))
> @@ -109,7 +111,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                          * Don't mess with PTEs if page is already on the node
>                          * a single-threaded process is running on.
>                          */
> -                       if (target_node == page_to_nid(page))
> +                       nid = page_to_nid(page);
> +                       if (target_node == nid)
> +                               continue;
> +
> +                       /*
> +                        * Skip scanning top tier node if normal numa
> +                        * balancing is disabled
> +                        */
> +                       if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +                           node_is_toptier(nid))
>                                 continue;
>                 }
>
> --
> 2.30.2
>
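
For anyone following the thread, the check added to both change_huge_pmd() and change_pte_range() appears to reduce to a single predicate. Below is a minimal userspace sketch of that predicate, not kernel code. The helper name should_skip_toptier_scan() is hypothetical and does not appear in the patch, the mode-bit values are defined locally as assumptions for illustration, and the boolean argument stands in for node_is_toptier(page_to_nid(page)).

```c
#include <stdbool.h>

/*
 * Local stand-ins for the kernel's mode bits, named as in the patch.
 * The numeric values are assumptions made for this sketch.
 */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/*
 * Hypothetical helper (not in the patch): returns true when the scanner
 * should leave a page's mapping alone, i.e. normal NUMA balancing is off
 * and the page already sits on a top-tier (fast memory) node.
 *
 * 'mode' models sysctl_numa_balancing_mode.
 * 'toptier' models node_is_toptier(page_to_nid(page)).
 */
static bool should_skip_toptier_scan(unsigned int mode, bool toptier)
{
	return !(mode & NUMA_BALANCING_NORMAL) && toptier;
}
```

Under these assumptions, a top-tier page is skipped only when the NORMAL bit is clear. With NORMAL set, alone or together with MEMORY_TIERING, the page is still scanned so it can be balanced across sockets. Pages on lower-tier nodes are never skipped by this check, which keeps them eligible for promotion.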