Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

> Now the anonymous page allocation already supports multi-size THP (mTHP),
> but numa balancing still prohibits mTHP migration even when it is an
> exclusive mapping, which is unreasonable. Thus let's support exclusive
> mTHP numa balancing first.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
> pages") skips shared CoW pages' NUMA page migration to avoid shared data
> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") changed to use page_count()
> to avoid GUP pages migration, which will also skip mTHP numa scanning.
> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
> issue; although there is still a GUP race, that issue seems to have been
> resolved by commit 80d47f5de5e3. Meanwhile, use folio_estimated_sharers()
> to skip shared CoW pages, though it is not a precise sharers count. To
> check if the folio is shared, ideally we want to make sure every page is
> mapped to the same process, but doing that seems expensive, and using
> the estimated mapcount seems to work when running the autonuma benchmark.
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more susceptible
> to false-sharing issues, leading to pages ping-ponging back and forth during
> numa balancing, which is currently hard to resolve. Therefore, as a start to
> support mTHP numa balancing, only exclusive mappings are allowed to perform
> numa migration to avoid the false-sharing issues with large folios. Similarly,
> use the estimated mapcount to skip shared mappings, which seems to work
> in most cases (?), and we've already used folio_estimated_sharers() to skip
> shared mappings in migrate_misplaced_folio() for numa balancing, with no
> real complaints so far.

IIUC, folio_estimated_sharers() cannot identify multi-threaded applications.
If some mTHP is shared by multiple threads in one process, how should we
deal with that?

For example, I think that we should avoid migrating on the first fault for
mTHP in should_numa_migrate_memory().

More thoughts? Can we add a field in struct folio for mTHP to count hint
page faults from the same node? (A rough sketch of this idea follows the
quoted patch below.)

--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark
>
> Base without the patch:
> numa01
> 222.97
> numa01_THREAD_ALLOC
> 115.78
> numa02
> 13.04
> numa02_SMT
> 14.69
>
> Base with the patch:
> numa01
> 125.36
> numa01_THREAD_ALLOC
> 44.58
> numa02
> 9.22
> numa02_SMT
> 7.46
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@xxxxxxxxxxxxxxxxxxx/
>
> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> ---
> Changes from RFC v1:
>  - Add some performance data per Huang, Ying.
>  - Allow mTHP scanning per David Hildenbrand.
>  - Avoid shared mappings for numa balancing to avoid false sharing.
>  - Add more commit message.
> ---
>  mm/memory.c   | 9 +++++----
>  mm/mprotect.c | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	int last_cpupid;
>  	int target_nid;
>  	pte_t pte, old_pte;
> -	int flags = 0;
> +	int flags = 0, nr_pages = 0;
>
>  	/*
>  	 * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>
> -	/* TODO: handle PTE-mapped THP */
> -	if (folio_test_large(folio))
> +	/* Avoid large folio false sharing */
> +	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>  		goto out_map;
>
>  	/*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		flags |= TNF_SHARED;
>
>  	nid = folio_nid(folio);
> +	nr_pages = folio_nr_pages(folio);
>  	/*
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
>  out:
>  	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, 1, flags);
> +		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  		/* Also skip shared copy-on-write pages */
>  		if (is_cow_mapping(vma->vm_flags) &&
> -		    folio_ref_count(folio) != 1)
> +		    (folio_maybe_dma_pinned(folio) ||
> +		     folio_estimated_sharers(folio) > 1))
>  			continue;
>
>  		/*
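
To make the struct folio counter idea a little more concrete, here is a
rough, untested sketch of deferring mTHP migration until several hint
faults have arrived from the same remote node. The numa_hint_nid and
numa_hint_faults fields and the MTHP_MIGRATE_FAULT_THRESHOLD tunable are
invented for illustration only (struct folio has no such members today);
a real patch would need to find storage for them and make the updates
race-safe:

/*
 * Sketch only: numa_hint_nid / numa_hint_faults are hypothetical
 * per-folio fields that do not exist in struct folio today, and
 * MTHP_MIGRATE_FAULT_THRESHOLD is an invented tunable.
 */
#define MTHP_MIGRATE_FAULT_THRESHOLD	2

static bool mthp_migration_allowed(struct folio *folio, int dst_nid)
{
	/* Small folios keep the current migrate-on-first-fault policy. */
	if (!folio_test_large(folio))
		return true;

	/* Hint faults moved to a different node: restart counting. */
	if (folio->numa_hint_nid != dst_nid) {
		folio->numa_hint_nid = dst_nid;
		folio->numa_hint_faults = 0;
	}

	/*
	 * Only migrate a large folio after several consecutive hint
	 * faults from the same node, so that a single fault (possibly
	 * false sharing between threads) does not ping-pong the whole
	 * folio across nodes.
	 */
	return ++folio->numa_hint_faults >= MTHP_MIGRATE_FAULT_THRESHOLD;
}

should_numa_migrate_memory() could then bail out early for large folios
until such a helper reports a stable access pattern, which would also
cover the avoid-migrating-on-the-first-fault point above.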