Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

> Now the anonymous page allocation already supports multi-size THP (mTHP),
> but numa balancing still prohibits mTHP migration even when it is an
> exclusive mapping, which is unreasonable. Thus let's support exclusive
> mTHP numa balancing first.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
> pages") skips shared CoW pages' NUMA page migration to avoid shared data
> segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") changed to use page_count()
> to avoid GUP pages migration, which will also skip mTHP numa scanning.
> Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP
> issue; although there is still a GUP race, that issue seems to have been
> resolved by commit 80d47f5de5e3. Meanwhile, use folio_estimated_sharers()
> to skip shared CoW pages, though it is not a precise sharers count. To
> check if the folio is shared, ideally we want to make sure every page is
> mapped to the same process, but doing that seems expensive, and using
> the estimated mapcount seems to work when running the autonuma benchmark.
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more susceptible
> to false-sharing issues, leading to pages ping-ponging back and forth during
> numa balancing, which is currently hard to resolve. Therefore, as a start to
> support mTHP numa balancing, only exclusive mappings are allowed to perform
> numa migration to avoid the false-sharing issues with large folios. Similarly,
> use the estimated mapcount to skip shared mappings, which seems to work
> in most cases (?), and we've already used folio_estimated_sharers() to skip
> shared mappings in migrate_misplaced_folio() for numa balancing, with no
> real complaints so far.

IIUC, folio_estimated_sharers() cannot identify multi-threaded applications.
If some mTHP is shared by multiple threads in one process, how should we
deal with that?

For example, I think that we should avoid migrating on the first fault for
mTHP in should_numa_migrate_memory().

More thoughts? Can we add a field in struct folio for mTHP to count hint
page faults from the same node? (A rough sketch of this idea follows the
quoted patch below.)

--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark
>
> Base without the patch:
> numa01
> 222.97
> numa01_THREAD_ALLOC
> 115.78
> numa02
> 13.04
> numa02_SMT
> 14.69
>
> Base with the patch:
> numa01
> 125.36
> numa01_THREAD_ALLOC
> 44.58
> numa02
> 9.22
> numa02_SMT
> 7.46
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@xxxxxxxxxxxxxxxxxxx/
>
> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> ---
> Changes from RFC v1:
>  - Add some performance data per Huang, Ying.
>  - Allow mTHP scanning per David Hildenbrand.
>  - Avoid shared mappings for numa balancing to avoid false sharing.
>  - Add more commit message.
> ---
>  mm/memory.c   | 9 +++++----
>  mm/mprotect.c | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	int last_cpupid;
>  	int target_nid;
>  	pte_t pte, old_pte;
> -	int flags = 0;
> +	int flags = 0, nr_pages = 0;
>
>  	/*
>  	 * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>
> -	/* TODO: handle PTE-mapped THP */
> -	if (folio_test_large(folio))
> +	/* Avoid large folio false sharing */
> +	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>  		goto out_map;
>
>  	/*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		flags |= TNF_SHARED;
>
>  	nid = folio_nid(folio);
> +	nr_pages = folio_nr_pages(folio);
>  	/*
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
>  out:
>  	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, 1, flags);
> +		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  		/* Also skip shared copy-on-write pages */
>  		if (is_cow_mapping(vma->vm_flags) &&
> -		    folio_ref_count(folio) != 1)
> +		    (folio_maybe_dma_pinned(folio) ||
> +		     folio_estimated_sharers(folio) > 1))
>  			continue;
>
>  		/*
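
To make the struct folio counter idea a little more concrete, here is a
rough, untested sketch of deferring mTHP migration until several hint
faults have arrived from the same remote node. The numa_hint_nid and
numa_hint_faults fields and the MTHP_MIGRATE_FAULT_THRESHOLD tunable are
invented for illustration only (struct folio has no such members today);
a real patch would need to find storage for them and make the updates
race-safe:

/*
 * Sketch only: numa_hint_nid / numa_hint_faults are hypothetical
 * per-folio fields that do not exist in struct folio today, and
 * MTHP_MIGRATE_FAULT_THRESHOLD is an invented tunable.
 */
#define MTHP_MIGRATE_FAULT_THRESHOLD	2

static bool mthp_migration_allowed(struct folio *folio, int dst_nid)
{
	/* Small folios keep the current migrate-on-first-fault policy. */
	if (!folio_test_large(folio))
		return true;

	/* Hint faults moved to a different node: restart counting. */
	if (folio->numa_hint_nid != dst_nid) {
		folio->numa_hint_nid = dst_nid;
		folio->numa_hint_faults = 0;
	}

	/*
	 * Only migrate a large folio after several consecutive hint
	 * faults from the same node, so that a single fault (possibly
	 * false sharing between threads) does not ping-pong the whole
	 * folio across nodes.
	 */
	return ++folio->numa_hint_faults >= MTHP_MIGRATE_FAULT_THRESHOLD;
}

should_numa_migrate_memory() could then bail out early for large folios
until such a helper reports a stable access pattern, which would also
cover the avoid-migrating-on-the-first-fault point above.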