Zi Yan <ziy@xxxxxxxxxx> writes:

> On 22 Jul 2024, at 10:01, Zi Yan wrote:
>
>> On 21 Jul 2024, at 21:47, Huang, Ying wrote:
>>
>>> Zi Yan <ziy@xxxxxxxxxx> writes:
>>>
>>>> On 18 Jul 2024, at 4:36, Huang, Ying wrote:
>>>>
>>>>> Zi Yan <zi.yan@xxxxxxxx> writes:
>>>>>
>>>>>> From: Zi Yan <ziy@xxxxxxxxxx>
>>>>>>
>>>>>> do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To
>>>>>> reduce redundancy, move common code to numa_migrate_prep() and rename
>>>>>> the function to numa_migrate_check() to reflect its functionality.
>>>>>>
>>>>>> There is some code difference between do_numa_page() and
>>>>>> do_huge_pmd_numa_page() before the code move:
>>>>>>
>>>>>> 1. do_huge_pmd_numa_page() did not check shared folios to set TNF_SHARED.
>>>>>> 2. do_huge_pmd_numa_page() did not check and skip zone device folios.
>>>>>>
>>>>>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>>>>>> ---
>>>>>>  mm/huge_memory.c | 28 ++++++-----------
>>>>>>  mm/internal.h    |  5 +--
>>>>>>  mm/memory.c      | 81 +++++++++++++++++++++++-------------------------
>>>>>>  3 files changed, 52 insertions(+), 62 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 8c11d6da4b36..66d67d13e0dc 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	pmd_t pmd;
>>>>>>  	struct folio *folio;
>>>>>>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>>>> -	int nid = NUMA_NO_NODE;
>>>>>> -	int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>> +	int target_nid = NUMA_NO_NODE;
>>>>>> +	int last_cpupid = (-1 & LAST_CPUPID_MASK);
>>>>>>  	bool writable = false;
>>>>>> -	int flags = 0;
>>>>>> +	int flags = 0, nr_pages;
>>>>>>
>>>>>>  	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1693,21 +1693,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  		writable = true;
>>>>>>
>>>>>>  	folio = vm_normal_folio_pmd(vma, haddr, pmd);
>>>>>> -	if (!folio)
>>>>>> +	if (!folio || folio_is_zone_device(folio))
>>>>>
>>>>> This change appears unrelated. Can we put it in a separate patch?
>>>>>
>>>>> IIUC, this isn't necessary even in do_numa_page(), because
>>>>> folio_is_zone_device() has already been checked in change_pte_range().
>>>>> But it doesn't hurt, either.
>>>>>
>>>>>>  		goto out_map;
>>>>>>
>>>>>> -	/* See similar comment in do_numa_page for explanation */
>>>>>> -	if (!writable)
>>>>>> -		flags |= TNF_NO_GROUP;
>>>>>> +	nr_pages = folio_nr_pages(folio);
>>>>>>
>>>>>> -	nid = folio_nid(folio);
>>>>>> -	/*
>>>>>> -	 * For memory tiering mode, cpupid of slow memory page is used
>>>>>> -	 * to record page access time. So use default value.
>>>>>> -	 */
>>>>>> -	if (folio_has_cpupid(folio))
>>>>>> -		last_cpupid = folio_last_cpupid(folio);
>>>>>> -	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
>>>>>> +	target_nid = numa_migrate_check(folio, vmf, haddr, writable,
>>>>>> +					&flags, &last_cpupid);
>>>>>>  	if (target_nid == NUMA_NO_NODE)
>>>>>>  		goto out_map;
>>>>>>  	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>>>>>> @@ -1720,8 +1712,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>
>>>>>>  	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
>>>>>>  		flags |= TNF_MIGRATED;
>>>>>> -		nid = target_nid;
>>>>>>  	} else {
>>>>>> +		target_nid = NUMA_NO_NODE;
>>>>>>  		flags |= TNF_MIGRATE_FAIL;
>>>>>>  		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>>>>>  		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
>>>>>> @@ -1732,8 +1724,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>>>>  	}
>>>>>>
>>>>>>  out:
>>>>>> -	if (nid != NUMA_NO_NODE)
>>>>>> -		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
>>>>>> +	if (target_nid != NUMA_NO_NODE)
>>>>>> +		task_numa_fault(last_cpupid, target_nid, nr_pages, flags);
>>>>>
>>>>> This appears to be a behavior change. IIUC, there are 2 possible issues.
>>>>>
>>>>> 1) If migrate_misplaced_folio() fails, folio_nid() should be used as
>>>>> nid. "target_nid" is a confusing variable name here, because it is
>>>>> really folio_nid() that is needed.
>>>>>
>>>>> 2) If !pmd_same(), task_numa_fault() should be skipped. The original
>>>>> code is buggy.
>>>>>
>>>>> Similar issues exist for do_numa_page().
>>>>>
>>>>> If my understanding is correct, we should implement a separate patch
>>>>> to fix 2) above, and that may need to be backported.
>>>>
>>>> Hmm, the original code seems OK after I checked the implementation.
>>>> There are two possible !pte_same()/!pmd_same() locations:
>>>> 1) at the beginning of do_numa_page() and do_huge_pmd_numa_page(), where
>>>> the faulted PTE/PMD changed before the folio could be checked;
>>>> task_numa_fault() should not be called.
>>>
>>> Yes.
>>>
>>>> 2) when migrate_misplaced_folio() failed and the PTE/PMD changed, but the
>>>> folio had already been determined and checked; task_numa_fault() should
>>>> be called even if !pte_same()/!pmd_same().
>>>
>>> IIUC, if !pte_same()/!pmd_same(), the fault has been processed on
>>> another CPU. For example, do_numa_page()/do_huge_pmd_numa_page() has
>>> been called on another CPU and task_numa_fault() has been called for the
>>> PTE/PMD already.
>>
>> Hmm, this behavior dates back at least to 2015, to
>> commit 074c238177a7 ("mm: numa: slow PTE scan rate if migration failures occur").
>> So cc Mel.
>>
>> The code is
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/memory.c?id=074c238177a75f5e79af3b2cb6a84e54823ef950#n3102.
>> I have not checked older commits.
>>
>> I wonder how far back we should trace.
>
> OK, I found the commit where the task_numa_fault() policy was settled:
> 8191acbd30c7 ("mm: numa: Sanitize task_numa_fault() callsites").
>
> It says: "So modify all three sites to always account; we did after all
> receive the fault; and always account to where the page is after
> migration, regardless of success.", where the three call sites were
> do_huge_pmd_numa_page(), do_numa_page(), and do_pmd_numa_page().
>
> The current code still follows what that commit log describes.

Per my understanding, the issue was introduced in commit b99a342d4f11
("NUMA balancing: reduce TLB flush via delaying mapping on hint page
fault"). Before that, the PTE was restored before migration, so
task_numa_fault() should be called on migration failure too. After that,
the PTE is restored only after a migration failure; if the PTE has been
changed by someone else in the meantime, that someone should have called
task_numa_fault() already (if necessary), so we shouldn't call it again.
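Roughly something like the below for the PMD path (untested, only to
illustrate the idea, and keeping the old "nid" local variable instead of
reusing target_nid) would keep both properties: account the fault to the
folio's current node when migration fails, and skip task_numa_fault()
entirely when the PMD has already been changed under us:

	nid = folio_nid(folio);		/* where the folio is right now */

	if (!migrate_misplaced_folio(folio, vma, target_nid)) {
		flags |= TNF_MIGRATED;
		nid = target_nid;	/* migrated, account to the new node */
	} else {
		flags |= TNF_MIGRATE_FAIL;
		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
			spin_unlock(vmf->ptl);
			/*
			 * Someone else changed the PMD under us and, if
			 * needed, has accounted the fault already, so do
			 * not call task_numa_fault() a second time.
			 */
			return 0;
		}
		goto out_map;
	}

out:
	if (nid != NUMA_NO_NODE)
		task_numa_fault(last_cpupid, nid, nr_pages, flags);
	return 0;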

--
Best Regards,
Huang, Ying