On 05/09/2024 19:05, Hugh Dickins wrote:
> On Thu, 5 Sep 2024, Usama Arif wrote:
>> On 05/09/2024 09:46, Hugh Dickins wrote:
>>> On Fri, 30 Aug 2024, Usama Arif wrote:
>>>
>>>> From: Yu Zhao <yuzhao@xxxxxxxxxx>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
>>>> Tested-by: Shuang Zhai <zhais@xxxxxxxxxx>
>>>> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>>>
>>> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
>>> it is only an optimization, and unless a persuasive performance case
>>> can be made to extend it, it ought to go (perhaps revisited later).
>>>
>>
>> I am ok with patch 1 only being dropped. Patches 2-6 do not depend on it.
>>
>> It's an optimization, and the underused shrinker doesn't depend on it.
>> It's possible that the folio -> new_folio change below might fix it? But
>> if it doesn't, I can retry later to make this work, and resend it only
>> if it alone shows a significant performance improvement.
>>
>> Thanks a lot for debugging this, and sorry it caused an issue.
>>
>>
>>> The problem I kept hitting was that all my work, requiring compaction and
>>> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
>>> because nr_isolated_anon had grown high - and remained high even when the
>>> load had all been killed.
>>>
>>> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
>>> is the one to blame. I was intending to send this patch to "fix" it:
>>>
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>>>  		folio_clear_active(new_folio);
>>>  		folio_clear_unevictable(new_folio);
>>>  		list_del(&new_folio->lru);
>>> +		node_stat_sub_folio(folio, NR_ISOLATED_ANON +
>>> +				    folio_is_file_lru(folio));
>>
>> Maybe this should have been below? (Notice the folio -> new_folio)
>>
>> +		node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
>> +				    folio_is_file_lru(new_folio));
>
> Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
> and ought not to have done a blind copy and paste of those lines): thanks.
>
> However... it makes no difference. I gave yours a run, expecting a
> much smaller negative number, but actually it works out much the same:
> because, of course, by this point in the code "folio" is left pointing
> to a small folio, and is almost equivalent to new_folio for the
> node_stat calculations.
>
> (I say "almost" because I guess there's a chance that the page at
> folio got reused as part of a larger folio meanwhile, which would
> throw the stats off (if they weren't off already).)
>
> So, even with your fix to my fix, this code doesn't work.
> Whereas a revert of this 1/6 does work: nr_isolated_anon and
> nr_isolated_file come to 0 when the system is at rest, as expected
> (and as silence from vmstat_refresh confirms - /proc/vmstat itself
> presents negative stats as 0, in order to hide transient oddities).
>
> Hugh

Thanks for trying. I was hoping you had mTHPs enabled, in which case the
folio -> new_folio change would have fixed it. Happy for just patch 1 to
be dropped.
I can try to figure it out later, and send it if I can actually show
performance numbers for the fixed version on real-world cases.

Thanks,
Usama

>
>>
>>>  		if (!folio_batch_add(&free_folios, new_folio)) {
>>>  			mem_cgroup_uncharge_folios(&free_folios);
>>>  			free_unref_folios(&free_folios);
>>>
>>> And that ran nicely, until I terminated the run and did
>>> 	grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
>>> at the end: stat_refresh kindly left a pr_warn in dmesg to say
>>> 	nr_isolated_anon -334013737
>>>
>>> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
>>> know how many pages they isolated and adjusted the stats by, and undo
>>> that by the same number at the end; whereas other split_huge_pagers
>>> (migration?) decrement one by one as they go through the list afterwards.
>>>
>>> I've run out of time (I'm about to take a break): I gave up researching
>>> who needs what, and was already feeling this optimization does too much
>>> second-guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
>>> rather admits to that).
>>>
>>> And I don't think it's as simple as moving the node_stat_sub_folio()
>>> into 2/6 where the zero pte is substituted: that would probably handle
>>> the vast majority of cases, but aren't there others which pass the
>>> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
>>> or ones racily truncated now that the folio has been unlocked, for example?
>>>
>>> Hugh
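---

For readers following the folio_ref_freeze(new_folio, 2) discussion: the
freeze succeeds only when the refcount is exactly the expected value,
atomically replacing it with zero. Below is a minimal userspace sketch of
that rule (illustrative only, not kernel code: ref_freeze() and the
refcount values are stand-ins chosen for this example).

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Freeze *ref to 0 only if it currently equals 'expected'. */
static bool ref_freeze(atomic_int *ref, int expected)
{
	int old = expected;
	return atomic_compare_exchange_strong(ref, &old, 0);
}

int main(void)
{
	/* Zapped tail: only the ref inherited from isolating the head and
	 * the ref from lru_add_page_tail() remain, so freezing at 2
	 * succeeds and the page could be freed. */
	atomic_int zapped = 2;

	/* Still-mapped tail: page-table references keep the count above
	 * 2, so the freeze fails and the page must be kept. */
	atomic_int mapped = 3;

	printf("zapped tail: freeze %s\n",
	       ref_freeze(&zapped, 2) ? "succeeds" : "fails");
	printf("mapped tail: freeze %s\n",
	       ref_freeze(&mapped, 2) ? "succeeds" : "fails");
	return 0;
}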
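And a toy model of the nr_isolated_anon underflow described above,
assuming (per Hugh's reading, with the counter going up at isolation and
back down on completion) a bulk-accounting caller that undoes exactly the
number of pages it originally isolated: if the split path has already
subtracted the freed tails, those tails are subtracted twice. The caller
name and the numbers are made up for illustration.

#include <stdio.h>

static long nr_isolated_anon;

/* Bulk-accounting caller (reclaim-style): it remembers how many pages
 * it isolated and undoes exactly that many at the end, regardless of
 * what happened to the list in between. */
static void bulk_caller(long nr_taken, long tails_freed_in_split)
{
	nr_isolated_anon += nr_taken;             /* isolate nr_taken pages */
	nr_isolated_anon -= tails_freed_in_split; /* split path's extra sub */
	nr_isolated_anon -= nr_taken;             /* caller undoes all of it */
}

int main(void)
{
	/* Every tail freed during a split is subtracted twice, so the
	 * counter drifts further below zero with each batch. */
	bulk_caller(512, 3);
	bulk_caller(512, 7);
	printf("nr_isolated_anon = %ld\n", nr_isolated_anon); /* prints -10 */
	return 0;
}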