On 07/03/2024 15:24, Ryan Roberts wrote:
> On 07/03/2024 14:05, Matthew Wilcox wrote:
>> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/7/2024 4:56 PM, wrote:
>>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>>> is not the last reference, and it keeps iterating through the local
>>>> list. Then CPU2 does the final folio_put() which causes list_del_init()
>>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>>> probably goes into the weeds?
>>>
>>> My understanding is this can not corrupt the folio->deferred_list as
>>> this folio was iterated already.
>>
>> I am not convinced about that at all.  It's possible this isn't the only
>> problem, but deleting something from a list without holding (the correct)
>> lock is something you have to think incredibly hard about to get right.
>> I didn't bother going any deeper into the analysis once I spotted the
>> locking problem, but the proof is very much on you that this is not a bug!
>>
>>> But I did see other strange thing:
>>> [   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>>> index:0xffffbd0a0 pfn:0x2554a0
>>> [   76.270483] note: kcompactd0[62] exited with preempt_count 1
>>> [   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>>
>>> This large folio has order 0? Maybe folio->_flags_1 was screwed?
>>>
>>> In free_unref_folios(), there is code like following:
>>>                 if (order > 0 && folio_test_large_rmappable(folio))
>>>                         folio_undo_large_rmappable(folio);
>>>
>>> But with destroy_large_folio():
>>>         if (folio_test_large_rmappable(folio))
>>>                 folio_undo_large_rmappable(folio);
>>>
>>> Can it connect to the folio has zero refcount still in deferred list
>>> with Matthew's patch?
>>>
>>> Looks like folio order was cleared unexpected somewhere.
>
> I think there could be something to this...
>
> I have a setup where, when running with Matthew's deferred split fix AND with
> commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()" REVERTED,
> everything works as expected. And at the end, I have the expected amount of
> memory free (seen in meminfo and buddyinfo).
>
> But if I run only with the deferred split fix and DO NOT revert the other
> change, everything grinds to a halt when swapping 2M pages. Sometimes with RCU
> stalls where I can't even interact on the serial port. Sometimes (more usually)
> everything just gets stuck trying to reclaim and allocate memory. And when I
> kill the jobs, I still have barely any memory in the system - about 10% of
> what I would expect.
>
> So is it possible that after commit 31b2ff82aefb "mm: handle large folios in
> free_unref_folios()", when freeing a 2M folio back to the buddy, we are
> actually only telling it about the first 4K page? So we end up leaking the
> rest?

I notice that before the commit, large folios were uncharged with
__mem_cgroup_uncharge() and now they use __mem_cgroup_uncharge_folios(). The
former has an upfront check:

	if (!folio_memcg(folio))
		return;

I'm not exactly sure what that's checking, but could the fact that this check
is missing after the change cause things to go wonky?

>
>>
>> No, we intentionally clear it:
>>
>> free_unref_folios -> free_unref_page_prepare -> free_pages_prepare ->
>> page[1].flags &= ~PAGE_FLAGS_SECOND;
>>
>> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
>> away in folio->private so that we know what it is in the second loop.
>> So it's always been cleared by the time we call free_page_is_bad().