On 07/03/2024 14:05, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:50:09PM +0800, Yin, Fengwei wrote:
>> On 3/7/2024 4:56 PM, wrote:
>>> I just want to make sure I've understood correctly: CPU1's folio_put()
>>> is not the last reference, and it keeps iterating through the local
>>> list. Then CPU2 does the final folio_put(), which causes list_del_init()
>>> to modify the local list concurrently with CPU1's iteration, so CPU1
>>> probably goes into the weeds?
>>
>> My understanding is that this cannot corrupt folio->deferred_list,
>> because this folio has already been iterated.
>
> I am not convinced about that at all. It's possible this isn't the only
> problem, but deleting something from a list without holding (the correct)
> lock is something you have to think incredibly hard about to get right.
> I didn't bother going any deeper into the analysis once I spotted the
> locking problem, but the burden of proof is very much on you to show
> that this is not a bug!
>
>> But I did see another strange thing:
>>
>> [ 76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000
>>              index:0xffffbd0a0 pfn:0x2554a0
>> [ 76.270483] note: kcompactd0[62] exited with preempt_count 1
>> [ 76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>>
>> This large folio has order 0? Maybe folio->_flags_1 was screwed up?
>>
>> In free_unref_folios(), there is code like the following:
>>
>>         if (order > 0 && folio_test_large_rmappable(folio))
>>                 folio_undo_large_rmappable(folio);
>>
>> But in destroy_large_folio():
>>
>>         if (folio_test_large_rmappable(folio))
>>                 folio_undo_large_rmappable(folio);
>>
>> Could this be connected to the folio that still has a zero refcount
>> while on the deferred list, with Matthew's patch?
>>
>> It looks like the folio order was cleared unexpectedly somewhere.

I think there could be something to this...

I have a setup where, when running with Matthew's deferred split fix AND
with commit 31b2ff82aefb "mm: handle large folios in free_unref_folios()"
REVERTED, everything works as expected. And at the end, I have the
expected amount of memory free (as seen in meminfo and buddyinfo).

But if I run with only the deferred split fix and DO NOT revert the other
change, everything grinds to a halt when swapping 2M pages. Sometimes
with RCU stalls where I can't even interact on the serial port; sometimes
(more usually) everything just gets stuck trying to reclaim and allocate
memory. And when I kill the jobs, I still have barely any memory in the
system - about 10% of what I would expect.

So is it possible that after commit 31b2ff82aefb "mm: handle large folios
in free_unref_folios()", when freeing a 2M folio back to the buddy, we
are actually only telling it about the first 4K page? So we end up
leaking the rest?

> No, we intentionally clear it:
>
>         free_unref_folios -> free_unref_page_prepare ->
>                 free_pages_prepare -> page[1].flags &= ~PAGE_FLAGS_SECOND;
>
> PAGE_FLAGS_SECOND includes the order, which is why we have to save it
> away in folio->private so that we know what it is in the second loop.
> So it's always been cleared by the time we call free_page_is_bad().
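
To make the race discussed at the top of the thread concrete, here is a
minimal sketch in kernel-style C of the pattern being debated. The
function names are hypothetical and this is not the actual
mm/huge_memory.c code: one CPU splices the deferred split queue onto a
local list and walks it after dropping the lock, while the final
folio_put() on another CPU runs list_del_init() on an entry of that same
local list without holding any lock the walker respects.

        /*
         * Sketch only: hypothetical functions illustrating the racy
         * pattern, not the real kernel code.
         */
        static void cpu1_scan_deferred(struct deferred_split *ds_queue)
        {
                struct folio *folio, *next;
                unsigned long flags;
                LIST_HEAD(list);

                /* Move the queue to a local list under the lock... */
                spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                list_splice_init(&ds_queue->split_queue, &list);
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

                /* ...then walk it with no lock held at all. */
                list_for_each_entry_safe(folio, next, &list, _deferred_list) {
                        /* try to split, then drop this CPU's reference */
                        folio_put(folio);
                }
        }

        static void cpu2_final_put(struct folio *folio)
        {
                /*
                 * Final reference going away: unlink from whatever list
                 * the folio sits on.  If that is CPU1's local list above,
                 * this list_del_init() rewrites next/prev pointers that
                 * the walker may be dereferencing at this very moment.
                 */
                if (!list_empty(&folio->_deferred_list))
                        list_del_init(&folio->_deferred_list);
        }

Once both CPUs touch the same list_head nodes, the walker can load a
next pointer mid-update and wander off the list, which is the "goes into
the weeds" outcome described in the first quoted paragraph.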
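
The two-loop shape Matthew describes at the end - the order read before
free_pages_prepare() clears PAGE_FLAGS_SECOND in page[1].flags, then
stashed in folio->private for the second loop - looks roughly like the
sketch below. This is a paraphrase under assumptions, using the
hypothetical helpers free_pages_prepare_sketch() and
free_one_folio_at_order(); it is not the verbatim mm/page_alloc.c code.

        /*
         * Sketch only: paraphrase of the two-loop structure with
         * hypothetical helpers, not the real free_unref_folios().
         */
        static void free_unref_folios_sketch(struct folio_batch *folios)
        {
                unsigned int i;

                /* First loop: prepare folios while the order is legible. */
                for (i = 0; i < folios->nr; i++) {
                        struct folio *folio = folios->folios[i];
                        unsigned int order = folio_order(folio);

                        if (order > 0 && folio_test_large_rmappable(folio))
                                folio_undo_large_rmappable(folio);
                        /*
                         * Stands in for free_unref_page_prepare() ->
                         * free_pages_prepare(), which clears
                         * PAGE_FLAGS_SECOND in page[1].flags, so
                         * folio_order() reads 0 from here on.
                         */
                        if (!free_pages_prepare_sketch(&folio->page, order))
                                continue;
                        /* Remember the real order for the second loop. */
                        folio->private = (void *)(unsigned long)order;
                }

                /* Second loop: hand each folio to the buddy allocator. */
                for (i = 0; i < folios->nr; i++) {
                        struct folio *folio = folios->folios[i];
                        unsigned int order = (unsigned long)folio->private;

                        folio->private = NULL;
                        /* Tell the buddy about all 1 << order pages. */
                        free_one_folio_at_order(folio, order);
                }
        }

If the second loop ever saw order 0 instead of the saved value, a 2M
folio would be handed to the buddy as a single 4K page and the other 511
pages would leak - exactly the failure mode hypothesised in the unquoted
text above.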