On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
> When running some swap tests with this change (which is in mm-stable)
> present, I see BadThings(TM).  Usually I see a "bad page state"
> followed by a delay of a few seconds, followed by an oops or NULL
> pointer deref.  Bisect points to this change, and if I revert it,
> the problem goes away.

That oops is really messed up ;-(  We've clearly got two CPUs oopsing
at the same time and it's all interleaved.  That said, I can pick some
nuggets out of it.

> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
> [   76.240196] kernel BUG at include/linux/mm.h:1120!

These are the two different BUGs being called simultaneously ... the
first one is bad_page() in page_alloc.c and the second is
put_page_testzero():

	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);

I'm sure it's significant that both of these involve the same page
(pfn 2554a0).  It feels like we have two CPUs calling folio_put() at
the same time, and one of them underflows.  It probably doesn't matter
which call trace ends up in bad_page() and which in put_page_testzero().

One of them is coming from deferred_split_scan(), which is weird because
we can see the folio_try_get() earlier in the function.  So whatever
this folio was, we found it on the deferred split list, got its
refcount, moved it to the local list, and then either failed to get the
lock, or successfully got the lock, split it, unlocked it and put it.
(I can see this was invoked from a page fault -> memcg shrinking.
That's probably irrelevant, but it explains some of the functions in
the backtrace.)

The other call trace comes from migrate_folio_done(), where we're
putting the _source_ folio.  That was called from migrate_pages_batch(),
which was called from kcompactd.

Um.  Where do we handle the deferred list in the migration code?

I've also tried looking at this from a different angle -- what is it
about this commit that produces this problem?
It's a fairly small commit:

-	if (folio_test_large(folio)) {
+	/* hugetlb has its own memcg */
+	if (folio_test_hugetlb(folio)) {
 		if (lruvec) {
 			unlock_page_lruvec_irqrestore(lruvec, flags);
 			lruvec = NULL;
 		}
-		__folio_put_large(folio);
+		free_huge_folio(folio);

So all that's changed is that large non-hugetlb folios no longer call
__folio_put_large().  As a reminder, that function does:

	if (!folio_test_hugetlb(folio))
		page_cache_release(folio);
	destroy_large_folio(folio);

and destroy_large_folio() does:

	if (folio_test_large_rmappable(folio))
		folio_undo_large_rmappable(folio);
	mem_cgroup_uncharge(folio);
	free_the_page(&folio->page, folio_order(folio));

So after my patch, instead of calling (in order):

	page_cache_release(folio);
	folio_undo_large_rmappable(folio);
	mem_cgroup_uncharge(folio);
	free_unref_page();

it calls:

	__page_cache_release(folio, &lruvec, &flags);
	mem_cgroup_uncharge_folios();
	folio_undo_large_rmappable(folio);

So have I simply widened the window for this race, whatever it is
exactly?  Something involving mis-handling of the deferred list?