On 06/03/2024 16:09, Matthew Wilcox wrote:
> On Wed, Mar 06, 2024 at 01:42:06PM +0000, Ryan Roberts wrote:
>> When running some swap tests with this change (which is in mm-stable)
>> present, I see BadThings(TM). Usually I see a "bad page state"
>> followed by a delay of a few seconds, followed by an oops or NULL
>> pointer deref. Bisect points to this change, and if I revert it,
>> the problem goes away.
>
> That oops is really messed up ;-( We've clearly got two CPUs oopsing
> at the same time and it's all interleaved. That said, I can pick some
> nuggets out of it.
>
>> [   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
>> [   76.240196] kernel BUG at include/linux/mm.h:1120!
>
> These are the two different BUGs being called simultaneously ...
>
> The first one is bad_page() in page_alloc.c and the second is this
> check in put_page_testzero():
>
>	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>
> I'm sure it's significant that both of these are the same page (pfn
> 2554a0). Feels like we have two CPUs calling folio_put() at the same
> time, and one of them underflows. It probably doesn't matter which
> call trace ends up in bad_page() and which in put_page_testzero().
>
> One of them is coming from deferred_split_scan(), which is weird
> because we can see the folio_try_get() earlier in the function. So
> whatever this folio was, we found it on the deferred split list, got
> its refcount, moved it to the local list, then either failed to get
> the lock, or successfully got the lock, split it, unlocked it and
> put it.
>
> (I can see this was invoked from page fault -> memcg shrinking.
> That's probably irrelevant, but it explains some of the functions in
> the backtrace.)
>
> The other call trace comes from migrate_folio_done(), where we're
> putting the _source_ folio. That was called from
> migrate_pages_batch(), which was called from kcompactd.
>
> Um. Where do we handle the deferred list in the migration code?
>
> I've also tried looking at this from a different angle -- what is it
> about this commit that produces this problem? It's a fairly small
> commit:
>
> -	if (folio_test_large(folio)) {
> +	/* hugetlb has its own memcg */
> +	if (folio_test_hugetlb(folio)) {
> 		if (lruvec) {
> 			unlock_page_lruvec_irqrestore(lruvec, flags);
> 			lruvec = NULL;
> 		}
> -		__folio_put_large(folio);
> +		free_huge_folio(folio);
>
> So all that's changed is that large non-hugetlb folios no longer call
> __folio_put_large(). As a reminder, that function does:
>
>	if (!folio_test_hugetlb(folio))
>		page_cache_release(folio);
>	destroy_large_folio(folio);
>
> and destroy_large_folio() does:
>
>	if (folio_test_large_rmappable(folio))
>		folio_undo_large_rmappable(folio);
>
>	mem_cgroup_uncharge(folio);
>	free_the_page(&folio->page, folio_order(folio));
>
> So after my patch, instead of calling (in order):
>
>	page_cache_release(folio);
>	folio_undo_large_rmappable(folio);
>	mem_cgroup_uncharge(folio);
>	free_unref_page()
>
> it calls:
>
>	__page_cache_release(folio, &lruvec, &flags);
>	mem_cgroup_uncharge_folios()
>	folio_undo_large_rmappable(folio);

I was just looking at this again, and something pops out...

You have swapped the order of folio_undo_large_rmappable() and
mem_cgroup_uncharge(). But folio_undo_large_rmappable() calls
get_deferred_split_queue(), which takes the split queue from
folio_memcg(folio) if the folio is charged to a cgroup, and falls back
to the pgdat's queue otherwise. If you are now calling
mem_cgroup_uncharge_folios() first, will that remove the folio from
the cgroup? Then we would be operating on the wrong list?
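For reference, get_deferred_split_queue() is roughly this shape (my
sketch of the CONFIG_MEMCG version from memory -- not checked against
mm-stable, so the details may be off):

	static struct deferred_split *
	get_deferred_split_queue(struct folio *folio)
	{
		struct mem_cgroup *memcg = folio_memcg(folio);
		struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

		/* Prefer the per-memcg queue; fall back to per-node */
		if (memcg)
			return &memcg->deferred_split_queue;
		else
			return &pgdat->deferred_split_queue;
	}

So if uncharging has already cleared the folio's memcg binding by the
time we get here, folio_memcg() would return NULL and we would take
the pgdat queue's lock, while the folio may still be linked into the
memcg's queue.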
(just a guess based on the name of the function...)

> So have I simply widened the window for this race, whatever it is
> exactly? Something involving mis-handling of the deferred list?
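For anyone following along, the deferred_split_scan() flow described
above goes roughly like this (a heavily trimmed sketch from my reading
of mm/huge_memory.c; the real function has more bookkeeping):

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	/* Pin each folio so it can't be freed under us and move it
	 * to a local list. */
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
				 _deferred_list) {
		if (folio_try_get(folio)) {
			list_move(&folio->_deferred_list, &list);
		} else {
			/* We lost the race with folio_put() */
			list_del_init(&folio->_deferred_list);
			ds_queue->split_queue_len--;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (!folio_trylock(folio))
			goto next;
		/* split_folio() takes the folio off the list on
		 * success */
		if (!split_folio(folio))
			split++;
		folio_unlock(folio);
next:
		folio_put(folio);	/* drop the ref taken above */
	}

All of this assumes a folio is only ever added to and removed from one
queue, under that queue's lock; if the uncharge reordering lets us
pick the wrong queue, that assumption breaks.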