On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
> > I think split_queue_len is getting out of sync with the number of items on the
> > queue? We only decrement it if we lost the race with folio_put(). But we are
> > unconditionally taking folios off the list here. So we are definitely out of
> > sync until we take the lock again below. But we only put folios back on the list
> > that failed to split. A successful split used to decrement this variable
> > (because the folio was on _a_ list). But now it doesn't. So we are always
> > mismatched after the first failed split?
>
> Oops, I meant first *successful* split.

Agreed, nice fix.

> I've run the full test 5 times, and haven't seen any slow down or RCU stall
> warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
> did not trigger):
>
> [ 944.475632] BUG: Bad page state in process usemem pfn:252932
> [ 944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
> mapping:000000003a777cd9 index:0x1 pfn:0x252932
> [ 944.478575] aops:0x0 ino:dead000000000122
> [ 944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [ 944.479934] page_type: 0xffffffff()
> [ 944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
> fffffc00084a4c90
> [ 944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [ 944.482475] page dumped because: non-NULL mapping
> So what do we know?
>
> - the above page looks like it was the 3rd page of a large folio
> - words 3 and 4 are the same, meaning they are likely empty _deferred_list
> - pfn alignment is correct for this
> - The _deferred_list for all previously freed large folios was empty
> - but the folio could have been in the new deferred split batch?

I don't think it could be in a deferred split batch because we hold the
refcount at that point ...

> - free_tail_page_prepare() zeroed mapping/_deferred_list during free
> - _deferred_list was subsequently reinitialized to "empty" while on free list
>
> So how about this for a rough hypothesis:
>
> CPU1                                    CPU2
> deferred_split_scan
> list_del_init
> folio_batch_add
>                                         folio_put -> free
>                                         free_tail_page_prepare
>                                         is on deferred list? -> no
> split_huge_page_to_list_to_order
>   list_empty(folio->_deferred_list)
>     -> yes
>   list_del_init
>   mapping = NULL
>     -> (_deferred_list.prev = NULL)
>                                         put page on free list
>   INIT_LIST_HEAD(entry);
>     -> "mapping" no longer NULL
>
> But CPU1 is holding a reference, so that could only happen if a reference was
> put one too many times.  Ugh.

Before we start blaming the CPU for doing something impossible, what if
we're taking the wrong lock?  I know that seems crazy, but if page->flags
gets corrupted to the point where we change some of the bits in the nid,
then when we free the folio, we call folio_undo_large_rmappable(), get the
wrong ds_queue back from get_deferred_split_queue(), take the wrong
split_queue_lock, corrupt the deferred list of a different node, and bad
things happen?
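To make the wrong-lock idea a bit more concrete, here's a minimal sketch of
the lookup path I have in mind; queue_for() is just an illustrative
stand-in for get_deferred_split_queue() (memcg case omitted, details may
differ from your tree):

	/*
	 * Illustrative sketch only, not verbatim mm/huge_memory.c:
	 * folio_nid() decodes the node id from page->flags, and the
	 * per-node deferred split queue (and its split_queue_lock) is
	 * looked up from that node's pgdat.  Corrupted nid bits would
	 * therefore make us take a *different* node's lock while
	 * touching this folio's _deferred_list.
	 */
	static struct deferred_split *queue_for(struct folio *folio)
	{
		int nid = folio_nid(folio);	/* comes from page->flags */

		return &NODE_DATA(nid)->deferred_split_queue;
	}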
I don't think we can detect that folio->nid has become corrupted in the
page allocation/freeing code (can we?), but we can tell if a folio is on
the wrong ds_queue in deferred_split_scan():

	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
							_deferred_list) {
+		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
		list_del_init(&folio->_deferred_list);

(also testing the hypothesis that somehow a split folio has ended up on
the deferred split list)

This wouldn't catch the splat above early, I don't think, but it might
trigger early enough with your workload that it'd be useful information.

(I reviewed the patch you're currently testing with and it matches with
what I think we should be doing)