Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

On Sat, Mar 09, 2024 at 09:38:42AM +0000, Ryan Roberts wrote:
> > I think split_queue_len is getting out of sync with the number of items on the
> > queue? We only decrement it if we lost the race with folio_put(). But we are
> > unconditionally taking folios off the list here. So we are definitely out of
> > sync until we take the lock again below. But we only put folios back on the list
> > that failed to split. A successful split used to decrement this variable
> > (because the folio was on _a_ list). But now it doesn't. So we are always
> > mismatched after the first failed split?
> 
> Oops, I meant first *successful* split.

Agreed, nice fix.
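
For the archive, the shape I think we're converging on looks something
like this -- a sketch against my reading of the loop, with "failed" and
"nr_failed" being names I just made up, not the patch's:

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
							_deferred_list) {
		if (folio_try_get(folio))
			folio_batch_add(&batch, folio);
		/* else we lost the race with folio_put() */
		list_del_init(&folio->_deferred_list);
		ds_queue->split_queue_len--;	/* off the queue either way */
		if (!--sc->nr_to_scan)
			break;
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/* ... try to split each folio in the batch; failures are
	 * collected on a local "failed" list ... */

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&failed, &ds_queue->split_queue);
	ds_queue->split_queue_len += nr_failed;
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

That way a successful split never touches the counter again, which
matches the list it's no longer on.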

> I've run the full test 5 times, and haven't seen any slow down or RCU stall
> warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
> did not trigger):
> 
> [  944.475632] BUG: Bad page state in process usemem  pfn:252932
> [  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0
> mapping:000000003a777cd9 index:0x1 pfn:0x252932
> [  944.478575] aops:0x0 ino:dead000000000122
> [  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
> [  944.479934] page_type: 0xffffffff()
> [  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90
> fffffc00084a4c90
> [  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  944.482475] page dumped because: non-NULL mapping

> So what do we know?
> 
>  - the above page looks like it was the 3rd page of a large folio
>     - words 3 and 4 are the same, meaning this is likely an empty _deferred_list
>     - pfn alignment is correct for this
>  - The _deferred_list for all previously freed large folios was empty
>     - but the folio could have been in the new deferred split batch?

I don't think it could be in a deferred split batch, because we hold the
refcount at that point ...
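
FWIW, the raw words in your splat decode consistently with that
reading.  Annotating them against the struct folio layout as I
understand it (so the field names are my interpretation, not something
the dump itself tells us):

	/*
	 * pfn 0x252932: pfn & 3 == 2, so this is the second tail page
	 * of an order >= 2 folio -- the page that carries _deferred_list.
	 *
	 * raw: 0bfffc0000000000	word 0: flags		(_flags_2a)
	 *      0000000000000000	word 1: compound_head	(_head_2a, cleared)
	 *      fffffc00084a4c90	word 2: _deferred_list.next
	 *      fffffc00084a4c90	word 3: _deferred_list.prev
	 *
	 * next == prev, both pointing 0x10 bytes into a struct page,
	 * i.e. at word 2 itself: an empty list.  And word 3 is
	 * page->mapping in the plain struct page view, which is why the
	 * bad-page check reports "non-NULL mapping".
	 */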

>  - free_tail_page_prepare() zeroed mapping/_deferred_list during free
>  - _deferred_list was subsequently reinitialized to "empty" while on free list
> 
> So how about this for a rough hypothesis:
> 
> 
> CPU1                                  CPU2
> deferred_split_scan
>   list_del_init
>   folio_batch_add
>                                       folio_put -> free
>                                         free_tail_page_prepare
>                                           is on deferred list? -> no
> split_huge_page_to_list_to_order
>   list_empty(folio->_deferred_list)
>     -> yes
>   list_del_init
>                                           mapping = NULL
>                                             -> (_deferred_list.prev = NULL)
>                                         put page on free list
>     INIT_LIST_HEAD(entry);
>       -> "mapping" no longer NULL
> 
> 
> But CPU1 is holding a reference, so that could only happen if a reference was
> put one too many times. Ugh.
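
The last two steps deserve spelling out, because they only interact
due to field aliasing: in the second tail page, _deferred_list overlays
lru.prev and mapping.  Here's a userspace toy that demonstrates the
effect -- the layout is simplified by hand, so take the exact offsets
as my assumption rather than something lifted from the tree:

	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	/* A 4-word slice of the second tail page, viewed both ways */
	union tail_page {
		struct {			/* struct page view */
			unsigned long flags;
			unsigned long compound_head;
			unsigned long lru_prev;
			void *mapping;		/* word 3 */
		} page;
		struct {			/* struct folio view */
			unsigned long _flags_2a;
			unsigned long _head_2a;
			struct list_head _deferred_list; /* words 2-3 */
		} folio;
	};

	int main(void)
	{
		union tail_page t = { { 0 } };

		/* CPU2, freeing: mapping = NULL also clears _deferred_list.prev */
		t.page.mapping = NULL;

		/* CPU1: INIT_LIST_HEAD(&folio->_deferred_list) */
		t.folio._deferred_list.next = &t.folio._deferred_list;
		t.folio._deferred_list.prev = &t.folio._deferred_list;

		/* ... and now the freed page trips "non-NULL mapping" */
		printf("mapping = %p\n", t.page.mapping);
		return 0;
	}

So the hypothesis is at least mechanically sound; the only impossible
part is the extra reference drop.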

Before we start blaming the CPU for doing something impossible, what if
we're taking the wrong lock?  I know that seems crazy, but if page->flags
gets corrupted to the point where some of the nid bits change, then when
we free the folio we call folio_undo_large_rmappable(), get the wrong
ds_queue back from get_deferred_split_queue(), take the wrong
split_queue_lock, corrupt the deferred list of a different node, and bad
things happen?
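
To see why that's plausible, here's the node-based path of
get_deferred_split_queue(), paraphrased from memory (the real thing
also has a memcg path, which I'm leaving out):

	static struct deferred_split *get_deferred_split_queue(struct folio *folio)
	{
		/* folio_nid() extracts the nid bits from folio->flags */
		struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

		return &pgdat->deferred_split_queue;
	}

A couple of flipped nid bits there, and both the lock we take and the
list we modify belong to some other node.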

I don't think we can detect that folio->nid has become corrupted in the
page allocation/freeing code (can we?), but we can tell if a folio is
on the wrong ds_queue in deferred_split_scan():

        list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
                                                        _deferred_list) {
+		VM_BUG_ON_FOLIO(folio_nid(folio) != sc->nid, folio);
+		VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
                list_del_init(&folio->_deferred_list);

(also testing the hypothesis that somehow a split folio has ended up
on the deferred split list)

I don't think this would catch the splat above, but it might trigger
early enough with your workload to give us useful information.

(I reviewed the patch you're currently testing with and it matches with
what I think we should be doing)



