[...]

>>>
>>> If so, we need to take the folios off the list (or otherwise mark them)
>>> so that they can't be processed by more than one CPU at a time. And
>>> that leads me to this patch (yes, folio_prep_large_rmappable() is
>>> now vestigial, but removing it increases the churn a bit much for this
>>> stage of debugging)
>>
>> Looks sensible on first review. I'll do some testing now to see if I can
>> re-trigger the non-NULL mapping issue. Will get back to you in the next couple of
>> hours.
>>

[...]

>>> -		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> +		list_del_init(&folio->_deferred_list);
>>> +		sc->nr_to_scan--;
>>> +		if (!folio_try_get(folio)) {
>>>  			/* We lost race with folio_put() */
>>> -			list_del_init(&folio->_deferred_list);
>>>  			ds_queue->split_queue_len--;
>
> I think split_queue_len is getting out of sync with the number of items on the
> queue? We only decrement it if we lost the race with folio_put(). But we are
> unconditionally taking folios off the list here. So we are definitely out of
> sync until we take the lock again below. But we only put folios back on the list
> that failed to split. A successful split used to decrement this variable
> (because the folio was on _a_ list). But now it doesn't. So we are always
> mismatched after the first failed split? Oops, I meant first *successful* split.
>
> I'll fix this up before I start testing.
>

I've run the full test 5 times, and haven't seen any slow down or RCU stall
warning. But on the 5th time, I saw the non-NULL mapping oops (your new check
did not trigger):

[  944.475632] BUG: Bad page state in process usemem  pfn:252932
[  944.477314] page:00000000ad4feba6 refcount:0 mapcount:0 mapping:000000003a777cd9 index:0x1 pfn:0x252932
[  944.478575] aops:0x0 ino:dead000000000122
[  944.479130] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[  944.479934] page_type: 0xffffffff()
[  944.480328] raw: 0bfffc0000000000 0000000000000000 fffffc00084a4c90 fffffc00084a4c90
[  944.481734] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  944.482475] page dumped because: non-NULL mapping
[  944.482938] Modules linked in:
[  944.483238] CPU: 9 PID: 3805 Comm: usemem Not tainted 6.8.0-rc5-00391-g44b0dc848590 #39
[  944.484308] Hardware name: linux,dummy-virt (DT)
[  944.484981] Call trace:
[  944.485315]  dump_backtrace+0x9c/0x100
[  944.485840]  show_stack+0x20/0x38
[  944.486300]  dump_stack_lvl+0x90/0xb0
[  944.486833]  dump_stack+0x18/0x28
[  944.487178]  bad_page+0x88/0x128
[  944.487523]  __rmqueue_pcplist+0x24c/0xa18
[  944.488001]  get_page_from_freelist+0x278/0x1288
[  944.488579]  __alloc_pages+0xec/0x1018
[  944.489097]  alloc_pages_mpol+0x11c/0x278
[  944.489455]  vma_alloc_folio+0x70/0xd0
[  944.489988]  do_huge_pmd_anonymous_page+0xb8/0xda8
[  944.490819]  __handle_mm_fault+0xd00/0x1a28
[  944.491497]  handle_mm_fault+0x7c/0x418
[  944.491978]  do_page_fault+0x100/0x690
[  944.492357]  do_translation_fault+0xb4/0xd0
[  944.492968]  do_mem_abort+0x4c/0xa8
[  944.493353]  el0_da+0x54/0xb8
[  944.493818]  el0t_64_sync_handler+0xe4/0x158
[  944.494581]  el0t_64_sync+0x190/0x198
[  944.495218] Disabling lock debugging due to kernel taint

So what do we know?

 - the above page looks like it was the 3rd page of a large folio
   - words 3 and 4 are the same, meaning they are likely empty _deferred_list
   - pfn alignment is correct for this
 - The _deferred_list for all previously freed large folios was empty
   - but the folio could have been in the new deferred split batch?
 - free_tail_page_prepare() zeroed mapping/_deferred_list during free
 - _deferred_list was subsequently reinitialized to "empty" while on free list
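As a standalone illustration of the "words 3 and 4" reasoning above, here is a
small userspace mock (not kernel code -- the structs and names below are made
up, and the field overlay is an assumption based on the 6.8-era struct
page/struct folio layout). An empty _deferred_list points back at itself, so
the two list words carry the same non-NULL value, and the one the page
allocator later reads as ->mapping is no longer NULL -- which is the signature
in the dump above:

#include <stdio.h>
#include <string.h>

/* Simplified view of one struct page's first six words. */
struct mock_page {
	unsigned long flags;
	void *lru_next;		/* also compound_head in tail pages */
	void *lru_prev;
	void *mapping;
	unsigned long index;
	unsigned long private;
};

struct mock_list_head {
	struct mock_list_head *next, *prev;
};

/* Third page of a large folio: _deferred_list reuses two of those words. */
struct mock_page2 {
	unsigned long _flags_2a;
	unsigned long _head_2a;
	struct mock_list_head _deferred_list;
};

int main(void)
{
	union {
		struct mock_page page;
		struct mock_page2 tail;
	} p;

	memset(&p, 0, sizeof(p));

	/* INIT_LIST_HEAD(): an empty list points at itself. */
	p.tail._deferred_list.next = &p.tail._deferred_list;
	p.tail._deferred_list.prev = &p.tail._deferred_list;

	/* The same memory seen through the plain struct page view. */
	printf("lru.prev = %p\n", p.page.lru_prev);
	printf("mapping  = %p\n", p.page.mapping);
	printf("equal: %d, mapping non-NULL: %d\n",
	       p.page.lru_prev == p.page.mapping,
	       p.page.mapping != NULL);
	return 0;
}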
So how about this for a rough hypothesis:

CPU1                                      CPU2
deferred_split_scan
  list_del_init
  folio_batch_add
                                          folio_put -> free
                                            free_tail_page_prepare
                                              is on deferred list? -> no
split_huge_page_to_list_to_order
  list_empty(folio->_deferred_list)
    -> yes
  list_del_init
                                          mapping = NULL
                                            -> (_deferred_list.prev = NULL)
                                          put page on free list
  INIT_LIST_HEAD(entry);
    -> "mapping" no longer NULL

But CPU1 is holding a reference, so that could only happen if a reference was
put one too many times. Ugh. (A small standalone sketch of this interleaving
follows after the patch below.)

FYI, this is the fixed-up fix patch I have:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..fde016451e1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
@@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,36 +3319,39 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/* Take pin on all head pages to avoid freeing them under us */
+	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
-		}
-		if (!--sc->nr_to_scan)
+		list_del_init(&folio->_deferred_list);
+		ds_queue->split_queue_len--;
+		sc->nr_to_scan--;
+		if (folio_try_get(folio) &&
+		    folio_batch_add(&batch, folio) == 0)
+			break;
+		if (!sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+			continue;
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+		ds_queue->split_queue_len++;
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
diff --git a/mm/internal.h b/mm/internal.h
index 1dfdc3bde1b0..14c21d06f233 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
+		 * the second tail page: ->mapping is deferred_list.next
 		 */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "still on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {
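To make the rough hypothesis above a bit more concrete, here is a standalone
userspace sketch of the interleaving (again a mock, not kernel code; the
helpers only mirror the list primitives, and the split of steps between CPU1
and CPU2 is my reading of the diagram). The point is that list_del_init() is
an unlink followed by a re-initialisation, so if the final folio_put()/free
lands in between, the trailing INIT_LIST_HEAD() writes into a page that is
already on the free list, and the word doubling as ->mapping becomes non-NULL
again:

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* First half of list_del_init(): unlink the entry from its list. */
static void __list_del_entry(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

int main(void)
{
	struct list_head queue;		/* stands in for ds_queue->split_queue */
	struct list_head deferred;	/* folio->_deferred_list in the third page */
	void **mapping = (void **)&deferred.prev; /* word that doubles as ->mapping */

	/* Folio is queued for deferred split. */
	INIT_LIST_HEAD(&queue);
	queue.next = queue.prev = &deferred;
	deferred.next = deferred.prev = &queue;

	/* CPU1: list_del_init() starts -- the folio is unlinked... */
	__list_del_entry(&deferred);

	/*
	 * CPU2: last reference goes away, folio_put() -> free;
	 * free_tail_page_prepare() clears ->mapping and the page
	 * ends up on the free list with mapping == NULL.
	 */
	*mapping = NULL;

	/* CPU1: ...and only now finishes with INIT_LIST_HEAD(). */
	INIT_LIST_HEAD(&deferred);

	/* The next allocation-time check sees a non-NULL ->mapping. */
	printf("mapping while on the free list: %p\n", *mapping);
	return 0;
}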