On 08/03/2024 11:44, Ryan Roberts wrote:
>> The thought occurs that we don't need to take the folios off the list.
>> I don't know that will fix anything, but this will fix your "running out
>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>> failed. Of course, I can't call folio_put() inside the lock, so may
>> as well move the trylock back to the second loop.
>>
>> Again, compile-tested only.
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fd745bcc97ff..4a2ab17f802d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>  	unsigned long flags;
>> -	LIST_HEAD(list);
>> +	struct folio_batch batch;
>>  	struct folio *folio, *next;
>>  	int split = 0;
>>
>> @@ -3321,36 +3321,31 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  		ds_queue = &sc->memcg->deferred_split_queue;
>>  #endif
>>
>> +	folio_batch_init(&batch);
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	/* Take pin on all head pages to avoid freeing them under us */
>> +	/* Take ref on all folios to avoid freeing them under us */
>>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>  							_deferred_list) {
>> -		if (folio_try_get(folio)) {
>> -			list_move(&folio->_deferred_list, &list);
>> -		} else {
>> -			/* We lost race with folio_put() */
>> -			list_del_init(&folio->_deferred_list);
>> -			ds_queue->split_queue_len--;
>> +		if (!folio_try_get(folio))
>> +			continue;
>> +		if (folio_batch_add(&batch, folio) == 0) {
>> +			--sc->nr_to_scan;
>> +			break;
>>  		}
>>  		if (!--sc->nr_to_scan)
>>  			break;
>>  	}
>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>
>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>>  		if (!folio_trylock(folio))
>> -			goto next;
>> -		/* split_huge_page() removes page from list on success */
>> +			continue;
>>  		if (!split_folio(folio))
>>  			split++;
>>  		folio_unlock(folio);
>> -next:
>> -		folio_put(folio);
>>  	}
>>
>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	list_splice_tail(&list, &ds_queue->split_queue);
>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> +	folios_put(&batch);
>>
>>  	/*
>>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>
>
> OK I've tested this; the good news is that I haven't seen any oopses or memory
> leaks. The bad news is that it still takes an absolute age (hours) to complete
> the same test that without "mm: Allow non-hugetlb large folios to be batch
> processed" took a couple of mins. And during that time, the system is completely
> unresponsive - serial terminal doesn't work - can't even break in with sysreq.
> And sometimes I see RCU stall warnings.
>
> Dumping all the CPU back traces with gdb, all the cores (except one) are
> contending on the deferred split lock.
>
> A couple of thoughts:
>
> - Since we are now taking a maximum of 15 folios into a batch,
> deferred_split_scan() is called much more often (in a tight loop from
> do_shrink_slab()). Could it be that we are just trying to take the lock so much
> more often now? I don't think it's quite that simple because we take the lock
> for every single folio when adding it to the queue, so the dequeuing cost should
> still be a factor of 15 locks less.
>
> - do_shrink_slab() might be calling deferred_split_scan() in a tight loop with
> deferred_split_scan() returning 0 most of the time. If there are still folios on
> the deferred split list but deferred_split_scan() was unable to lock any folios
> then it will return 0, not SHRINK_STOP, so do_shrink_slab() will keep calling
> it, essentially live locking. Has your patch changed the duration of the folio
> being locked? I don't think so...
>
> - Ahh, perhaps it's as simple as your fix has removed the code that removed the
> folio from the deferred split queue if it fails to get a reference? That could
> mean we end up returning 0 instead of SHRINK_STOP too. I'll have a play.
>

I tested the last idea by adding this back in:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d46897d7ea7f..50b07362923a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3327,8 +3327,12 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (!folio_try_get(folio))
+		if (!folio_try_get(folio)) {
+			/* We lost race with folio_put() */
+			list_del_init(&folio->_deferred_list);
+			ds_queue->split_queue_len--;
 			continue;
+		}
 		if (folio_batch_add(&batch, folio) == 0) {
 			--sc->nr_to_scan;
 			break;

The test now gets further than where it was previously getting live-locked, but
I then get a new oops (this is just yesterday's mm-unstable with your fix v2 and
the above change):

[ 247.788985] BUG: Bad page state in process usemem pfn:ae58c2
[ 247.789617] page: refcount:0 mapcount:0 mapping:00000000dc16b680 index:0x1 pfn:0xae58c2
[ 247.790129] aops:0x0 ino:dead000000000122
[ 247.790394] flags: 0xbfffc0000000000(node=0|zone=2|lastcpupid=0xffff)
[ 247.790821] page_type: 0xffffffff()
[ 247.791052] raw: 0bfffc0000000000 0000000000000000 fffffc002a963090 fffffc002a963090
[ 247.791546] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[ 247.792258] page dumped because: non-NULL mapping
[ 247.792567] Modules linked in:
[ 247.792772] CPU: 0 PID: 2052 Comm: usemem Not tainted 6.8.0-rc5-00456-g52fd6cd3bee5 #30
[ 247.793300] Hardware name: linux,dummy-virt (DT)
[ 247.793680] Call trace:
[ 247.793894] dump_backtrace+0x9c/0x100
[ 247.794200] show_stack+0x20/0x38
[ 247.794460] dump_stack_lvl+0x90/0xb0
[ 247.794726] dump_stack+0x18/0x28
[ 247.794964] bad_page+0x88/0x128
[ 247.795196] get_page_from_freelist+0xdc4/0x1280
[ 247.795520] __alloc_pages+0xe8/0x1038
[ 247.795781] alloc_pages_mpol+0x90/0x278
[ 247.796059] vma_alloc_folio+0x70/0xd0
[ 247.796320] __handle_mm_fault+0xc40/0x19a0
[ 247.796610] handle_mm_fault+0x7c/0x418
[ 247.796908] do_page_fault+0x100/0x690
[ 247.797231] do_translation_fault+0xb4/0xd0
[ 247.797584] do_mem_abort+0x4c/0xa8
[ 247.797874] el0_da+0x54/0xb8
[ 247.798123] el0t_64_sync_handler+0xe4/0x158
[ 247.798473] el0t_64_sync+0x190/0x198
[ 247.815597] Disabling lock debugging due to kernel taint

And then into RCU stalls after that. I saw a similar non-NULL mapping oops
yesterday, but with the deferred split fix in place I can now reproduce it
reliably.

My sense is that the first deferred split issue is now fully resolved once the
extra code above is reinserted, but we still have a second problem. Thoughts?

Perhaps I can bisect this given it seems pretty reproducible.

Thanks,
Ryan