On 09/03/2024 06:09, Matthew Wilcox wrote:
> On Fri, Mar 08, 2024 at 11:44:35AM +0000, Ryan Roberts wrote:
>>> The thought occurs that we don't need to take the folios off the list.
>>> I don't know that will fix anything, but this will fix your "running out
>>> of memory" problem -- I forgot to drop the reference if folio_trylock()
>>> failed. Of course, I can't call folio_put() inside the lock, so may
>>> as well move the trylock back to the second loop.
>
> I think this was a bad thought ...

The not-taking-folios-off-the-list thought? Yes, agreed.

>
>> Dumping all the CPU back traces with gdb, all the cores (except one) are
>> contending on the deferred split lock.
>
> I'm pretty sure that we can call the shrinker on multiple CPUs at the
> same time (can you confirm from the backtrace?)

Yes, the vast majority of the CPUs were in deferred_split_scan() waiting for
the split_queue_lock.

>
> 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
> 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>
> so if two CPUs try to shrink the same node, they're going to try to
> process the same set of folios. Which means the split will keep failing
> because each of them will have a refcount on the folio, and ... yeah.

Ahh, ouch. So this probably explains why things started going slow for me again
last night. (A small userspace sketch of this double-pin failure mode, and of
the batched claim that avoids it, follows after the patch below.)

>
> If so, we need to take the folios off the list (or otherwise mark them)
> so that they can't be processed by more than one CPU at a time. And
> that leads me to this patch (yes, folio_prep_large_rmappable() is
> now vestigial, but removing it increases the churn a bit much for this
> stage of debugging)

Looks sensible on first review. I'll do some testing now to see if I can
re-trigger the non-NULL mapping issue. Will get back to you in the next couple
of hours.

>
> This time I've boot-tested it. I'm running my usual test-suite against
> it now with little expectation that it will trigger. If I have time
> I'll try to recreate your setup.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..2ca033a6c3d8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
>  {
>  	if (!folio || !folio_test_large(folio))
>  		return;
> -	if (folio_order(folio) > 1)
> -		INIT_LIST_HEAD(&folio->_deferred_list);
>  	folio_set_large_rmappable(folio);
>  }
>
> @@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>
> @@ -3321,36 +3319,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	/* Take pin on all head pages to avoid freeing them under us */
> +	/* Take ref on all folios to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  						_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> +		list_del_init(&folio->_deferred_list);
> +		sc->nr_to_scan--;
> +		if (!folio_try_get(folio)) {
>  			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
>  			ds_queue->split_queue_len--;
> +		} else if (folio_batch_add(&batch, folio) == 0) {
> +			break;
>  		}
> -		if (!--sc->nr_to_scan)
> +		if (!sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +			continue;
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
> +		if (!folio_test_large(folio))
> +			continue;
> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> +	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	folios_put(&batch);
>
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
> diff --git a/mm/internal.h b/mm/internal.h
> index 1dfdc3bde1b0..14c21d06f233 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  	atomic_set(&folio->_entire_mapcount, -1);
>  	atomic_set(&folio->_nr_pages_mapped, 0);
>  	atomic_set(&folio->_pincount, 0);
> +	if (order > 1)
> +		INIT_LIST_HEAD(&folio->_deferred_list);
>  }
>
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 025ad1a7df7b..fc9c7ca24c4c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		break;
>  	case 2:
>  		/*
> -		 * the second tail page: ->mapping is
> -		 * deferred_list.next -- ignore value.
> +		 * the second tail page: ->mapping is deferred_list.next
>  		 */
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "still on deferred list");
> +			goto out;
> +		}
>  		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {
>
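For anyone following along, here is a minimal userspace sketch of the idea the
patch moves to; it is not kernel code, and all the names (item, scanner,
try_split, NBATCH, NSCANNERS) are made up for illustration. Each scanner
unlinks entries from the shared queue while holding the lock into a private,
bounded batch (loosely mirroring folio_batch), then does the expensive work
outside the lock. Because an entry is unlinked before it is pinned, two
scanners can never both hold a pin on the same entry, which is exactly the
condition that was making split_folio() fail above.

/*
 * Hypothetical userspace sketch, not kernel code.  Claim-then-process:
 * unlink entries from a shared queue under a lock into a private batch,
 * so concurrent scanners never pin the same entry.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NITEMS		64
#define NBATCH		15	/* small, bounded local capacity, like folio_batch */
#define NSCANNERS	4

struct item {
	struct item *next;	/* link for the shared queue */
	atomic_int refcount;	/* 1 == only the base reference remains */
};

static struct item items[NITEMS];
static struct item *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int split_ok, split_failed;

/* "Split" succeeds only if nobody else holds a pin: base ref + our pin == 2. */
static int try_split(struct item *it)
{
	int expected = 2;

	return atomic_compare_exchange_strong(&it->refcount, &expected, 2);
}

static void *scanner(void *arg)
{
	struct item *batch[NBATCH];
	int i, n = 0;

	(void)arg;

	/*
	 * Claim work under the lock: unlink each entry before pinning it, so
	 * no other scanner can also pin the same entry.  Leaving entries on
	 * the shared list is what let two CPUs pin the same folio above.
	 */
	pthread_mutex_lock(&queue_lock);
	while (queue_head && n < NBATCH) {
		struct item *it = queue_head;

		queue_head = it->next;
		it->next = NULL;
		atomic_fetch_add(&it->refcount, 1);	/* our pin */
		batch[n++] = it;
	}
	pthread_mutex_unlock(&queue_lock);

	/* The expensive part runs outside the lock, on privately owned entries. */
	for (i = 0; i < n; i++) {
		if (try_split(batch[i]))
			atomic_fetch_add(&split_ok, 1);
		else
			atomic_fetch_add(&split_failed, 1);
		atomic_fetch_sub(&batch[i]->refcount, 1);	/* drop our pin */
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NSCANNERS];
	int i;

	for (i = NITEMS - 1; i >= 0; i--) {	/* build the shared queue */
		atomic_init(&items[i].refcount, 1);
		items[i].next = queue_head;
		queue_head = &items[i];
	}
	for (i = 0; i < NSCANNERS; i++)
		pthread_create(&threads[i], NULL, scanner, NULL);
	for (i = 0; i < NSCANNERS; i++)
		pthread_join(threads[i], NULL);

	printf("split ok %d, failed %d\n",
	       atomic_load(&split_ok), atomic_load(&split_failed));
	return 0;
}

Built with "cc -pthread", this should always report "failed 0", since each
entry is claimed by exactly one scanner. If the unlink-under-the-lock step is
dropped so that every scanner walks and pins the same shared list, the
exclusivity check in try_split() starts failing for everyone, which is the
behaviour the backtraces showed on the deferred split queue.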