On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
> Hang on, I think I see it. It is a race between folio freeing and
> deferred_split_scan(), but page migration is absolved. Look:
>
> CPU 1: deferred_split_scan:
> spin_lock_irqsave(split_queue_lock)
> list_for_each_entry_safe()
> folio_try_get()
> list_move(&folio->_deferred_list, &list);
> spin_unlock_irqrestore(split_queue_lock)
> list_for_each_entry_safe() {
> 	folio_trylock() <- fails
> 	folio_put(folio);
>
> CPU 2: folio_put:
> folio_undo_large_rmappable
> 	ds_queue = get_deferred_split_queue(folio);
> 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> 	list_del_init(&folio->_deferred_list);
> *** at this point CPU 1 is not holding the split_queue_lock; the
> folio is on the local list. Which we just corrupted ***
>
> Now anything can happen. It's a pretty tight race that involves at
> least two CPUs (CPU 2 might have been the one to have the folio locked
> at the time CPU 1 called folio_trylock()). But I definitely widened
> the window by moving the decrement of the refcount and the removal from
> the deferred list further apart.
>
> OK, so what's the solution here? Personally I favour using a
> folio_batch in deferred_split_scan() to hold the folios that we're
> going to try to remove instead of a linked list. Other ideas that are
> perhaps less intrusive?

I looked at a few options, but I think we need to keep the refcount
elevated until we've got the folios back on the deferred split list.
And we can't call folio_put() while holding the split_queue_lock, or
we'll deadlock (if it's the final put, it recurses into
folio_undo_large_rmappable(), which takes that same lock -- the CPU 2
path above). So we need to maintain a list of folios that isn't
linked through deferred_list. Anyway, this is basically untested,
except that it compiles.

Opinions? Better patches?
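For reference, this is the folio_batch pattern the patch leans on, as I
read it from include/linux/pagevec.h. Sketch only, for illustration,
not part of the patch; batch_sketch() is a made-up name and the
surrounding function is invented:

/*
 * Sketch only: a folio_batch is a fixed-size array of PAGEVEC_SIZE (31)
 * folio pointers plus 'i'/'nr' cursors, so it can carry pinned folios
 * without using folio->_deferred_list as linkage.
 */
#include <linux/mm.h>
#include <linux/pagevec.h>

static void batch_sketch(struct deferred_split *ds_queue)
{
	struct folio_batch batch;
	struct folio *folio, *next;
	unsigned long flags;

	folio_batch_init(&batch);	/* nr = 0, i = 0 */

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
							_deferred_list) {
		if (!folio_try_get(folio))	/* lost to the final put */
			continue;
		list_del_init(&folio->_deferred_list);
		/* folio_batch_add() returns the space left; 0 means full */
		if (folio_batch_add(&batch, folio) == 0)
			break;
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	/*
	 * Every entry still holds the reference from folio_try_get(), so
	 * none of these folios can reach the freeing path while we work
	 * on them.  folio_batch_next() only advances batch.i forwards;
	 * it drops no references.
	 */
	while ((folio = folio_batch_next(&batch)) != NULL) {
		/* operate on the pinned folio, no spinlock held */
	}

	folios_put(&batch);	/* drop every reference taken above */
}

One side effect of the fixed size: a single pass is capped at 31
folios, which is why the patch decrements sc->nr_to_scan and breaks
out as soon as folio_batch_add() reports the batch is full.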
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..0120a47ea7a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+		if (!folio_try_get(folio))
+			continue;
+		if (!folio_trylock(folio))
+			continue;
+		list_del_init(&folio->_deferred_list);
+		if (folio_batch_add(&batch, folio) == 0) {
+			--sc->nr_to_scan;
+			break;
 		}
 		if (!--sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.
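To spell out why keeping the refcount elevated closes the race:
folio_undo_large_rmappable() is only reachable from the final
folio_put(). Very roughly (a heavily simplified sketch of the freeing
path from memory, not verbatim kernel source; the zone-device and
hugetlb branches are omitted):

static inline void folio_put(struct folio *folio)
{
	if (folio_put_testzero(folio))	/* only the *final* put... */
		__folio_put(folio);	/* ...starts the teardown */
}

void __folio_put(struct folio *folio)
{
	if (folio_test_large(folio))
		/*
		 * Takes ds_queue->split_queue_lock and list_del_init()s
		 * folio->_deferred_list -- the CPU 2 path quoted above.
		 */
		folio_undo_large_rmappable(folio);
	/* ... uncharge and free the underlying pages ... */
}

Which is why the patch only calls folios_put() at the very end, after
the surviving folios are back on the split queue and the lock has been
dropped.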