Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
> Hang on, I think I see it.  It is a race between folio freeing and
> deferred_split_scan(), but page migration is absolved.  Look:
> 
> CPU 1: deferred_split_scan:
> spin_lock_irqsave(split_queue_lock)
> list_for_each_entry_safe()
> folio_try_get()
> list_move(&folio->_deferred_list, &list);
> spin_unlock_irqrestore(split_queue_lock)
> list_for_each_entry_safe() {
> 	folio_trylock() <- fails
> 	folio_put(folio);
> 
> CPU 2: folio_put:
> folio_undo_large_rmappable
>         ds_queue = get_deferred_split_queue(folio);
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>                 list_del_init(&folio->_deferred_list);
> *** at this point CPU 1 is not holding the split_queue_lock; the
> folio is on the local list.  Which we just corrupted ***
> 
> Now anything can happen.  It's a pretty tight race that involves at
> least two CPUs (CPU 2 might have been the one to have the folio locked
> at the time CPU 1 caalled folio_trylock()).  But I definitely widened
> the window by moving the decrement of the refcount and the removal from
> the deferred list further apart.
> 
> 
> OK, so what's the solution here?  Personally I favour using a
> folio_batch in deferred_split_scan() to hold the folios that we're
> going to try to remove instead of a linked list.  Other ideas that are
> perhaps less intrusive?

I looked at a few options, but I think we need to keep the refcount
elevated until we've got the folios back on the deferred split list.
And we can't call folio_put() while holding the split_queue_lock or
we'll deadlock.  So we need to maintain a list of folios that isn't
linked through deferred_list.  Anyway, this is basically untested,
except that it compiles.

Opinions?  Better patches?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..0120a47ea7a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
-			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+		if (!folio_try_get(folio))
+			continue;
+		if (!folio_trylock(folio))
+			continue;
+		list_del_init(&folio->_deferred_list);
+		if (folio_batch_add(&batch, folio) == 0) {
+			--sc->nr_to_scan;
+			break;
 		}
 		if (!--sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
+	folios_put(&batch);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux