[PATCH] [RFC] mm: migrate: rcu stalls because of invalid swap cache entries

Charan Teja Kalla <quic_charante@xxxxxxxxxxx> · Tue, 21 Nov 2023 18:00:40 +0530

The below race on a folio between reclaim and migration exposed a bug
of not populating the swap cache with proper folio resulting into the
rcu stalls:

Reclaim                         migration
			      (from mem offline)
--------                      ------------------
1) folio_trylock();

2) add_to_swap():
   a) get swap_entry to
     store the thp folio
     through folio_alloc_swap().

   b) do add_to_swap_cache() on
      a folio, which fills the xarray
      with folio corresponding to
      indexes of swap entries. Also
      dirty this folio.

   c) try_to_unmap(folio, TTU_SPLIT_HUGE_PMD)
      which splits the pmd, unmaps the folio
      and replace the mapped entries with
      swap entries.

   d) as the folio is dirty do,
     pageout()::mapping->a_ops->writepage().
     This calls swap_writepage() which unlock
     the page from 1) and do submit_bio().

3) Since the page can still be under writeback,
add the folio added back to the LRU.

				4) As the folio now on LRU,
				it is visible to migration
				thus will endup in
				offline_pages()->migrate_pages():
				  a) isolate the folio.
				  b) do __unmap_and_move():
				     1) lock the folio and wait till
					writeback is done.
				     2) Replace the eligible pte entries
					with migrate and then issue the
					move_to_new_folio(), which calls
					migrate_folio()->
					folio_migrate_mapping(), for the
					pages on the swap cache which
					just replace a single swap cache
					entry source folio with
					destination folio and can endup
					in freeing the source folio.

Now A process in parallel can endup in do_swap_page() which will try
read the stale entry(of source folio) after step4 above and thus will
endup in the below loop with rcu lock held.
mapping_get_entry():
 rcu_read_lock();
repeat:
    xas_reset(&xas);
	folio = xas_load(&xas);
	if (!folio || xa_is_value(folio))
		goto out;

	if (!folio_try_get_rcu(folio))
		goto repeat;

folio_try_get_rcu():
	if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
	   /* Either the folio has been freed, or will be freed. */
	   return false;

Because of the source folio is freed in 4.b.2) The above loop can
continue till the destination folio too is reclaimed where it is removed
from the swap cache and then set the swap cache entry to zero where the
xas_load() return 0 thus exit. And this destination folio can be either
removed immediately as part of the reclaim or can stay longer in the
swap cache because of parallel swapin happen between 3) and 4.b.1)(whose
valid pte mappings, pointing to the source folio, is replaced with the
destination folio). It is the latter case which is resulted into the rcu
stalls.

The similar sort of issue also reported sometime back and is fixed in
[1].

This issue seems to be introduced from the commit 6b24ca4a1a8d ("mm: Use
multi-index entries in the page cache"), in the function
folio_migrate_mapping()[2].

Since a large folio to be migrated and present in the swap cache can't
use the multi-index entries, and migrate code uses the same
folio_migrate_mapping() for migrating this folio, any inputs you can
provide to fix this issue, please?

What I have thought is, if the adjacent entry in the xarray is not a
sibling, then assume that it is not a multi-index entry thus store as
2^N consecutive entries.

[1] https://lore.kernel.org/all/20180406030706.GA2434@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u
[2] https://lore.kernel.org/linux-mm/20210715033704.692967-128-willy@xxxxxxxxxxxxx/#Z31mm:migrate.c

Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
---
 mm/migrate.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 35a8833..05cb4a9b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -403,6 +403,7 @@ int folio_migrate_mapping(struct address_space *mapping,
 	XA_STATE(xas, &mapping->i_pages, folio_index(folio));
 	struct zone *oldzone, *newzone;
 	int dirty;
+	void *entry;
 	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
 	long nr = folio_nr_pages(folio);
 
@@ -454,6 +455,16 @@ int folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	xas_store(&xas, newfolio);
+	entry = xas_next(&xas);
+
+	if (nr > 1 && !xa_is_sibling(entry)) {
+		int i;
+
+		for (i = 1; i < nr; ++i) {
+			xas_store(&xas, newfolio);
+			xas_next(&xas);
+		}
+	}
 
 	/*
 	 * Drop cache reference from old page by unfreezing
-- 
2.7.4