pin_user_pages*() occasionally fails due to migrate_pages() failures that,
in turn, are due to temporarily elevated folio refcounts. This happens
because a few years ago, the pin_user_pages*() APIs were upgraded to
automatically migrate pages away from ZONE_MOVABLE, but the callers were
not upgraded to handle any migration failures. And in fact, they can't
easily do so anyway, because the migration return code was filtered out:
-EAGAIN failures from migration are squashed, along with any other
failure, into -ENOMEM, thus hiding details from the upper layer callers.

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
  internal_get_user_pages_fast()
    __gup_longterm_locked()
      check_and_migrate_movable_pages()
        migrate_longterm_unpinnable_pages()
          migrate_pages()
            migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY==10)
              migrate_folio_move()
                move_to_new_folio()
                  migrate_folio()
                    migrate_folio_extra()
                      folio_migrate_mapping()
                        if (folio_ref_count(folio) != expected_count)
                                return -EAGAIN;
                        // ...and migrate_longterm_unpinnable_pages()
                        // translates -EAGAIN to -ENOMEM

Although so far I have not pinpointed the cause of such transient refcount
increases, these are sufficiently common (and expected by the entire
design) that I think we have enough information to proceed directly to a
fix. This patch shows my preferred solution, which does the following:

a) Restore the -EAGAIN return code: pass it unchanged all the way back to
   the pin_user_pages*() callers.

b) Then, retry pin_user_pages_fast() from ib_umem_get(), because that IB
   driver is displaying real failures in the field, and we've confirmed
   that a retry at this location will fix those failures. Retrying at
   this higher level (as compared to the pre-existing, low-level retry in
   migrate_pages_batch()) allows more things to happen, and more time to
   elapse, between retries.
Cc: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
Cc: Leon Romanovsky <leon@xxxxxxxxxx>
Cc: Artemy Kovalyov <artemyko@xxxxxxxxxx>
Cc: Michael Guralnik <michaelgur@xxxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Pak Markthub <pmarkthub@xxxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
---
 drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
 mm/gup.c                       | 14 +++++++++++---
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
 	while (npages) {
 		cond_resched();
-		pinned = pin_user_pages_fast(cur_base,
-					     min_t(unsigned long, npages,
-						   PAGE_SIZE /
-						   sizeof(struct page *)),
-					     gup_flags, page_list);
-		if (pinned < 0) {
+		pinned = -ENOMEM;
+		int attempts = 0;
+		/*
+		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
+		 * to gup-slow and then failing to migrate pages out of
+		 * ZONE_MOVABLE due to a transient elevated page refcount.
+		 *
+		 * One retry is enough to avoid this problem, so far, but let's
+		 * use a slightly higher retry count just in case even larger
+		 * systems have a longer-lasting transient refcount problem.
+		 *
+		 */
+		static const int MAX_ATTEMPTS = 3;
+
+		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+			pinned = pin_user_pages_fast(cur_base,
+						     min_t(unsigned long,
+							   npages, PAGE_SIZE /
+							   sizeof(struct page *)),
+						     gup_flags, page_list);
 			ret = pinned;
-			goto umem_release;
+			attempts++;
+
+			if (pinned == -EAGAIN)
+				continue;
 		}
+		if (pinned < 0)
+			goto umem_release;
 
 		cur_base += pinned * PAGE_SIZE;
 		npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
 	}
 
 	if (!list_empty(movable_page_list)) {
+		int rc;
 		struct migration_target_control mtc = {
 			.nid = NUMA_NO_NODE,
 			.gfp_mask = GFP_USER | __GFP_NOWARN,
 		};
 
-		if (migrate_pages(movable_page_list, alloc_migration_target,
+		rc = migrate_pages(movable_page_list, alloc_migration_target,
 				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				  MR_LONGTERM_PIN, NULL)) {
-			ret = -ENOMEM;
+				  MR_LONGTERM_PIN, NULL);
+		if (rc) {
+			/*
+			 * Report any failure *except* -EAGAIN as "not enough
+			 * memory". -EAGAIN is valuable because callers further
+			 * up the call stack can benefit from a retry.
+			 */
+			if (rc != -EAGAIN)
+				ret = -ENOMEM;
 			goto err;
 		}
 	}

base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
-- 
2.45.0