pin_user_pages*() occasionally fails due to migrate_pages() failures that,
in turn, are due to temporarily elevated folio refcounts. This happens
because a few years ago, the pin_user_pages*() APIs were upgraded to
automatically migrate pages away from ZONE_MOVABLE, but the callers were
not upgraded to handle any migration failures. And in fact, they can't
easily do so anyway, because the migration return code was filtered out:
-EAGAIN failures from migration are squashed, along with any other
failure, into -ENOMEM, thus hiding details from the upper layer callers.

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
  internal_get_user_pages_fast()
    __gup_longterm_locked()
      check_and_migrate_movable_pages()
        migrate_longterm_unpinnable_pages()
          migrate_pages()
            migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY==10)
              migrate_folio_move()
                move_to_new_folio()
                  migrate_folio()
                    migrate_folio_extra()
                      folio_migrate_mapping()
                        if (folio_ref_count(folio) != expected_count)
                                return -EAGAIN;
                        // ...and migrate_longterm_unpinnable_pages()
                        // translates -EAGAIN to -ENOMEM

Although so far I have not pinpointed the cause of such transient refcount
increases, these are sufficiently common (and expected by the entire
design) that I think we have enough information to proceed directly to a
fix. This patch shows my preferred solution, which does the following:

a) Restore the -EAGAIN return code: pass it unchanged all the way back to
   the pin_user_pages*() callers.

b) Then, retry pin_user_pages_fast() from ib_umem_get(), because that IB
   driver is displaying real failures in the field, and we've confirmed
   that a retry at this location will fix those failures. Retrying at
   this higher level (as compared to the pre-existing, low-level retry in
   migrate_pages_batch()) allows more things to happen, and more time to
   elapse, between retries.
Cc: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
Cc: Leon Romanovsky <leon@xxxxxxxxxx>
Cc: Artemy Kovalyov <artemyko@xxxxxxxxxx>
Cc: Michael Guralnik <michaelgur@xxxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Pak Markthub <pmarkthub@xxxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
---
 drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
 mm/gup.c                       | 14 +++++++++++---
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 
 	while (npages) {
 		cond_resched();
-		pinned = pin_user_pages_fast(cur_base,
-					     min_t(unsigned long, npages,
-						   PAGE_SIZE /
-						   sizeof(struct page *)),
-					     gup_flags, page_list);
-		if (pinned < 0) {
+		pinned = -ENOMEM;
+		int attempts = 0;
+		/*
+		 * pin_user_pages_fast() can return -EAGAIN, due to falling back
+		 * to gup-slow and then failing to migrate pages out of
+		 * ZONE_MOVABLE due to a transient elevated page refcount.
+		 *
+		 * One retry is enough to avoid this problem, so far, but let's
+		 * use a slightly higher retry count just in case even larger
+		 * systems have a longer-lasting transient refcount problem.
+		 *
+		 */
+		static const int MAX_ATTEMPTS = 3;
+
+		while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+			pinned = pin_user_pages_fast(cur_base,
+						     min_t(unsigned long,
+							   npages, PAGE_SIZE /
+							   sizeof(struct page *)),
+						     gup_flags, page_list);
 			ret = pinned;
-			goto umem_release;
+			attempts++;
+
+			if (pinned == -EAGAIN)
+				continue;
 		}
+		if (pinned < 0)
+			goto umem_release;
 
 		cur_base += pinned * PAGE_SIZE;
 		npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
 	}
 
 	if (!list_empty(movable_page_list)) {
+		int rc;
 		struct migration_target_control mtc = {
 			.nid = NUMA_NO_NODE,
 			.gfp_mask = GFP_USER | __GFP_NOWARN,
 		};
 
-		if (migrate_pages(movable_page_list, alloc_migration_target,
+		rc = migrate_pages(movable_page_list, alloc_migration_target,
 				  NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				  MR_LONGTERM_PIN, NULL)) {
-			ret = -ENOMEM;
+				  MR_LONGTERM_PIN, NULL);
+		if (rc) {
+			/*
+			 * Report any failure *except* -EAGAIN as "not enough
+			 * memory". -EAGAIN is valuable because callers further
+			 * up the call stack can benefit from a retry.
+			 */
+			if (rc != -EAGAIN)
+				ret = -ENOMEM;
 			goto err;
 		}
 	}

base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
-- 
2.45.0