On 5/1/24 6:05 PM, Alistair Popple wrote:
> Jason Gunthorpe <jgg@xxxxxxxxxx> writes:
>> On Tue, Apr 30, 2024 at 10:10:43PM -0700, Christoph Hellwig wrote:
...
>>> This doesn't make sense. IFF a blind retry is all that is needed it
>>> should be done in the core functionality. I fear it's not that easy,
>>> though.
>>
>> +1
>>
>> This migration retry weirdness is a GUP issue, it needs to be solved
>> in the mm not exposed to every pin_user_pages caller.
>>
>> If it turns out ZONE_MOVABLE pages can't actually be reliably moved
>> then it is pretty broken..
>
> I wonder if we should remove the arbitrary retry limit in
> migrate_pages() entirely for ZONE_MOVABLE pages and just loop until
> they migrate? By definition there should only be transient references on
> these pages so why do we need to limit the number of retries in the
> first place?

Well, along those lines, I can confirm that this patch also fixes the
symptoms:

diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..faa67cc441e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1728,7 +1728,9 @@ static int migrate_pages_batch(struct list_head *from,
 				else
 					goto move;
 			case -EAGAIN:
-				retry++;
+				/* For ZONE_MOVABLE folios, retry forever */
+				if (!folio_is_zone_movable(folio))
+					retry++;
 				thp_retry += is_thp;
 				nr_retry_pages += nr_pages;
 				break;
@@ -1786,7 +1788,9 @@ static int migrate_pages_batch(struct list_head *from,
 			 */
 			switch(rc) {
 			case -EAGAIN:
-				retry++;
+				/* For ZONE_MOVABLE folios, retry forever */
+				if (!folio_is_zone_movable(folio))
+					retry++;
 				thp_retry += is_thp;
 				nr_retry_pages += nr_pages;
 				break;

thanks,
--
John Hubbard
NVIDIA