+ David Hildenbrand On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > In offline_pages(), do_migrate_range() may potentially retry forever if > the migration fails. Add a return value for do_migrate_range(), and > allow offline_page() to try migrating pages 5 times before erroring > out, similar to how migration failures in __alloc_contig_migrate_range() > is handled. I'm curious if this could cause unexpected behavioral differences to memory hotplugging users, and how '5' is chosen. Could you please enlighten me? > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > --- > mm/memory_hotplug.c | 13 ++++++------- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 621ae1015106..49402442ea3b 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1770,13 +1770,14 @@ static int scan_movable_pages(unsigned long start, unsigned long end, > return 0; > } > > -static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) > +static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) Seems the return value is used for only knowing if it is failed or not. If there is no plan to use the error code in future, what about using bool return type? > { > struct folio *folio; > unsigned long pfn; > LIST_HEAD(source); > static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, > DEFAULT_RATELIMIT_BURST); > + int ret = 0; > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > struct page *page; > @@ -1833,7 +1834,6 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) > .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, > .reason = MR_MEMORY_HOTPLUG, > }; > - int ret; > > /* > * We have checked that migration range is on a single zone so > @@ -1863,6 +1863,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) > putback_movable_pages(&source); > } > } > + return ret; > } > > static int __init cmdline_parse_movable_node(char *p) > @@ -1940,6 +1941,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, > const int node = zone_to_nid(zone); > unsigned long flags; > struct memory_notify arg; > + unsigned int tries = 0; > char *reason; > int ret; > > @@ -2028,11 +2030,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, > > ret = scan_movable_pages(pfn, end_pfn, &pfn); > if (!ret) { > - /* > - * TODO: fatal migration failures should bail > - * out > - */ > - do_migrate_range(pfn, end_pfn); > + if (do_migrate_range(pfn, end_pfn) && ++tries == 5) > + ret = -EBUSY; > } In the '++tries == 5' case, users will show the failure reason as "unmovable page" from the debug log. What about setting 'reason' here to be more specific, e.g., "multiple migration failures"? Also, my humble understanding of the intention of this change is as follow. If there are 'AS_WRITEBACK_MAY_BLOCK' pages in the migration target range, do_migrate_range() will continuously fail. And hence this could become infinite loop. Pleae let me know if I'm misunderstanding. But if I'm not wrong... There is a check for expected failures above (scan_movable_pages()). What about adding 'AS_WRITEBACK_MAY_BLOCK' pages existence check there? > } while (!ret); > > -- > 2.43.5 Thanks, SJ