Re: [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 8, 2024 at 9:33 AM SeongJae Park <sj@xxxxxxxxxx> wrote:
>
> + David Hildenbrand
>
> On Thu, 7 Nov 2024 15:56:12 -0800 Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> > In offline_pages(), do_migrate_range() may potentially retry forever if
> > the migration fails. Add a return value for do_migrate_range(), and
> > allow offline_page() to try migrating pages 5 times before erroring
> > out, similar to how migration failures in __alloc_contig_migrate_range()
> > is handled.
>

Hi SeongJae,

Thanks for taking a look. I'm going to drop this patch per the
conversation with David and Shakeel below, but wanted to reply back to
some of the questions here for completion's sake.

> I'm curious if this could cause unexpected behavioral differences to memory
> hotplugging users, and how '5' is chosen.  Could you please enlighten me?
>

Most of this logic was copied from  __alloc_contig_migrate_range() -
in that function, '5' is hard-coded as the number of times to retry
for migration failures. No other reason for '5' other than that.

> >
> > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> > ---
> >  mm/memory_hotplug.c | 13 ++++++-------
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 621ae1015106..49402442ea3b 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1770,13 +1770,14 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
> >       return 0;
> >  }
> >
> > -static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> > +static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
> Seems the return value is used for only knowing if it is failed or not.  If
> there is no plan to use the error code in future, what about using bool return
> type?
>
> >  {
> >       struct folio *folio;
> >       unsigned long pfn;
> >       LIST_HEAD(source);
> >       static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
> >                                     DEFAULT_RATELIMIT_BURST);
> > +     int ret = 0;
> >
> >       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> >               struct page *page;
> > @@ -1833,7 +1834,6 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >                       .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> >                       .reason = MR_MEMORY_HOTPLUG,
> >               };
> > -             int ret;
> >
> >               /*
> >                * We have checked that migration range is on a single zone so
> > @@ -1863,6 +1863,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >                       putback_movable_pages(&source);
> >               }
> >       }
> > +     return ret;
> >  }
> >
> >  static int __init cmdline_parse_movable_node(char *p)
> > @@ -1940,6 +1941,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> >       const int node = zone_to_nid(zone);
> >       unsigned long flags;
> >       struct memory_notify arg;
> > +     unsigned int tries = 0;
> >       char *reason;
> >       int ret;
> >
> > @@ -2028,11 +2030,8 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> >
> >                       ret = scan_movable_pages(pfn, end_pfn, &pfn);
> >                       if (!ret) {
> > -                             /*
> > -                              * TODO: fatal migration failures should bail
> > -                              * out
> > -                              */
> > -                             do_migrate_range(pfn, end_pfn);
> > +                             if (do_migrate_range(pfn, end_pfn) && ++tries == 5)
> > +                                     ret = -EBUSY;
> >                       }
>
> In the '++tries == 5' case, users will show the failure reason as "unmovable
> page" from the debug log.  What about setting 'reason' here to be more
> specific, e.g., "multiple migration failures"?
>
> Also, my humble understanding of the intention of this change is as follow.  If
> there are 'AS_WRITEBACK_MAY_BLOCK' pages in the migration target range,
> do_migrate_range() will continuously fail.  And hence this could become
> infinite loop.  Pleae let me know if I'm misunderstanding.
>
> But if I'm not wrong...  There is a check for expected failures above
> (scan_movable_pages()).  What about adding 'AS_WRITEBACK_MAY_BLOCK' pages
> existence check there?

The main difference between adding migrate_pages() retries (this
patch) vs adding an 'AS_WRITEBACK_MAY_BLOCK' check in
scan_movable_pages() is that in the latter, all pages in an
'AS_WRITEBACK_MAY_BLOCK' mapping will be skipped for migration whereas
in the former, only pages under writeback will be skipped. I think the
latter is probably fine too for this case but the former seemed a bit
more optimal to me.

Thanks,
Joanne


>
> >               } while (!ret);
> >
> > --
> > 2.43.5
>
>
> Thanks,
> SJ





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux