On Wed 14-11-18 17:04:37, Piotr Jaroszynski wrote:
> On 11/14/18 1:22 PM, Michal Hocko wrote:
> > On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
> >> On 11/14/18 3:29 AM, Michal Hocko wrote:
> >>> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
> >>>> On Tue 13-11-18 16:40:59, p.jaroszynski@xxxxxxxxx wrote:
> >>>>> From: Piotr Jaroszynski <pjaroszynski@xxxxxxxxxx>
> >>>>>
> >>>>> migrate_pages() can return the number of pages that failed to migrate
> >>>>> instead of 0 or an error code. If that happens, the positive return is
> >>>>> treated as an error all the way up through the stack, leading to the
> >>>>> move_pages() syscall returning a positive number. I believe this
> >>>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> >>>>> that refactored a lot of this code.
> >>>>
> >>>> Yes, this is correct.
> >>>>
> >>>>> Fix this by treating positive returns as success in
> >>>>> do_move_pages_to_node(), as that seems to most closely follow the
> >>>>> previous code. This still leaves the question whether silently
> >>>>> considering this case a success is the right thing to do, as even the
> >>>>> status of the pages will be set as if they were successfully migrated,
> >>>>> but that seems to have been the case before as well.
> >>>>
> >>>> Yes, I believe the previous semantics were just wrong and we want to
> >>>> fix them. Jan has already brought this up [1]. I believe we want to
> >>>> update the documentation rather than restore the previous hazy
> >>>> semantics.
> >>
> >> That's probably fair, although at least some code we have will have to
> >> be updated, as it just checks for non-zero returns from move_pages()
> >> and assumes errno is set when that happens.
> >
> > Can you tell me more about your use case, please? I definitely do not
> > want to break any existing userspace. Making the syscall return code
> > more reasonable is still attractive. So if the new semantics work
> > better for you, that would be one more argument to keep them this way.
>
> One of our APIs exposes a way to move a VA range to a GPU NUMA node or
> one of the CPU NUMA nodes. The code keeps retrying move_pages() and
> relies on the reported page status to decide whether each page is done,
> needs a retry (EAGAIN or EBUSY) or possibly needs a fallback (ENOMEM).
>
> With the previous behaviour we would get a success, but the page status
> would be reported incorrectly. That's bad, as we skip the migration
> without knowing about it.

Exactly.

> With the current code we get what we interpret as success, as errno is
> 0, but the page status is garbage/untouched. That's also bad.

Agreed.

> The proposed solution adds a new case to handle, but it will just tell
> us the page status is unusable and all we can do is retry blindly. If it
> were possible to plumb through the migration status for each page
> accurately, that would allow us to avoid redoing the call for pages that
> actually worked. Perhaps we would need a special status for pages
> skipped due to errors.

This would be possible, but with this patch applied you should know how
many pages to skip from the tail of the array.
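Roughly, a caller relying on that could look like the untested sketch
below. It assumes the semantics discussed in this thread (a positive
return value counts pages at the tail of the array that were not
migrated and whose status entries were left untouched) plus the per-page
errnos mentioned above; move_range() and the way the results are
consumed are made up for the example:

/*
 * Untested sketch of the retry decision described above. Assumptions
 * (taken from this thread, not from documented behaviour): a positive
 * move_pages() return value means the last 'ret' entries of the page
 * array were not migrated and their status entries are unusable; the
 * remaining status entries hold either the node the page now sits on
 * (>= 0) or a negative errno.
 */
#include <errno.h>
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

static int move_range(void **pages, unsigned long count, int target_node)
{
	int *nodes = malloc(count * sizeof(*nodes));
	int *status = malloc(count * sizeof(*status));
	unsigned long i;
	long ret;

	if (!nodes || !status) {
		ret = -ENOMEM;
		goto out;
	}
	for (i = 0; i < count; i++) {
		nodes[i] = target_node;
		status[i] = -1;
	}

	ret = move_pages(0 /* current process */, count, pages, nodes,
			 status, MPOL_MF_MOVE);
	if (ret < 0) {
		/* hard error, no per-page status to look at */
		ret = -errno;
		goto out;
	}

	for (i = 0; i < count; i++) {
		if (ret > 0 && i >= count - ret)
			printf("page %lu: not attempted, retry later\n", i);
		else if (status[i] == -EAGAIN || status[i] == -EBUSY)
			printf("page %lu: busy, retry later\n", i);
		else if (status[i] == -ENOMEM)
			printf("page %lu: fall back\n", i);
		else if (status[i] < 0)
			printf("page %lu: failed (%d)\n", i, status[i]);
		else
			printf("page %lu: now on node %d\n", i, status[i]);
	}
	ret = 0;
out:
	free(nodes);
	free(status);
	return ret;
}

In practice the caller would presumably keep retrying the not-attempted
tail and the EAGAIN/EBUSY pages until everything settles or a retry
budget runs out.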
> But maybe this is all a tiny corner case, short of the bug I hit (see
> more below), and it's not worth thinking too much about.
>
> >>>> Just wondering, how did you find out? Is there any real application
> >>>> failing because of the change, or is this a result of some test?
> >>
> >> I have a test that creates a tmp file, mmaps it as shared, memsets the
> >> memory and then attempts to move it to a different node. It used to
> >> work, but now fails. I suspect the filesystem's migratepage() callback
> >> regressed and will look into it next. So far I have only tested this
> >> on powerpc with the xfs filesystem.
> >
> > I would be surprised if the rework changed the migration behavior.
>
> It didn't. I tracked it down to the new fs/iomap.c code used by xfs not
> being compatible with migrate_page_move_mapping() and prepared a
> perhaps naive fix in [1].

I am not familiar with the iomap code much, TBH, so I cannot really
judge your fix.
--
Michal Hocko
SUSE Labs
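For reference, the kind of reproducer described above could look roughly
like the untested sketch below. The mount point, page count and target
node are placeholders, and it only dumps the raw return value and the
per-page status rather than judging success:

/*
 * Untested sketch: create a temporary file on the filesystem under test,
 * mmap it MAP_SHARED, fault the pages in with memset(), then ask
 * move_pages() to migrate them to another node and print the results.
 */
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long npages = 16;		/* placeholder */
	int target_node = 1;			/* placeholder */
	size_t len = npages * page_size;
	char path[] = "/mnt/xfs/move-pages-XXXXXX";	/* placeholder mount */
	void *pages[16];
	int nodes[16], status[16];
	unsigned long i;

	int fd = mkstemp(path);
	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("tmp file");
		return 1;
	}

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xab, len);			/* fault everything in */

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = target_node;
		status[i] = -1;
	}

	long ret = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
	printf("move_pages() returned %ld\n", ret);
	for (i = 0; i < npages; i++)
		printf("page %lu: status %d\n", i, status[i]);

	unlink(path);
	return 0;
}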