On Wed 14-11-18 17:04:37, Piotr Jaroszynski wrote:
> On 11/14/18 1:22 PM, Michal Hocko wrote:
> > On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
> >> On 11/14/18 3:29 AM, Michal Hocko wrote:
> >>> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
> >>>> On Tue 13-11-18 16:40:59, p.jaroszynski@xxxxxxxxx wrote:
> >>>>> From: Piotr Jaroszynski <pjaroszynski@xxxxxxxxxx>
> >>>>>
> >>>>> migrate_pages() can return the number of pages that failed to migrate
> >>>>> instead of 0 or an error code. If that happens, the positive return is
> >>>>> treated as an error all the way up through the stack, leading to the
> >>>>> move_pages() syscall returning a positive number. I believe this
> >>>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> >>>>> that refactored a lot of this code.
> >>>>
> >>>> Yes, this is correct.
> >>>>
> >>>>> Fix this by treating positive returns as success in
> >>>>> do_move_pages_to_node(), as that seems to most closely follow the
> >>>>> previous code. This still leaves the question whether silently
> >>>>> considering this case a success is the right thing to do, as even the
> >>>>> status of the pages will be set as if they were successfully migrated,
> >>>>> but that seems to have been the case before as well.
> >>>>
> >>>> Yes, I believe the previous semantics were just wrong and we want to
> >>>> fix them. Jan has already brought this up [1]. I believe we want to
> >>>> update the documentation rather than restore the previous hazy
> >>>> semantics.
> >>
> >> That's probably fair, although at least some code we have will have to
> >> be updated, as it just checks for non-zero returns from move_pages()
> >> and assumes errno is set when that happens.
> >
> > Can you tell me more about your use case, please? I definitely do not
> > want to break any existing userspace. Making the syscall return code
> > more reasonable is still attractive. So if the new semantics work
> > better for you, that would be one more argument to keep them this way.
>
> One of our APIs exposes a way to move a VA range to a GPU NUMA node or
> one of the CPU NUMA nodes. The code keeps retrying move_pages() and
> relies on the reported page status to decide whether each page is done,
> needs a retry (EAGAIN or EBUSY) or possibly needs a fallback (ENOMEM).
>
> With the previous behaviour we would get a success, but the page status
> would be reported incorrectly. That's bad, as we skip the migration
> without knowing about it.

Exactly.

> With the current code we get what we interpret as success, as errno is
> 0, but the page status is garbage/untouched. That's also bad.

Agreed.

> The proposed solution adds a new case to handle, but it will just tell
> us the page status is unusable and all we can do is retry blindly. If it
> were possible to plumb through the migration status for each page
> accurately, that would allow us to avoid redoing the call for pages that
> actually worked. Perhaps we would need a special status for pages
> skipped due to errors.

This would be possible, but with this patch applied you should know how
many pages to skip from the tail of the array.
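Roughly, a caller relying on that could look like the untested sketch
below. It assumes the semantics discussed in this thread (a positive
return value counts pages at the tail of the array that were not
migrated and whose status entries were left untouched) plus the per-page
errnos mentioned above; move_range() and the way the results are
consumed are made up for the example:

/*
 * Untested sketch of the retry decision described above. Assumptions
 * (taken from this thread, not from documented behaviour): a positive
 * move_pages() return value means the last 'ret' entries of the page
 * array were not migrated and their status entries are unusable; the
 * remaining status entries hold either the node the page now sits on
 * (>= 0) or a negative errno.
 */
#include <errno.h>
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

static int move_range(void **pages, unsigned long count, int target_node)
{
	int *nodes = malloc(count * sizeof(*nodes));
	int *status = malloc(count * sizeof(*status));
	unsigned long i;
	long ret;

	if (!nodes || !status) {
		ret = -ENOMEM;
		goto out;
	}
	for (i = 0; i < count; i++) {
		nodes[i] = target_node;
		status[i] = -1;
	}

	ret = move_pages(0 /* current process */, count, pages, nodes,
			 status, MPOL_MF_MOVE);
	if (ret < 0) {
		/* hard error, no per-page status to look at */
		ret = -errno;
		goto out;
	}

	for (i = 0; i < count; i++) {
		if (ret > 0 && i >= count - ret)
			printf("page %lu: not attempted, retry later\n", i);
		else if (status[i] == -EAGAIN || status[i] == -EBUSY)
			printf("page %lu: busy, retry later\n", i);
		else if (status[i] == -ENOMEM)
			printf("page %lu: fall back\n", i);
		else if (status[i] < 0)
			printf("page %lu: failed (%d)\n", i, status[i]);
		else
			printf("page %lu: now on node %d\n", i, status[i]);
	}
	ret = 0;
out:
	free(nodes);
	free(status);
	return ret;
}

In practice the caller would presumably keep retrying the not-attempted
tail and the EAGAIN/EBUSY pages until everything settles or a retry
budget runs out.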
> But maybe this is all a tiny corner case, short of the bug I hit (see
> more below), and it's not worth thinking too much about.
>
> >>>> Just wondering, how did you find out? Is there any real application
> >>>> failing because of the change, or is this a result of some test?
> >>
> >> I have a test that creates a tmp file, mmaps it as shared, memsets the
> >> memory and then attempts to move it to a different node. It used to
> >> work, but now fails. I suspect the filesystem's migratepage() callback
> >> regressed and will look into it next. So far I have only tested this
> >> on powerpc with the xfs filesystem.
> >
> > I would be surprised if the rework changed the migration behavior.
>
> It didn't. I tracked it down to the new fs/iomap.c code used by xfs not
> being compatible with migrate_page_move_mapping() and prepared a
> perhaps naive fix in [1].

I am not familiar with the iomap code much, TBH, so I cannot really
judge your fix.
--
Michal Hocko
SUSE Labs
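For reference, the kind of reproducer described above could look roughly
like the untested sketch below. The mount point, page count and target
node are placeholders, and it only dumps the raw return value and the
per-page status rather than judging success:

/*
 * Untested sketch: create a temporary file on the filesystem under test,
 * mmap it MAP_SHARED, fault the pages in with memset(), then ask
 * move_pages() to migrate them to another node and print the results.
 */
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long npages = 16;		/* placeholder */
	int target_node = 1;			/* placeholder */
	size_t len = npages * page_size;
	char path[] = "/mnt/xfs/move-pages-XXXXXX";	/* placeholder mount */
	void *pages[16];
	int nodes[16], status[16];
	unsigned long i;

	int fd = mkstemp(path);
	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("tmp file");
		return 1;
	}

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xab, len);			/* fault everything in */

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = target_node;
		status[i] = -1;
	}

	long ret = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
	printf("move_pages() returned %ld\n", ret);
	for (i = 0; i < npages; i++)
		printf("page %lu: status %d\n", i, status[i]);

	unlink(path);
	return 0;
}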