On 23.05.2018 09:35, Oscar Salvador wrote: > Hi, > > This is something I spotted while testing offlining memory. > > __offline_pages() calls do_migrate_range() to try to migrate a range, > but we do not actually check for the error code. > This, besides of ignoring underlying failures, can led to a situations > where we never break up the loop because we are totally unaware of > what is going on. > > They way I spotted this was when trying to offline all memblocks belonging > to a node. > Due to an unfortunate setting with movablecore, memblocks containing bootmem > memory (pages marked by get_page_bootmem()) ended up marked in zone_movable. > So while trying to remove that memory, the system failed in: > > do_migrate_range() > { > ... > if (PageLRU(page)) > ret = isolate_lru_page(page); > else > ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE); > > if (!ret) > // success: do something > else > if (page_count(page)) > ret = -EBUSY; > ... > } > > Since the pages from bootmem are not LRU, we call isolate_movable_page() > but we fail when checking for __PageMovable(). > Since the page_count is more than 0 we return -EBUSY, but we do not check this > in our caller, so we keep trying to migrate this memory over and over: > > repeat: > ... > pfn = scan_movable_pages(start_pfn, end_pfn); > if (pfn) { /* We have movable pages */ > ret = do_migrate_range(pfn, end_pfn); > goto repeat; > } > > But this is not only situation where we can get stuck. > For example, if we fail with -ENOMEM in > migrate_pages()->unmap_and_move()/unmap_and_move_huge_page(), we will keep trying as well. > I think we should really detect these cases and fail with "goto failed_removal". > Something like > > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1651,6 +1651,11 @@ static int __ref __offline_pages(unsigned long start_pfn, > pfn = scan_movable_pages(start_pfn, end_pfn); > if (pfn) { /* We have movable pages */ > ret = do_migrate_range(pfn, end_pfn); > + if (ret) { > + if (ret != -ENOMEM) > + ret = -EBUSY; > + goto failed_removal; > + } > goto repeat; > } > > Now, unless I overlooked something > migrate_pages()->unmap_and_move()/unmap_and_move_huge_page() can return: > -ENOMEM > -EAGAIN > -EBUSY > -ENOSYS. > > I am not sure if we should differentiate betweeen those errors. > For example, it is possible that in migrate_pages() we just get -EAGAIN, > and we return the number of "retry" we tried without having really failed. > Although, since we do 10 passes it might be considered as failed. > > And I am not sure either if we want to propagate the error codes, or in case we fail > in migrate_pages(), whatever the error was (-ENOMEM, -EBUSY, etc.), we > just return -EBUSY. > > What do you think? Hi, While working on onlining/offlining of 4MB subsections I also stumbled over the return value of offline_pages(). It would be nice if the interface could actually indicate if an error is permanent or only temporary. For now I have to live with the assumption, that whenever this function is not -EAGAIN or 0, that I simply have to retry later. David > > Thanks > Oscar Salvador > -- Thanks, David / dhildenb