
The patch titled
     Subject: mm, memory_hotplug: do not fail offlining too early
has been added to the -mm tree.  Its filename is
     mm-memory_hotplug-do-not-fail-offlining-too-early.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memory_hotplug-do-not-fail-offlining-too-early.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memory_hotplug-do-not-fail-offlining-too-early.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, memory_hotplug: do not fail offlining too early

Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.

While testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail.  The primary reason is that
the retry logic gives up too easily.  We have 4 ways out of the offline:

	- we have a permanent failure (isolation or memory notifiers fail,
	  or hugetlb pages cannot be dropped)
	- userspace sends a signal
	- a hardcoded 120s timeout expires
	- page migration fails 5 times

This is way too convoluted and it doesn't scale very well.  We have seen
both temporary migration failures and the 120s timeout being triggered.
After removing those restrictions we were able to pass stress testing
during memory hot remove without any other negative side effects
observed.  Therefore I suggest dropping both hardcoded policies.  I
couldn't find any specific reason for them in the changelog, nor did I
get any response [1] from Kamezawa.  If we need some upper bound - e.g.
timeout based - then we should have a proper, user-defined policy for
that.  In any case there should be a clear use case when introducing it.

This patch (of 2):

Memory offlining can fail too eagerly under heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on the LRU.  Most
probably this is because it was on the pcp LRU cache, or it has already
been removed from the LRU but hasn't been freed yet.  In both cases the
page doesn't look non-migrable, so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry logic.
We have 5 retries at maximum and a timeout.  We could argue whether the
timeout makes sense, but failing just because of a race when somebody
isolates a page from the LRU or puts it on a pcp LRU list is just wrong.
All it takes is a race with a process which unmaps some pages and removes
them from the LRU list, and we can fail the whole offline because of
something that is a temporary condition and actually not harmful for the
offline.

Please note that unmovable pages should already be excluded during
start_isolate_page_range.  We could argue that has_unmovable_pages is
racy and that the MIGRATE_MOVABLE check doesn't provide any hard
guarantee either, but kernel zones (aka < ZONE_MOVABLE) will very likely
detect unmovable pages in most cases, and the movable zone shouldn't
contain unmovable pages at all.  Some of those pages might be pinned, but
not forever, because that would be a bug on its own.  In any case the
context is still interruptible, so userspace can easily bail out when
the operation takes too long.  This is certainly better behavior than a
hardcoded retry loop which is racy.

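To make the contrast between the two retry policies concrete, here is a rough userspace model.  It is only a sketch and not kernel code: struct offline_ctx, one_pass(), offline_bounded() and offline_until_deadline() are hypothetical names invented for illustration, the clock is a plain counter standing in for jiffies, and the error codes are modelling choices rather than a claim about what __offline_pages returns on each path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct offline_ctx {
	int transient_failures;	/* passes that will still fail before one succeeds */
	long now;		/* model clock, advanced by one per pass */
	bool signal_pending;	/* stands in for signal_pending(current) */
};

/* One migrate-and-check pass: 0 on success, -EAGAIN on a transient failure. */
static int one_pass(struct offline_ctx *ctx)
{
	ctx->now++;
	if (ctx->transient_failures > 0) {
		ctx->transient_failures--;
		return -EAGAIN;
	}
	return 0;
}

/* Bounded policy: give up after retry_max failed passes, however transient. */
static int offline_bounded(struct offline_ctx *ctx, int retry_max)
{
	assert(retry_max > 0);
	for (;;) {
		int ret = one_pass(ctx);

		if (ret == 0)
			return 0;
		if (--retry_max == 0)
			return ret;
	}
}

/* Deadline policy: retry until success, an expired deadline, or a signal. */
static int offline_until_deadline(struct offline_ctx *ctx, long expire)
{
	for (;;) {
		if (ctx->now > expire)	/* models time_after(jiffies, expire) */
			return -EBUSY;
		if (ctx->signal_pending)
			return -EINTR;
		if (one_pass(ctx) == 0)
			return 0;
	}
}
```

With seven transient failures queued up, the bounded variant with a budget of 5 gives up even though the eighth pass would have succeeded, while the deadline variant with a generous deadline succeeds and can still be stopped by a short deadline or a pending signal.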
Fix this by removing the max retry count and relying only on the timeout
or on interruption by a signal from userspace.  Also retry rather than
fail when check_pages_isolated sees some !free pages, because those could
be a result of the race as well.

Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@xxxxxxxxxx
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Cc: Reza Arbab <arbab@xxxxxxxxxxxxxxxxxx>
Cc: Yasuaki Ishimatsu <yasu.isimatu@xxxxxxxxx>
Cc: Xishi Qiu <qiuxishi@xxxxxxxxxx>
Cc: Igor Mammedov <imammedo@xxxxxxxxxx>
Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory_hotplug.c |   40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff -puN mm/memory_hotplug.c~mm-memory_hotplug-do-not-fail-offlining-too-early mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory_hotplug-do-not-fail-offlining-too-early
+++ a/mm/memory_hotplug.c
@@ -1598,7 +1598,7 @@ static int __ref __offline_pages(unsigne
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1635,43 +1635,25 @@ static int __ref __offline_pages(unsigne
 
 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);
 
 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1681,10 +1663,8 @@ repeat:
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
_

Patches currently in -mm which might be from mhocko@xxxxxxxx are

mm-oom_reaper-skip-mm-structs-with-mmu-notifiers.patch
mm-memcg-remove-hotplug-locking-from-try_charge.patch
mm-memory_hotplug-add-scheduling-point-to-__add_pages.patch
mm-page_alloc-add-scheduling-point-to-memmap_init_zone.patch
memremap-add-scheduling-point-to-devm_memremap_pages.patch
mm-memory_hotplug-do-not-back-off-draining-pcp-free-pages-from-kworker-context.patch
mm-memory_hotplug-do-not-fail-offlining-too-early.patch
mm-memory_hotplug-remove-timeout-from-__offline_memory.patch