+ mm-memory_hotplug-do-not-fail-offlining-too-early.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Wed, 27 Sep 2017 15:40:26 -0700

The patch titled
     Subject: mm, memory_hotplug: do not fail offlining too early
has been added to the -mm tree.  Its filename is
     mm-memory_hotplug-do-not-fail-offlining-too-early.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memory_hotplug-do-not-fail-offlining-too-early.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memory_hotplug-do-not-fail-offlining-too-early.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, memory_hotplug: do not fail offlining too early

Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.

While testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail.  The primary reason is that
the retry logic is just too easy to give up.  We have 4 ways out of the
offline

	- we have a permanent failure (isolation or memory notifiers fail,
	  or hugetlb pages cannot be dropped)
	- userspace sends a signal
	- a hardcoded 120s timeout expires
	- page migration fails 5 times

This is way too convoluted and it doesn't scale very well.  We have seen
both temporary migration failures as well as 120s being triggered.  After
removing those restrictions we were able to pass stress testing during
memory hot remove without any other negative side effects observed. 
Therefore I suggest dropping both hard coded policies.  I couldn't have
found any specific reason for them in the changelog.  I neither didn't get
any response [1] from Kamezawa.  If we need some upper bound - e.g. 
timeout based - then we should have a proper and user defined policy for
that.  In any case there should be a clear use case when introducing it. 


This patch (of 2):

Memory offlining can fail too eagerly under heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU.  Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet.  In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry logic. 
We have 5 retries at maximum and a timeout.  We could argue whether the
timeout makes sense but failing just because of a race when somebody
isoltes a page from LRU or puts it on a pcp LRU lists is just wrong.  It
only takes it to race with a process which unmaps some pages and remove
them from the LRU list and we can fail the whole offline because of
something that is a temporary condition and actually not harmful for the
offline.

Please note that unmovable pages should be already excluded during
start_isolate_page_range.  We could argue that has_unmovable_pages is racy
and MIGRATE_MOVABLE check doesn't provide any hard guarantee either but
kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable pages
in most cases and movable zone shouldn't contain unmovable pages at all. 
Some of those pages might be pinned but not for ever because that would be
a bug on its own.  In any case the context is still interruptible and so
the userspace can easily bail out when the operation takes too long.  This
is certainly better behavior than a hardcoded retry loop which is racy.

Fix this by removing the max retry count and only rely on the timeout
resp.  interruption by a signal from the userspace.  Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@xxxxxxxxxx
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Cc: Reza Arbab <arbab@xxxxxxxxxxxxxxxxxx>
Cc: Yasuaki Ishimatsu <yasu.isimatu@xxxxxxxxx>
Cc: Xishi Qiu <qiuxishi@xxxxxxxxxx>
Cc: Igor Mammedov <imammedo@xxxxxxxxxx>
Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory_hotplug.c |   40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff -puN mm/memory_hotplug.c~mm-memory_hotplug-do-not-fail-offlining-too-early mm/memory_hotplug.c

--- a/mm/memory_hotplug.c~mm-memory_hotplug-do-not-fail-offlining-too-early
+++ a/mm/memory_hotplug.c
@@ -1598,7 +1598,7 @@ static int __ref __offline_pages(unsigne
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1635,43 +1635,25 @@ static int __ref __offline_pages(unsigne
 
 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);
 
 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1681,10 +1663,8 @@ repeat:
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
_

Patches currently in -mm which might be from mhocko@xxxxxxxx are

mm-oom_reaper-skip-mm-structs-with-mmu-notifiers.patch
mm-memcg-remove-hotplug-locking-from-try_charge.patch
mm-memory_hotplug-add-scheduling-point-to-__add_pages.patch
mm-page_alloc-add-scheduling-point-to-memmap_init_zone.patch
memremap-add-scheduling-point-to-devm_memremap_pages.patch
mm-memory_hotplug-do-not-back-off-draining-pcp-free-pages-from-kworker-context.patch
mm-memory_hotplug-do-not-fail-offlining-too-early.patch
mm-memory_hotplug-remove-timeout-from-__offline_memory.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html