Re: [PATCH] mm,page_alloc: softlockup on warn_alloc on

Johannes Weiner <hannes@xxxxxxxxxxx> · Fri, 15 Sep 2017 07:37:32 -0700

On Fri, Sep 15, 2017 at 05:58:49PM +0800, wang Yu wrote:
> From: "yuwang.yuwang" <yuwang.yuwang@xxxxxxxxxxxxxxx>
> 
> I found a softlockup when running some stress testcase in 4.9.x,
> but i think the mainline have the same problem.
> 
> call trace:
> [365724.502896] NMI watchdog: BUG: soft lockup - CPU#31 stuck for 22s!
> [jbd2/sda3-8:1164]

We've started seeing the same thing on 4.11. Tons and tons of
allocation stall warnings followed by the soft lock-ups.

These allocation stalls happen when the allocating task reclaims
successfully yet isn't able to allocate, meaning other threads are
stealing those pages.

Now, it *looks* like something changed recently to make this race
window wider, and there might well be a bug there. But regardless, we
have a real livelock or at least starvation window here, where
reclaimers have their bounty continuously stolen by concurrent allocs;
but instead of recognizing and handling the situation, we flood the
console which in many cases adds fuel to the fire.

When threads cannibalize each other to the point where one of them can
reclaim but not allocate for 10s, it's safe to say we are out of
memory. I think we need something like the below regardless of any
other investigations and fixes into the root cause here.

But Michal, this needs an answer. We don't want to paper over bugs,
but we also cannot continue to ship a kernel that has a known issue
and for which there are mitigation fixes, root-caused or not.

How can we figure out if there is a bug here? Can we time the calls to
__alloc_pages_direct_reclaim() and __alloc_pages_direct_compact() and
drill down from there? Print out the number of times we have retried?
We're counting no_progress_loops, but we are also very much interested
in progress_loops that didn't result in a successful allocation. Too
many of those and I think we want to OOM kill as per above.
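
To make the progress_loops idea concrete, here is a minimal userspace
sketch of the accounting, not kernel code. Everything in it is
hypothetical and for illustration only: the attempt_result and
retry_decision enums, struct retry_stats, account_attempt() and both
thresholds are assumptions, and nothing like them exists in
mm/page_alloc.c. It counts attempts where reclaim made progress but the
allocation still failed, and gives up in favor of the OOM path once
there are too many of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one slowpath iteration; not a kernel type. */
enum attempt_result {
	ATTEMPT_NO_PROGRESS,	/* reclaim freed nothing */
	ATTEMPT_STOLEN,		/* reclaim freed pages, allocation still failed */
	ATTEMPT_SUCCESS,	/* allocation succeeded */
};

/* Hypothetical decision returned to the (imagined) retry loop. */
enum retry_decision {
	DECIDE_RETRY,
	DECIDE_DONE,
	DECIDE_OOM,
};

/* Hypothetical per-allocation counters. */
struct retry_stats {
	unsigned int no_progress_loops;	/* consecutive loops without progress */
	unsigned int progress_loops;	/* loops with progress but no page */
};

/*
 * Account one attempt and decide what to do next. The thresholds are
 * parameters because this sketch makes no claim about sensible values.
 */
enum retry_decision account_attempt(struct retry_stats *st,
				    enum attempt_result res,
				    unsigned int max_no_progress,
				    unsigned int max_progress)
{
	assert(st != NULL);

	switch (res) {
	case ATTEMPT_SUCCESS:
		return DECIDE_DONE;
	case ATTEMPT_NO_PROGRESS:
		st->no_progress_loops++;
		break;
	case ATTEMPT_STOLEN:
		/* In this model, progress resets the no-progress streak ... */
		st->no_progress_loops = 0;
		/* ... but the fruitless progress is remembered. */
		st->progress_loops++;
		break;
	}

	if (st->no_progress_loops > max_no_progress ||
	    st->progress_loops > max_progress)
		return DECIDE_OOM;

	return DECIDE_RETRY;
}
```

With max_progress set to 2, three ATTEMPT_STOLEN results in a row yield
DECIDE_RETRY, DECIDE_RETRY, DECIDE_OOM. The point of a separate counter
is that a task whose pages keep getting stolen never trips a
no-progress limit on its own, because every loop looks like progress.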

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bec5e96f3b88..01736596389a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3830,6 +3830,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
 		stall_timeout += 10 * HZ;
+		goto oom;
 	}
 
 	/* Avoid recursion of direct reclaim */
@@ -3882,6 +3883,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (read_mems_allowed_retry(cpuset_mems_cookie))
 		goto retry_cpuset;
 
+oom:
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
