On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote:
> On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote:
> > On Mon 22-12-14 07:42:49, Dave Chinner wrote:
> > [...]
> > > "memory reclaim gave up"? So why the hell isn't it returning a
> > > failure to the caller?
> > >
> > > i.e. We have a perfectly good page cache allocation failure error
> > > path here all the way back to userspace, but we're invoking the
> > > OOM-killer to kill random processes rather than returning ENOMEM to
> > > the processes that are generating the memory demand?
> > >
> > > Further: when did the oom-killer become the primary method
> > > of handling situations when memory allocation needs to fail?
> > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> > > __GFP_NOFAIL means. And none of the page cache allocations use
> > > __GFP_NOFAIL, so why aren't we getting an allocation failure before
> > > the oom-killer is kicked?
> >
> > Well, it has been an unwritten rule that GFP_KERNEL allocations of
> > low order (<= PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long-ago
> > decision which would be tricky to fix now without silently breaking a
> > lot of code. Sad...
>
> Wow.
>
> We have *always* been told memory allocations are not guaranteed to
> succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and
> nobody is allowed to use it any more.
>
> Lots of code has dependencies on memory allocation making progress
> or failing for the system to work in low memory situations. The page
> cache is one of them, which means all filesystems have that
> dependency. We don't explicitly ask memory allocations to fail, we
> *expect* that memory allocation failures will occur in low memory
> conditions. We've been designing and writing code with this in mind
> for the past 15 years.
>
> How did we get so far away from the message of "the memory allocator
> never guarantees success" that it will never fail to allocate memory
> even if it means we livelock the entire system?

I think this isn't so much an allocation guarantee as it is based on
the thought that once we can't satisfy such low orders anymore, the
system is so entirely unusable that the only remaining thing to do is
to kill processes one by one until the situation is resolved.  Hard to
say, though, because this behavior predates the initial git import of
the tree and was never explained in a code comment.  And yes, it's
flawed, because the allocating task's looping might itself be what's
holding up progress, as we can see here.

> > Nevertheless the caller can prevent an endless loop by using
> > __GFP_NORETRY, so this could be used as a workaround.
>
> That's just a never-ending game of whack-a-mole that we will
> continually lose. It's not a workable solution.

Agreed.

> > The default should be the opposite IMO, and only those who really
> > require some guarantee should use a special flag for that purpose.
>
> Yup, totally agree.

So how about something like the following change?  It restricts the
allocator's endless OOM-killing loop to __GFP_NOFAIL contexts, which
are annotated at the callsite and thus easier to review for held locks
etc.  Otherwise, the allocator retries only as long as page reclaim
makes progress, the idea being that failures are handled gracefully at
the callsites, and page faults restart automatically anyway.  The OOM
killing in that case is deferred to the end of the exception handler.
Preliminary testing confirms that the system indeed tries just as hard
before OOM killing in the page fault case.
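To make the callsite expectation concrete, here is a minimal sketch of
what a prepared callsite could look like (hypothetical code, not from
the tree; example_alloc_buffer is a made-up name, and __GFP_NORETRY is
the interim per-callsite opt-out mentioned above):

#include <linux/gfp.h>
#include <linux/errno.h>

/*
 * Hypothetical sketch: a callsite that is prepared for a small
 * allocation to fail and hands -ENOMEM back to its caller instead
 * of relying on the allocator to loop and OOM kill on its behalf.
 */
static int example_alloc_buffer(struct page **pagep)
{
        struct page *page;

        page = alloc_page(GFP_KERNEL | __GFP_NORETRY);
        if (!page)
                return -ENOMEM; /* propagate; don't loop, don't OOM kill */

        *pagep = page;
        return 0;
}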
However, it doesn't look like all callsites are prepared for failing
smaller allocations:

[ 55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child
[ 55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB
[ 55.571083] BUG: unable to handle kernel paging request at 00000000004006bd
[ 55.578156] IP: [<00000000004006bd>] 0x4006bd
[ 55.582584] PGD c8f3f067 PUD c8f48067 PMD c8f15067 PTE 0
[ 55.588016] Oops: 0014 [#1] SMP
[ 55.591337] CPU: 1 PID: 240 Comm: anonstress Not tainted 3.18.0-mm1-00081-gf6137925fc97-dirty #188
[ 55.600435] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012
[ 55.610030] task: ffff8802139b9a10 ti: ffff8800c8f64000 task.ti: ffff8800c8f64000
[ 55.617623] RIP: 0033:[<00000000004006bd>] [<00000000004006bd>] 0x4006bd
[ 55.624512] RSP: 002b:00007fffd43b7220 EFLAGS: 00010206
[ 55.629901] RAX: 00007f87e6e01000 RBX: 0000000000000000 RCX: 00007f87f64fe25a
[ 55.637104] RDX: 00007f879881a000 RSI: 000000005dc00000 RDI: 0000000000000000
[ 55.644331] RBP: 00007fffd43b7240 R08: 00000000ffffffff R09: 0000000000000000
[ 55.651569] R10: 0000000000000022 R11: 0000000000000283 R12: 0000000000400570
[ 55.658796] R13: 00007fffd43b7340 R14: 0000000000000000 R15: 0000000000000000
[ 55.666040] FS:  00007f87f69d1700(0000) GS:ffff88021f280000(0000) knlGS:0000000000000000
[ 55.674221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.680055] CR2: 00007fdd676ad480 CR3: 00000000c8f3e000 CR4: 00000000000407e0
[ 55.687272]
[ 55.688780] RIP  [<00000000004006bd>] 0x4006bd
[ 55.693304]  RSP <00007fffd43b7220>
[ 55.696850] CR2: 00000000004006bd
[ 55.700207] ---[ end trace b9cb4f44f8e47bc3 ]---
[ 55.704903] Kernel panic - not syncing: Fatal exception
[ 55.710208] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 55.720517] Rebooting in 30 seconds..

Obvious bugs aside, though, the thought of failing order-0 allocations
after such a long time is scary...

---

>From 0b204ee379aa5502a1c4dce5df51de96448b5163 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Mon, 22 Dec 2014 17:16:43 -0500
Subject: [patch] mm: page_alloc: avoid page allocation vs. OOM killing deadlock

The page allocator by default never gives up on allocations up to
order 3; instead, it keeps the allocating task in a loop that runs
direct reclaim and invokes the OOM killer.

The assumed reason behind this decade-old behavior is that the system
is unusable once orders of such small size start to fail, and the
allocator might as well keep killing processes one by one until the
situation is rectified.

However, the allocating task itself might be holding locks that the
OOM victim needs in order to exit, and, to preserve the emergency
memory reserves, the OOM killer doesn't move on to the next victim
until the first choice has exited.  The result is a deadlock between
the task that is trying to allocate and the OOM victim, which can't
be resolved without a third party exiting or volunteering
unreclaimable memory.

Do not automatically retry <= order-3 allocations after reclaim stops
making forward progress.  Callsites should be prepared to handle
allocation failures, and OOM kill-looping inside the allocator is
restricted to __GFP_NOFAIL allocations.  In the case of page faults,
it's preferable to fail the inner allocation and retry the entire
fault, where the OOM killer is invoked at the end of the exception
handler while the task doesn't hold any locks.
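For reference, the page fault side already has roughly the right shape
for this (a simplified sketch, not a literal copy of the arch fault
handlers; example_handle_fault is a made-up name):

#include <linux/mm.h>
#include <linux/oom.h>

/*
 * Simplified sketch: an allocation failure inside handle_mm_fault()
 * surfaces as VM_FAULT_OOM, the handler drops mmap_sem, and only then
 * is the OOM killer invoked; returning to userspace retries the fault.
 */
static void example_handle_fault(struct mm_struct *mm,
                                 struct vm_area_struct *vma,
                                 unsigned long address, unsigned int flags)
{
        int fault;

        fault = handle_mm_fault(mm, vma, address, flags);
        if (unlikely(fault & VM_FAULT_OOM)) {
                up_read(&mm->mmap_sem);    /* no locks held across OOM */
                pagefault_out_of_memory(); /* deferred OOM kill */
                return;                    /* userspace restarts the fault */
        }
        /* ... handle the other fault results ... */
        up_read(&mm->mmap_sem);
}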
Not-yet-signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/page_alloc.c | 45 +++++++++++++--------------------------------
 1 file changed, 13 insertions(+), 32 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8f5997c557c..b2054ca2bc74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2287,9 +2287,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 }
 
 static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
-                   unsigned long did_some_progress,
-                   unsigned long pages_reclaimed)
+should_alloc_retry(gfp_t gfp_mask, unsigned long did_some_progress)
 {
         /* Do not loop if specifically requested */
         if (gfp_mask & __GFP_NORETRY)
@@ -2299,30 +2297,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
         if (gfp_mask & __GFP_NOFAIL)
                 return 1;
 
-        /*
-         * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
-         * making forward progress without invoking OOM. Suspend also disables
-         * storage devices so kswapd will not help. Bail if we are suspending.
-         */
-        if (!did_some_progress && pm_suspended_storage())
-                return 0;
-
-        /*
-         * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-         * means __GFP_NOFAIL, but that may not be true in other
-         * implementations.
-         */
-        if (order <= PAGE_ALLOC_COSTLY_ORDER)
-                return 1;
-
-        /*
-         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-         * specified, then we retry until we no longer reclaim any pages
-         * (above), or we've reclaimed an order of pages at least as
-         * large as the allocation's order. In both cases, if the
-         * allocation still fails, we stop retrying.
-         */
-        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+        /* Retry as long as reclaim is making progress */
+        if (did_some_progress)
                 return 1;
 
         return 0;
@@ -2644,7 +2620,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
         const gfp_t wait = gfp_mask & __GFP_WAIT;
         struct page *page = NULL;
         int alloc_flags;
-        unsigned long pages_reclaimed = 0;
         unsigned long did_some_progress;
         enum migrate_mode migration_mode = MIGRATE_ASYNC;
         bool deferred_compaction = false;
@@ -2803,9 +2778,7 @@ retry:
                 goto got_pg;
 
         /* Check if we should retry the allocation */
-        pages_reclaimed += did_some_progress;
-        if (should_alloc_retry(gfp_mask, order, did_some_progress,
-                                                pages_reclaimed)) {
+        if (should_alloc_retry(gfp_mask, did_some_progress)) {
                 /*
                  * If we fail to make progress by freeing individual
                  * pages, but the allocation wants us to keep going,
@@ -2841,7 +2814,15 @@ retry:
         }
 
 nopage:
-        warn_alloc_failed(gfp_mask, order, NULL);
+        /*
+         * XXX: Warn when initially failing allocations, but don't
+         * bother when the OOM killer is already running, it makes
+         * console output unreadable.
+         */
+        if (oom_zonelist_trylock(zonelist, gfp_mask)) {
+                warn_alloc_failed(gfp_mask, order, NULL);
+                oom_zonelist_unlock(zonelist, gfp_mask);
+        }
         return page;
 got_pg:
         if (kmemcheck_enabled)
-- 
2.2.0