On Tue, Dec 23, 2014 at 08:30:58AM +1100, Dave Chinner wrote:
> On Mon, Dec 22, 2014 at 05:57:36PM +0100, Michal Hocko wrote:
> > On Mon 22-12-14 07:42:49, Dave Chinner wrote:
> > [...]
> > > "memory reclaim gave up"? So why the hell isn't it returning a
> > > failure to the caller?
> > >
> > > i.e. We have a perfectly good page cache allocation failure error
> > > path here all the way back to userspace, but we're invoking the
> > > OOM-killer to kill random processes rather than returning ENOMEM to
> > > the processes that are generating the memory demand?
> > >
> > > Further: when did the oom-killer become the primary method
> > > of handling situations when memory allocation needs to fail?
> > > __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> > > __GFP_NOFAIL means. And none of the page cache allocations use
> > > __GFP_NOFAIL, so why aren't we getting an allocation failure before
> > > the oom-killer is kicked?
> >
> > Well, it has been an unwritten rule that GFP_KERNEL allocations of
> > low order (<= PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long-ago
> > decision which would be tricky to fix now without silently breaking a
> > lot of code. Sad...
>
> Wow.
>
> We have *always* been told memory allocations are not guaranteed to
> succeed, ever, unless __GFP_NOFAIL is set, but that's deprecated and
> nobody is allowed to use it any more.
>
> Lots of code has dependencies on memory allocation making progress
> or failing for the system to work in low memory situations. The page
> cache is one of them, which means all filesystems have that
> dependency. We don't explicitly ask memory allocations to fail, we
> *expect* that memory allocation failures will occur in low memory
> conditions. We've been designing and writing code with this in mind
> for the past 15 years.
>
> How did we get so far away from the message of "the memory allocator
> never guarantees success" that it will never fail to allocate memory
> even if it means we livelock the entire system?

I think this isn't so much an allocation guarantee as it is based on
the thought that once we can't satisfy such low orders anymore, the
system is so entirely unusable that the only remaining thing to do is
to kill processes one by one until the situation is resolved.  Hard to
say, though, because this behavior predates the initial git import of
the tree and was never explained in a code comment.  And yes, it's
flawed, because the allocating task's looping might itself be what's
holding up progress, as we can see here.

> > Nevertheless the caller can prevent an endless loop by using
> > __GFP_NORETRY, so this could be used as a workaround.
>
> That's just a never-ending game of whack-a-mole that we will
> continually lose. It's not a workable solution.

Agreed.

> > The default should be the opposite IMO, and only those who really
> > require some guarantee should use a special flag for that purpose.
>
> Yup, totally agree.

So how about something like the following change?  It restricts the
allocator's endless OOM-killing loop to __GFP_NOFAIL contexts, which
are annotated at the callsite and thus easier to review for held locks
etc.  Otherwise, the allocator retries only as long as page reclaim
makes progress, the idea being that failures are handled gracefully at
the callsites, and page faults restart automatically anyway.  The OOM
killing in that case is deferred to the end of the exception handler.
Preliminary testing confirms that the system indeed tries just as hard
before OOM killing in the page fault case.
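To make the callsite expectation concrete, here is a minimal sketch of
what a prepared callsite could look like (hypothetical code, not from
the tree; example_alloc_buffer is a made-up name, and __GFP_NORETRY is
the interim per-callsite opt-out mentioned above):

#include <linux/gfp.h>
#include <linux/errno.h>

/*
 * Hypothetical sketch: a callsite that is prepared for a small
 * allocation to fail and hands -ENOMEM back to its caller instead
 * of relying on the allocator to loop and OOM kill on its behalf.
 */
static int example_alloc_buffer(struct page **pagep)
{
        struct page *page;

        page = alloc_page(GFP_KERNEL | __GFP_NORETRY);
        if (!page)
                return -ENOMEM; /* propagate; don't loop, don't OOM kill */

        *pagep = page;
        return 0;
}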
However, it doesn't look like all callsites are prepared for failing
smaller allocations:

[ 55.553822] Out of memory: Kill process 240 (anonstress) score 158 or sacrifice child
[ 55.561787] Killed process 240 (anonstress) total-vm:1540044kB, anon-rss:1284068kB, file-rss:468kB
[ 55.571083] BUG: unable to handle kernel paging request at 00000000004006bd
[ 55.578156] IP: [<00000000004006bd>] 0x4006bd
[ 55.582584] PGD c8f3f067 PUD c8f48067 PMD c8f15067 PTE 0
[ 55.588016] Oops: 0014 [#1] SMP
[ 55.591337] CPU: 1 PID: 240 Comm: anonstress Not tainted 3.18.0-mm1-00081-gf6137925fc97-dirty #188
[ 55.600435] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS, BIOS P1.30 05/10/2012
[ 55.610030] task: ffff8802139b9a10 ti: ffff8800c8f64000 task.ti: ffff8800c8f64000
[ 55.617623] RIP: 0033:[<00000000004006bd>] [<00000000004006bd>] 0x4006bd
[ 55.624512] RSP: 002b:00007fffd43b7220 EFLAGS: 00010206
[ 55.629901] RAX: 00007f87e6e01000 RBX: 0000000000000000 RCX: 00007f87f64fe25a
[ 55.637104] RDX: 00007f879881a000 RSI: 000000005dc00000 RDI: 0000000000000000
[ 55.644331] RBP: 00007fffd43b7240 R08: 00000000ffffffff R09: 0000000000000000
[ 55.651569] R10: 0000000000000022 R11: 0000000000000283 R12: 0000000000400570
[ 55.658796] R13: 00007fffd43b7340 R14: 0000000000000000 R15: 0000000000000000
[ 55.666040] FS:  00007f87f69d1700(0000) GS:ffff88021f280000(0000) knlGS:0000000000000000
[ 55.674221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.680055] CR2: 00007fdd676ad480 CR3: 00000000c8f3e000 CR4: 00000000000407e0
[ 55.687272]
[ 55.688780] RIP  [<00000000004006bd>] 0x4006bd
[ 55.693304]  RSP <00007fffd43b7220>
[ 55.696850] CR2: 00000000004006bd
[ 55.700207] ---[ end trace b9cb4f44f8e47bc3 ]---
[ 55.704903] Kernel panic - not syncing: Fatal exception
[ 55.710208] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 55.720517] Rebooting in 30 seconds..

Obvious bugs aside, though, the thought of failing order-0 allocations
after such a long time is scary...

---

>From 0b204ee379aa5502a1c4dce5df51de96448b5163 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Mon, 22 Dec 2014 17:16:43 -0500
Subject: [patch] mm: page_alloc: avoid page allocation vs. OOM killing deadlock

The page allocator by default never gives up on allocations up to
order 3; instead, it keeps the allocating task in a loop that runs
direct reclaim and invokes the OOM killer.

The assumed reason behind this decade-old behavior is that the system
is unusable once orders of such small size start to fail, and the
allocator might as well keep killing processes one by one until the
situation is rectified.

However, the allocating task itself might be holding locks that the
OOM victim needs in order to exit, and, to preserve the emergency
memory reserves, the OOM killer doesn't move on to the next victim
until the first choice has exited.  The result is a deadlock between
the task that is trying to allocate and the OOM victim, which can't
be resolved without a third party exiting or volunteering
unreclaimable memory.

Do not automatically retry <= order-3 allocations after reclaim stops
making forward progress.  Callsites should be prepared to handle
allocation failures, and OOM kill-looping inside the allocator is
restricted to __GFP_NOFAIL allocations.  In the case of page faults,
it's preferable to fail the inner allocation and retry the entire
fault, where the OOM killer is invoked at the end of the exception
handler while the task doesn't hold any locks.
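For reference, the page fault side already has roughly the right shape
for this (a simplified sketch, not a literal copy of the arch fault
handlers; example_handle_fault is a made-up name):

#include <linux/mm.h>
#include <linux/oom.h>

/*
 * Simplified sketch: an allocation failure inside handle_mm_fault()
 * surfaces as VM_FAULT_OOM, the handler drops mmap_sem, and only then
 * is the OOM killer invoked; returning to userspace retries the fault.
 */
static void example_handle_fault(struct mm_struct *mm,
                                 struct vm_area_struct *vma,
                                 unsigned long address, unsigned int flags)
{
        int fault;

        fault = handle_mm_fault(mm, vma, address, flags);
        if (unlikely(fault & VM_FAULT_OOM)) {
                up_read(&mm->mmap_sem);    /* no locks held across OOM */
                pagefault_out_of_memory(); /* deferred OOM kill */
                return;                    /* userspace restarts the fault */
        }
        /* ... handle the other fault results ... */
        up_read(&mm->mmap_sem);
}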
Not-yet-signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/page_alloc.c | 45 +++++++++++++--------------------------------
 1 file changed, 13 insertions(+), 32 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8f5997c557c..b2054ca2bc74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2287,9 +2287,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 }
 
 static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
-                   unsigned long did_some_progress,
-                   unsigned long pages_reclaimed)
+should_alloc_retry(gfp_t gfp_mask, unsigned long did_some_progress)
 {
         /* Do not loop if specifically requested */
         if (gfp_mask & __GFP_NORETRY)
@@ -2299,30 +2297,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
         if (gfp_mask & __GFP_NOFAIL)
                 return 1;
 
-        /*
-         * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
-         * making forward progress without invoking OOM. Suspend also disables
-         * storage devices so kswapd will not help. Bail if we are suspending.
-         */
-        if (!did_some_progress && pm_suspended_storage())
-                return 0;
-
-        /*
-         * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-         * means __GFP_NOFAIL, but that may not be true in other
-         * implementations.
-         */
-        if (order <= PAGE_ALLOC_COSTLY_ORDER)
-                return 1;
-
-        /*
-         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-         * specified, then we retry until we no longer reclaim any pages
-         * (above), or we've reclaimed an order of pages at least as
-         * large as the allocation's order. In both cases, if the
-         * allocation still fails, we stop retrying.
-         */
-        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+        /* Retry as long as reclaim is making progress */
+        if (did_some_progress)
                 return 1;
 
         return 0;
@@ -2644,7 +2620,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
         const gfp_t wait = gfp_mask & __GFP_WAIT;
         struct page *page = NULL;
         int alloc_flags;
-        unsigned long pages_reclaimed = 0;
         unsigned long did_some_progress;
         enum migrate_mode migration_mode = MIGRATE_ASYNC;
         bool deferred_compaction = false;
@@ -2803,9 +2778,7 @@ retry:
                 goto got_pg;
 
         /* Check if we should retry the allocation */
-        pages_reclaimed += did_some_progress;
-        if (should_alloc_retry(gfp_mask, order, did_some_progress,
-                                                pages_reclaimed)) {
+        if (should_alloc_retry(gfp_mask, did_some_progress)) {
                 /*
                  * If we fail to make progress by freeing individual
                  * pages, but the allocation wants us to keep going,
@@ -2841,7 +2814,15 @@ retry:
         }
 
 nopage:
-        warn_alloc_failed(gfp_mask, order, NULL);
+        /*
+         * XXX: Warn when initially failing allocations, but don't
+         * bother when the OOM killer is already running, it makes
+         * console output unreadable.
+         */
+        if (oom_zonelist_trylock(zonelist, gfp_mask)) {
+                warn_alloc_failed(gfp_mask, order, NULL);
+                oom_zonelist_unlock(zonelist, gfp_mask);
+        }
         return page;
 got_pg:
         if (kmemcheck_enabled)
-- 
2.2.0