Re: [PATCH] mm,oom: Re-enable OOM killer using timers.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 22 Jan 2016 22:59:10 +0900

David Rientjes wrote:
> On Thu, 21 Jan 2016, Tetsuo Handa wrote:
> 
> > I consider phases for managing system-wide OOM events as follows.
> > 
> >   (1) Design and use a system with appropriate memory capacity in mind.
> > 
> >   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
> >       an OOM victim and allow that victim access to memory reserves by
> >       setting TIF_MEMDIE to it.
> > 
> >   (3) When (2) did not solve the OOM condition, start allowing all tasks
> >       access to memory reserves by your approach.
> > 
> >   (4) When (3) did not solve the OOM condition, start selecting more OOM
> >       victims by my approach.
> > 
> >   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> > 
> 
> This was all mentioned previously, and I suggested that the panic only 
> occur when memory reserves have been depleted, otherwise there is still 
> the potential for the livelock to be solved.  That is a patch that would 
> apply today, before any of this work, since we never want to loop 
> endlessly in the page allocator when memory reserves are fully depleted.
> 
> This is all really quite simple.
> 

So, David is OK with above approach, right?
Then, Michal and Johannes, are you OK with above approach?



What I'm not sure about above approach are handling of !__GFP_NOFAIL &&
!__GFP_FS allocation requests and use of ALLOC_NO_WATERMARKS without
TIF_MEMDIE.

Basically, we want to make small allocation requests success unless
__GFP_NORETRY is given. Currently such allocation requests do not fail
unless TIF_MEMDIE is given by the OOM killer. But how hard do we want to
continue looping when we reach (3) by timeout for waiting for TIF_MEMDIE
task at (2) expires?

Should we give up waiting for TIF_MEMDIE task and make !__GFP_NOFAIL allocation
requests fail (as with OOM condition after oom_killer_disable() is called)?
If our answer is "yes", there is no need to open the memory reserves.
Therefore, I guess our answer is "no".

Now, we open the memory reserves at (3). Since currently !__GFP_NOFAIL &&
!__GFP_FS allocation requests do not call out_of_memory(), current version
of "mm, oom: add global access to memory reserves on livelock" does not
allow such allocation requests access to memory reserves on OOM livelock.

If the cause of OOM livelock is an OOM victim is waiting for a lock which is
held by somebody else which is doing !__GFP_NOFAIL && !__GFP_FS allocation
requests to be released, we will fully deplete memory reserves because only
__GFP_NOFAIL || __GFP_FS allocation requests (e.g. page fault by memory hog
processes) can access memory reserves. To handle this case, what do we want
to do?

Should we allow !__GFP_NOFAIL && !__GFP_FS allocation requests access to
memory reserves by allowing them to call out_of_memory() in order to avoid
needlessly deplete memory reserves? Or, we don't care at all because we
can reach (4) anyway?

----------

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6463426..2299374 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2743,16 +2743,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
----------

Regardless of our answer, we need to decide whether to continue looping, for
there is no guarantee that memory reserves is sufficient to solve OOM livelock.

Should we give up waiting for TIF_MEMDIE task and make !__GFP_NOFAIL allocation
requests fail (as if TIF_MEMDIE was already given by the OOM killer because
we used ALLOC_NO_WATERMARKS)?
If our answer is "yes", there is no need to choose next OOM victim.
Therefore, I guess our answer is "no".

Regardless of our answer, we need to prepare for reaching (4), for it might be
__GFP_NOFAIL allocation request. What is the requirement for choosing next OOM
victim at (4)? An allocating task sees a TIF_MEMDIE task again after
get_page_from_freelist(ALLOC_NO_WATERMARKS) failed after timeout for waiting for
that task at (2) expires? Then, it would kill all tasks immediately because
reaping OOM victim's memory needs some time. We will want to check for another
timeout.



Finally, we will automatically reach (5) after all OOM-killable tasks are
chosen as OOM victims at (4). Here is just an idea for (5). If we change
the OOM killer not to call panic() when there is no more OOM-killable tasks,
it will allow us not to give up immediately after TIF_MEMDIE was given by the
OOM killer. This will increase possibility of making small allocation requests
by TIF_MEMDIE tasks success, for it is almost impossibly unlikely case that
we reach (5).

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6ebc0351..de22c44 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -884,8 +884,10 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	p = select_bad_process(oc, &points, totalpages);
-	/* Found nothing?!?! Either we hang forever, or we panic. */
+	/* Found nothing?!?! Either we fail the allocation, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
+		if (!(oc->gfp_mask & __GFP_NOFAIL))
+			return false;
 		dump_header(oc, NULL, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6463426..798fd68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3181,10 +3171,6 @@ retry:
 		goto nopage;
 	}
 
-	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
-		goto nopage;
-
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>