Re: [PATCH] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 01 Jun 2018 10:21:13 +0900

Michal Hocko wrote:
> On Fri 01-06-18 00:23:57, Tetsuo Handa wrote:
> > On 2018/05/31 19:44, Michal Hocko wrote:
> > > On Thu 31-05-18 19:10:48, Tetsuo Handa wrote:
> > >> On 2018/05/30 8:07, Andrew Morton wrote:
> > >>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > >>>
> > >>>>> I suggest applying
> > >>>>> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> > >>>>
> > >>>> Well, I hope the whole pile gets merged in the upcoming merge window
> > >>>> rather than stall even more.
> > >>>
> > >>> I'm more inclined to drop it all.  David has identified significant
> > >>> shortcomings and I'm not seeing a way of addressing those shortcomings
> > >>> in a backward-compatible fashion.  Therefore there is no way forward
> > >>> at present.
> > >>>
> > >>
> > >> Can we apply my patch as-is first?
> > > 
> > > No. As already explained before. Sprinkling new sleeps without a strong
> > > reason is not acceptable. The issue you are seeing is pretty artificial
> > > and as such doesn're really warrant an immediate fix. We should rather
> > > go with a well thought trhough fix. In other words we should simply drop
> > > the sleep inside the oom_lock for starter unless it causes some really
> > > unexpected behavior change.
> > > 
> > 
> > The OOM killer did not require schedule_timeout_killable(1) to return
> > as long as the OOM victim can call __mmput(). But now the OOM killer
> > requires schedule_timeout_killable(1) to return in order to allow the
> > OOM victim to call __oom_reap_task_mm(). Thus, this is a regression.
> > 
> > Artificial cannot become the reason to postpone my patch. If we don't care
> > artificialness/maliciousness, we won't need to care Spectre/Meltdown bugs.
> > 
> > I'm not sprinkling new sleeps. I'm just merging existing sleeps (i.e.
> > mutex_trylock() case and !mutex_trylock() case) and updating the outdated
> > comments.
> 
> Sigh. So what exactly is wrong with going simple and do
> http://lkml.kernel.org/r/20180528124313.GC27180@xxxxxxxxxxxxxx ?
> 

Because

  (1) You are trying to apply this fix after Roman's patchset which
      Andrew Morton is more inclined to drop.

  (2) You are making this fix difficult to backport because this
      patch depends on Roman's patchset.

  (3) You are not fixing the bug in Roman's patchset.

  (4) You are not updating the outdated comment in my patch and
      Roman's patchset.

  (5) You are not evaluating the side effect of not sleeping
      outside of the OOM path, despite you said

      > If we _really_ need to touch should_reclaim_retry then
      > it should be done in a separate patch with some numbers/tracing
      > data backing that story.

      and I consider that "whether moving the short sleep to
      should_reclaim_retry() has negative impact" and "whether
      eliminating the short sleep has negative impact" should be
      evaluated in a separate patch.

but I will tolerate below patch if you can accept below patch "as-is"
(because it explicitly notes what actually happens and there might be
unexpected side effect of not sleeping outside of the OOM path).



>From 4f8dbcd2a7edde2dddf4b1bd364bb2fa930b0819 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 1 Jun 2018 09:35:01 +0900
Subject: [PATCH v2] mm, oom: remove sleep from under oom_lock

Tetsuo has pointed out that since commit 27ae357fa82be5ab ("mm, oom: fix
concurrent munlock and oom reaper unmap, v3") we have a strong
synchronization between the oom_killer and victim's exiting because both
have to take the oom_lock. Therefore the original heuristic to sleep for
a short time in out_of_memory() doesn't serve the original purpose.

Moreover Tetsuo has noticed that the short sleep can be more harmful
than actually useful. Hammering the system with a few dozens of processes
can block the task holding the oom_lock `forever', and the system gets
completely locked-up because neither the OOM victims nor the OOM reaper
can make any progress.

Drop the short sleep from out_of_memory() when we hold the lock (without
ever examining the side effect of not sleeping without the lock). But keep
the short sleep when the mutex_trylock() fails in order to throttle the
concurrent OOM paths a bit. This should be solved in a more reasonable way
(e.g. sleep proportional to the time spent in the active reclaiming etc.)
but this is much more complex thing to achieve. This is a quick fixup to
remove a stale code.

Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/oom_kill.c | 58 ++++++++++++++++++++++++----------------------------------
 1 file changed, 24 insertions(+), 34 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 565e7da..7940c53 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -357,7 +357,8 @@ int oom_evaluate_task(struct task_struct *task, void *arg)
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to
+ * INFLIGHT_VICTIM.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -499,6 +500,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                               unmap_page_range() # frees some memory
+ *                               set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, the callers hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -543,20 +559,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1099,7 +1101,6 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
-	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1149,32 +1150,21 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
-		delay = true;
-		goto out;
-	}
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
 
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task) {
+		if (is_sysrq_oom(oc) || is_memcg_oom(oc))
+			return false;
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+	if (oc->chosen_task != INFLIGHT_VICTIM)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		delay = true;
-	}
-
-out:
-	/*
-	 * Give the killed process a good chance to exit before trying
-	 * to allocate memory again.
-	 */
-	if (delay)
-		schedule_timeout_killable(1);
-
-	return !!oc->chosen_task;
+	return true;
 }
 
 /*
-- 
1.8.3.1