Re: [PATCH] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.

Michal Hocko <mhocko@xxxxxxxxxx> · Mon, 28 May 2018 14:43:13 +0200

On Fri 25-05-18 20:46:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 25-05-18 19:57:32, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > What is wrong with the folliwing? should_reclaim_retry should be a
> > > > natural reschedule point. PF_WQ_WORKER is a special case which needs a
> > > > stronger rescheduling policy. Doing that unconditionally seems more
> > > > straightforward than depending on a zone being a good candidate for a
> > > > further reclaim.
> > > 
> > > Where is schedule_timeout_uninterruptible(1) for !PF_KTHREAD threads?
> > 
> > Re-read what I've said.
> 
> Please show me as a complete patch. Then, I will test your patch.

So how about we start as simple as the following? If we _really_ need to
touch should_reclaim_retry then it should be done in a separate patch
with some numbers/tracing data backing that story.
---
>From fa3b165c34937bf628f2eee7e6f0483c611167d0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Mon, 28 May 2018 14:30:37 +0200
Subject: [PATCH] mm, oom: remove sleep from under oom_lock

Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have a strong synchronization
between the oom_killer and victim's exiting because both have to take
the oom_lock. Therefore the original heuristic to sleep for a short time
in out_of_memory doesn't serve the original purpose.

Moreover Tetsuo has noticed that the short sleep can be more harmful
than actually useful. Hammering the system with many processes can lead
to a starvation when the task holding the oom_lock can block for a
long time (minutes) and block any further progress because the
oom_reaper depends on the oom_lock as well.

Drop the short sleep from out_of_memory when we hold the lock. Keep the
sleep when the trylock fails to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g. sleep proportional
to the time spent in the active reclaiming etc.) but this is much more
complex thing to achieve. This is a quick fixup to remove a stale code.

Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/oom_kill.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 565e7da55318..7f12e346e477 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1099,7 +1099,6 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
-	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1149,10 +1148,8 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
-		delay = true;
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
 		goto out;
-	}
 
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
@@ -1160,20 +1157,10 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		delay = true;
-	}
-
 out:
-	/*
-	 * Give the killed process a good chance to exit before trying
-	 * to allocate memory again.
-	 */
-	if (delay)
-		schedule_timeout_killable(1);
-
 	return !!oc->chosen_task;
 }
 
-- 
2.17.0


-- 
Michal Hocko
SUSE Labs