On Mon 01-07-19 16:04:34, Michal Hocko wrote:
> On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> > On 2019/07/01 22:48, Michal Hocko wrote:
> > > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> > >> On 2019/07/01 22:17, Michal Hocko wrote:
> > >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> > >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> > >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> > >>>
> > >>> If the process is an oom victim then _all_ threads are so as well
> > >>> because that is the address space property. And we already do check that
> > >>> before reaching oom_badness IIRC. So what is the actual problem you are
> > >>> trying to solve here?
> > >>
> > >> I'm talking about behavioral change after tsk became an OOM victim.
> > >>
> > >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> > >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> > >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> > >> false because mempolicy_nodemask_intersects() returned false because all thread's
> > >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> > >
> > > OK, I finally got your point. It was not clear that you are referring to
> > > the code _after_ the patch you are proposing. You are indeed right that
> > > this would have a side effect that an additional victim could be
> > > selected even though the current process hasn't terminated yet. Sigh,
> > > another example how the whole thing is subtle so I retract my Ack and
> > > request a real life example of where this matters before we think about
> > > a proper fix and make the code even more complex.
> >
> > Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> > do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> > prevent exited threads from setting mempolicy = NULL (and confusing
> > mempolicy_nodemask_intersects() due to mempolicy == NULL).
>
> I am sorry but I would have to study it much more and I am not convinced
> the time spent on it would be well spent.

Thinking about it some more, it seems that we can go with your original
fix if we also reorder oom_evaluate_task:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f719b64741d6..e5feb0f72e3b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;
 
-	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
+	if (oom_unkillable_task(task, NULL, oc->nodemask))
+		goto next;
+
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.

I do not see any strong reason to keep the current ordering. The OOM
victim check is trivial, so it shouldn't add visible overhead for the
few unkillable tasks that we might encounter.
-- 
Michal Hocko
SUSE Labs