Re: [PATCH] mm,oom_reaper: Use try_oom_reaper() for reapability test.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu 14-04-16 23:01:41, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 14-04-16 20:34:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > The patch seems correct I just do not see any point in it because I do
> > > > not think it handles any real life situation. I basically consider any
> > > > workload where only _certain_ thread(s) or process(es) sharing the mm have
> > > > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This
> > > > requires root to cripple the system. Or am I missing a valid
> > > > configuration where this would make any sense?
> > > 
> > > Because __oom_reap_task() as of current linux.git marks only one of
> > > thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping
> > > (which I'm utilizing such behavior for catching bugs which occur under
> > > almost OOM situation).
> > 
> > I am not really sure I understand what you mean here. Let me try. You
> > have N tasks sharing the same mm. OOM killer selects one of them and
> > kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom
> > reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will
> > have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was
> > already reaped so there is not much left we can do about it. What now?
> 
> You finally understood what I mean here.

OK, good to know we are on the same page.

> Say, there are TG1 and TG2 sharing the same mm which are marked as
> OOM_SCORE_ADJ_MAX. First round of the OOM killer selects TG1 and sends
> SIGKILL to TG1 and TG2. The OOM reaper reaps memory via TG1 and marks
> TG1 as OOM_SCORE_ADJ_MIN and revokes TIF_MEMDIE from TG1. Then, next
> round of the OOM killer selects TG2 and sends SIGKILL to TG1 and TG2.
> But since TG1 is already marked as OOM_SCORE_ADJ_MIN by the OOM reaper,
> the OOM reaper is not called.

which doesn't matter because this mm has already been reaped and further
attempts are basically deemed to fail as well. This is the reason why I
completely failed to see your point previously. Because it is not the
oom reaper which makes the situation worse. We just never cared about
this possible case.

> This is a situation which the patch you show below will solve.

OK, great.

[...]

> Michal Hocko wrote:
> > On Thu 14-04-16 14:01:06, Michal Hocko wrote:
> > [...]
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 716759e3eaab..d5a4d08f2031 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
> > >  		return OOM_SCAN_CONTINUE;
> > >  
> > >  	/*
> > > +	 * mm of this task has already been reaped so it doesn't make any
> > > +	 * sense to select it as a new oom victim.
> > > +	 */
> > > +	if (test_bit(MMF_OOM_REAPED, &task->mm->flags))
> > > +		return OOM_SCAN_CONTINUE;
> > 
> > This will have to move to oom_badness to where we check for
> > OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child...
> 
> oom_badness() should return 0 if MMF_OOM_REAPED is set (please be careful
> with race task->mm becoming NULL).

This is what I ended up with. Patch below for the reference. I plan to
repost with other 3 posted recently in one series sometimes next week
hopefully.

> But oom_scan_process_thread() should not
> return OOM_SCAN_ABORT if one of threads in TG1 or TG2 still has TIF_MEMDIE
> (because it is possible that one of threads in TG1 or TG2 gets TIF_MEMDIE
> via the fatal_signal_pending(current) shortcut in out_of_memory()).

This would be a separate patch again. I still have to think how to deal
with this case but the most straightforward thing to do would be to simply
disable those shortcuts for crosss process shared mm-s. They are just
too weird and I do not think we want to support all the potential corner
cases and dropping an optimistic heuristic in the name of overal sanity
sounds as a good compromise to me.

---
>From 700037c19c9e713fdf248a687cad7f3d859d4d6d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Thu, 14 Apr 2016 14:02:10 +0200
Subject: [PATCH] mm, oom_reaper: hide oom reaped tasks from OOM killer more
 carefully

36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap
the address space") not only clears TIF_MEMDIE for oom reaped task
but also set OOM_SCORE_ADJ_MIN for the target task to hide it from
the oom killer. This works in simple cases but it is not sufficient
for (unlikely) cases where the mm is shared between independent
processes (as they do not share signal struct). If the mm had only
small amount of memory which could be reaped then another task
sharing the mm could be selected and that wouldn't help to move out from
the oom situation.

Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
flag after __oom_reap_task is done with a task. This will force the
select_bad_process() to ignore all already oom reaped tasks as well as
no such task is sacrificed for its parent.

Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 include/linux/sched.h | 1 +
 mm/oom_kill.c         | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..7bd0fa9db199 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_OOM_REAPED		21	/* mm has been already reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 716759e3eaab..98d38e295883 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -174,8 +174,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	if (!p)
 		return 0;
 
+	/*
+	 * Do not even consider tasks which are explicitly marked oom
+	 * unkillable or have been already oom reaped.
+	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN) {
+	if (adj == OOM_SCORE_ADJ_MIN ||
+			test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
 		task_unlock(p);
 		return 0;
 	}
@@ -513,7 +518,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * This task can be safely ignored because we cannot do much more
 	 * to release its memory.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	test_bit(MMF_OOM_REAPED, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
-- 
2.8.0.rc3

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]