On Thu 14-04-16 23:01:41, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 14-04-16 20:34:18, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > [...] > > > > The patch seems correct I just do not see any point in it because I do > > > > not think it handles any real life situation. I basically consider any > > > > workload where only _certain_ thread(s) or process(es) sharing the mm have > > > > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This > > > > requires root to cripple the system. Or am I missing a valid > > > > configuration where this would make any sense? > > > > > > Because __oom_reap_task() as of current linux.git marks only one of > > > thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping > > > (which I'm utilizing such behavior for catching bugs which occur under > > > almost OOM situation). > > > > I am not really sure I understand what you mean here. Let me try. You > > have N tasks sharing the same mm. OOM killer selects one of them and > > kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom > > reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will > > have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was > > already reaped so there is not much left we can do about it. What now? > > You finally understood what I mean here. OK, good to know we are on the same page. > Say, there are TG1 and TG2 sharing the same mm which are marked as > OOM_SCORE_ADJ_MAX. First round of the OOM killer selects TG1 and sends > SIGKILL to TG1 and TG2. The OOM reaper reaps memory via TG1 and marks > TG1 as OOM_SCORE_ADJ_MIN and revokes TIF_MEMDIE from TG1. Then, next > round of the OOM killer selects TG2 and sends SIGKILL to TG1 and TG2. > But since TG1 is already marked as OOM_SCORE_ADJ_MIN by the OOM reaper, > the OOM reaper is not called. which doesn't matter because this mm has already been reaped and further attempts are basically deemed to fail as well. This is the reason why I completely failed to see your point previously. Because it is not the oom reaper which makes the situation worse. We just never cared about this possible case. > This is a situation which the patch you show below will solve. OK, great. [...] > Michal Hocko wrote: > > On Thu 14-04-16 14:01:06, Michal Hocko wrote: > > [...] > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > index 716759e3eaab..d5a4d08f2031 100644 > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > > return OOM_SCAN_CONTINUE; > > > > > > /* > > > + * mm of this task has already been reaped so it doesn't make any > > > + * sense to select it as a new oom victim. > > > + */ > > > + if (test_bit(MMF_OOM_REAPED, &task->mm->flags)) > > > + return OOM_SCAN_CONTINUE; > > > > This will have to move to oom_badness to where we check for > > OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child... > > oom_badness() should return 0 if MMF_OOM_REAPED is set (please be careful > with race task->mm becoming NULL). This is what I ended up with. Patch below for the reference. I plan to repost with other 3 posted recently in one series sometimes next week hopefully. > But oom_scan_process_thread() should not > return OOM_SCAN_ABORT if one of threads in TG1 or TG2 still has TIF_MEMDIE > (because it is possible that one of threads in TG1 or TG2 gets TIF_MEMDIE > via the fatal_signal_pending(current) shortcut in out_of_memory()). This would be a separate patch again. I still have to think how to deal with this case but the most straightforward thing to do would be to simply disable those shortcuts for crosss process shared mm-s. They are just too weird and I do not think we want to support all the potential corner cases and dropping an optimistic heuristic in the name of overal sanity sounds as a good compromise to me. --- >From 700037c19c9e713fdf248a687cad7f3d859d4d6d Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxxx> Date: Thu, 14 Apr 2016 14:02:10 +0200 Subject: [PATCH] mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") not only clears TIF_MEMDIE for oom reaped task but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the oom killer. This works in simple cases but it is not sufficient for (unlikely) cases where the mm is shared between independent processes (as they do not share signal struct). If the mm had only small amount of memory which could be reaped then another task sharing the mm could be selected and that wouldn't help to move out from the oom situation. Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set. Set the flag after __oom_reap_task is done with a task. This will force the select_bad_process() to ignore all already oom reaped tasks as well as no such task is sacrificed for its parent. Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> --- include/linux/sched.h | 1 + mm/oom_kill.c | 9 +++++++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index acfc32b30704..7bd0fa9db199 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ +#define MMF_OOM_REAPED 21 /* mm has been already reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 716759e3eaab..98d38e295883 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -174,8 +174,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, if (!p) return 0; + /* + * Do not even consider tasks which are explicitly marked oom + * unkillable or have been already oom reaped. + */ adj = (long)p->signal->oom_score_adj; - if (adj == OOM_SCORE_ADJ_MIN) { + if (adj == OOM_SCORE_ADJ_MIN || + test_bit(MMF_OOM_REAPED, &p->mm->flags)) { task_unlock(p); return 0; } @@ -513,7 +518,7 @@ static bool __oom_reap_task(struct task_struct *tsk) * This task can be safely ignored because we cannot do much more * to release its memory. */ - tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; + test_bit(MMF_OOM_REAPED, &mm->flags); out: mmput(mm); return ret; -- 2.8.0.rc3 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>