Michal Hocko wrote: > On Thu 14-04-16 20:34:18, Tetsuo Handa wrote: > > Michal Hocko wrote: > [...] > > > The patch seems correct I just do not see any point in it because I do > > > not think it handles any real life situation. I basically consider any > > > workload where only _certain_ thread(s) or process(es) sharing the mm have > > > OOM_SCORE_ADJ_MIN set as invalid. Why should we care about those? This > > > requires root to cripple the system. Or am I missing a valid > > > configuration where this would make any sense? > > > > Because __oom_reap_task() as of current linux.git marks only one of > > thread groups as OOM_SCORE_ADJ_MIN and happily disables further reaping > > (which I'm utilizing such behavior for catching bugs which occur under > > almost OOM situation). > > I am not really sure I understand what you mean here. Let me try. You > have N tasks sharing the same mm. OOM killer selects one of them and > kills it, grants TIF_MEMDIE and schedules it for oom_reaper. Now the oom > reaper handles that task and marks it OOM_SCORE_ADJ_MIN. Others will > have fatal_signal_pending without OOM_SCORE_ADJ_MIN. The shared mm was > already reaped so there is not much left we can do about it. What now? You finally understood what I mean here. Say, there are TG1 and TG2 sharing the same mm which are marked as OOM_SCORE_ADJ_MAX. First round of the OOM killer selects TG1 and sends SIGKILL to TG1 and TG2. The OOM reaper reaps memory via TG1 and marks TG1 as OOM_SCORE_ADJ_MIN and revokes TIF_MEMDIE from TG1. Then, next round of the OOM killer selects TG2 and sends SIGKILL to TG1 and TG2. But since TG1 is already marked as OOM_SCORE_ADJ_MIN by the OOM reaper, the OOM reaper is not called. This is a situation which the patch you show below will solve. > > A different question is whether it makes any sense to pick a task with > oom reaped mm as a new victim. This would happen if either the memory > is not reapable much or the mm was quite small. I agree that we do not > handle this case now same as we haven't before. An mm specific flag > would handle that I believe. Something like the following. Is this what > you are worried about or am I still missing your point? > --- > diff --git a/include/linux/sched.h b/include/linux/sched.h > index acfc32b30704..7bd0fa9db199 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm) > > #define MMF_HAS_UPROBES 19 /* has uprobes */ > #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ > +#define MMF_OOM_REAPED 21 /* mm has been already reaped */ > > #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 716759e3eaab..d5a4d08f2031 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > return OOM_SCAN_CONTINUE; > > /* > + * mm of this task has already been reaped so it doesn't make any > + * sense to select it as a new oom victim. > + */ > + if (test_bit(MMF_OOM_REAPED, &task->mm->flags)) You checked for task->mm != NULL at previous line but nothing prevents that task from setting task->mm = NULL before arriving at this line. > + return OOM_SCAN_CONTINUE; > + > + /* > * If task is allocating a lot of memory and has been marked to be > * killed first if it triggers an oom, then select it. > */ > @@ -513,7 +520,7 @@ static bool __oom_reap_task(struct task_struct *tsk) > * This task can be safely ignored because we cannot do much more > * to release its memory. > */ > - tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; > + test_bit(MMF_OOM_REAPED, &mm->flags); > out: > mmput(mm); > return ret; Michal Hocko wrote: > On Thu 14-04-16 14:01:06, Michal Hocko wrote: > [...] > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 716759e3eaab..d5a4d08f2031 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -286,6 +286,13 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, > > return OOM_SCAN_CONTINUE; > > > > /* > > + * mm of this task has already been reaped so it doesn't make any > > + * sense to select it as a new oom victim. > > + */ > > + if (test_bit(MMF_OOM_REAPED, &task->mm->flags)) > > + return OOM_SCAN_CONTINUE; > > This will have to move to oom_badness to where we check for > OOM_SCORE_ADJ_MIN to catch the case where we try to sacrifice a child... oom_badness() should return 0 if MMF_OOM_REAPED is set (please be careful with race task->mm becoming NULL). But oom_scan_process_thread() should not return OOM_SCAN_ABORT if one of threads in TG1 or TG2 still has TIF_MEMDIE (because it is possible that one of threads in TG1 or TG2 gets TIF_MEMDIE via the fatal_signal_pending(current) shortcut in out_of_memory()). > > In the meantime I have generated a full patch and will repost it with > other oom reaper follow ups sometimes next week. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>