On Sat 28-05-16 01:24:51, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Fri 27-05-16 22:18:42, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > So this is the biggest change to my approach. And I think it is > > > > incorrect because you cannot simply reap the memory when you have active > > > > users of that memory potentially. > > > > > > I don't reap the memory when I have active users of that memory potentially. > > > I do below check. I'm calling wake_oom_reaper() in order to guarantee that > > > oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims. > > > > > > @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk) > > > > > > task_unlock(p); > > > > > > - if (!down_read_trylock(&mm->mmap_sem)) { > > > + if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) { > > > ret = false; > > > goto unlock_oom; > > > } > > > > OK, I've missed this part. So you just defer the decision to the oom > > reaper while I am trying to solve that at oom_kill_process level. > > Right. > > > We could very well do > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index bcb6d3b26c94..d9017b8c7300 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, > > * memory might be still used. > > */ > > can_oom_reap = false; > > + set_bit(MMF_OOM_REAPED, mm->flags); > > continue; > > } > > if (p->signal->oom_score_adj == OOM_ADJUST_MIN) > > > > with the same result. If you _really_ think that this would make a > > difference I could live with that. But I am highly skeptical this > > matters all that much. > > It matters a lot. There is a "guarantee" difference. > > Maybe this is something like top half and bottom half relationship? > The OOM killer context is the top half and the OOM reaper context is > the bottom half. "The top half always hands over to the bottom half" > can allow the bottom half to recover when something went wrong. > In your approach, the top half might not hand over to the bottom half. But your bottom half would just decide to back off the same way I do here. And as for the bonus your bottom half would have to do the rather costly process iteration to find that out. [...] > > > > Shared with global init is just non > > > > existant problem. Such a system would be crippled enough to not bother. > > > > > > See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init > > > sharing same memory"). > > > > Don't you think that a system where the largest memory consumer is the > > global init is crippled terribly? > > Why not? > > PID=1 name=init RSS=100MB > \ > +-- PID=102 name=child RSS=10MB (completed execve("child") 10 minutes ago) > \ > +-- PID=103 name=child RSS=10MB (completed execve("child") 7 minutes ago) > \ > +-- PID=104 name=child RSS=10MB (completed execve("child") 3 minutes ago) > \ > +-- PID=105 name=init RSS=100MB (about to start execve("child") from vfork()) > > should be allowed, doesn't it? Killing the vforked task doesn't make any sense because it will be quite unlikely to free any memory. We should select from the any other tasks which have completed their vfork. > There is no reason to exclude vfork()'ed child from OOM victim candidates. > We can't control how people run their userspace programs. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>