Re: [PATCH] mm,oom: use per signal_struct flag rather than clear TIF_MEMDIE

Michal Hocko <mhocko@xxxxxxxxxx> · Mon, 27 Jun 2016 11:23:26 +0200

On Sat 25-06-16 14:44:39, Tetsuo Handa wrote:
> Oleg Nesterov wrote:
> > Since I mentioned TIF_MEMDIE in another thread, I simply can't resist.
> > Sorry for grunting.
> > 
> > On 06/24, Tetsuo Handa wrote:
> > >
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -801,6 +801,7 @@ struct signal_struct {
> > >  	 * oom
> > >  	 */
> > >  	bool oom_flag_origin;
> > > +	bool oom_ignore_victims;        /* Ignore oom_victims value */
> > >  	short oom_score_adj;		/* OOM kill score adjustment */
> > >  	short oom_score_adj_min;	/* OOM kill score adjustment min value.
> > >  					 * Only settable by CAP_SYS_RESOURCE. */
> > 
> > Yet another kludge to fix yet another problem with TIF_MEMDIE. Not
> > to mention that that wh
> > 
> > Can't we state the fact TIF_MEMDIE is just broken? The very idea imo.
> 
> Yes. TIF_MEMDIE is a trouble maker.
> 
> Setting TIF_MEMDIE is per task_struct operation.
> Sending SIGKILL is per signal_struct operation.
> OOM killer is per mm_struct operation.

Yes this is really unfortunate. I am trying to converge to per mm
behavior as much as possible. We are getting there slowly but not yet
there.

[...]
> > Just one question. Why do we need this bit outside of oom-kill.c? It
> > affects page_alloc.c and this probably makes sense. But who get this
> > flag when we decide to kill the memory hog? A random thread foung by
> > find_lock_task_mm(), iow a random thread with ->mm != NULL, likely the
> > group leader. This simply can not be right no matter what.
> 
> I agree that setting TIF_MEMDIE to only first ->mm != NULL thread
> does not make sense.

Well the idea was that other threads will get TIF_MEMDIE if they need to
allocate and the initial thread (usually the group leader) will hold off
any other oom killing until it gets past its mmput. So the flag acts
both as memory reserve access key and the exclusion. I am not sure
setting the flag to all threads in the same thread group would help all
that much. Processes sharing the mm outside of the thread group should
behave in a similar way. The general reluctance to give access to all
threads was to prevent from thundering herd effect which is more likely
that way.

[...]

> > And in any case I don't understand this patch but I have to admit that
> > I failed to force myself to read the changelog and the actual change ;)
> > In any case I agree that we should not set MMF_MEMDIE if ->mm == NULL,
> > and if we ensure this then I do not understand why we can't rely on
> > MMF_OOM_REAPED. Ignoring the obvious races, if ->oom_victims != 0 then
> > find_lock_task_mm() should succed.
> 
> Since we are using
> 
>   mm = current->mm;
>   current->mm = NULL;
>   __mmput(mm); (may block for unbounded period waiting for somebody else's memory allocation)
>   exit_oom_victim(current);
> 
> sequence, we won't be able to make find_lock_task_mm(tsk) != NULL when
> tsk->signal->oom_victims != 0 unless we change this sequence.
> My patch tries to rescue it using tsk->signal->oom_ignore_victims flag.

I was thinking about this some more and I think that a better approach
would be to not forget the mm during the exit. The whole find_lock_task_mm
sounds like a workaround than a real solution. I am trying to understand
why do we really have to reset the current->mm to NULL during the exit.
If we cannot change this then we can at least keep a stable mm
somewhere. The code would get so much easier that way.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>