Re: [patch] mm, oom: avoid checking set of allowed nodes twice when selecting a victim

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 25 Apr 2012 13:59:28 -0700 (PDT)

On Wed, 25 Apr 2012, Michal Hocko wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 46bf2ed5..a9df008 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -171,23 +171,10 @@ static bool oom_unkillable_task(struct task_struct *p,
> > >  	return false;
> > >  }
> > >  
> > > -/**
> > > - * oom_badness - heuristic function to determine which candidate task to kill
> > > - * @p: task struct of which task we should calculate
> > > - * @totalpages: total present RAM allowed for page allocation
> > > - *
> > > - * The heuristic for determining which task to kill is made to be as simple and
> > > - * predictable as possible.  The goal is to return the highest value for the
> > > - * task consuming the most memory to avoid subsequent oom failures.
> > > - */
> > > -unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > > +/* can be used only for tasks which are killable as per oom_unkillable_task */
> > > +static unsigned int __oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > >  		      const nodemask_t *nodemask, unsigned long totalpages)
> > >  {
> > > -	long points;
> > > -
> > > -	if (oom_unkillable_task(p, memcg, nodemask))
> > > -		return 0;
> > > -
> > >  	p = find_lock_task_mm(p);
> > >  	if (!p)
> > >  		return 0;
> > > @@ -239,6 +226,26 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > >  	return (points < 1000) ? points : 1000;
> > >  }
> > >  
> > > +/**
> > > + * oom_badness - heuristic function to determine which candidate task to kill
> > > + * @p: task struct of which task we should calculate
> > > + * @totalpages: total present RAM allowed for page allocation
> > > + *
> > > + * The heuristic for determining which task to kill is made to be as simple and
> > > + * predictable as possible.  The goal is to return the highest value for the
> > > + * task consuming the most memory to avoid subsequent oom failures.
> > > + */
> > > +unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
> > > +		      const nodemask_t *nodemask, unsigned long totalpages)
> > > +{
> > > +	long points;
> > > +
> > > +	if (oom_unkillable_task(p, memcg, nodemask))
> > > +		return 0;
> > > +
> > > +	return __oom_badness(p, memcg, nodemask, totalpages);
> > > +}
> > > +
> > >  /*
> > >   * Determine the type of allocation constraint.
> > >   */
> > > @@ -366,7 +373,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> > >  			}
> > >  		}
> > >  
> > > -		points = oom_badness(p, memcg, nodemask, totalpages);
> > > +		points = __oom_badness(p, memcg, nodemask, totalpages);
> > >  		if (points > *ppoints) {
> > >  			chosen = p;
> > >  			*ppoints = points;
> > 
> > No, the way I had it written is correct: the above unnecessarily checks 
> > for membership in a memcg or intersection with a set of allowable nodes 
> > for child threads in oom_kill_process().  
> 
> your patch does 
> 	if (oom_unkillable_task(child, memcg, nodemask))
> 		continue;
> 	oom_badness((child, memcg, nodemask,
> 				   totalpages);
> 
> in oom_kill_process so the check is very same. Or am I missing
> something?
> 

Here you go again.

Why would you ever do something like this?

/proc/pid/oom_score certainly doesn't care about cpusets or memcg and 
exports only oom scores in a global context, anything else would be 
inconsistent.  It only cares about whether the thread is init or another 
kthread because they are ineligible.  So let's leave /proc/pid/oom_score 
out of this.

That's the function of oom_badness(): to assign a point value for a 
specific process to determine the highest priority for oom kill.  It 
shouldn't care about the context of the oom kill; and that's why 
/proc/pid/oom_score, which is always global, doesn't care.

Now tell me what's clearer in terms of the code: calling 
oom_unkillable_task() to determine the context of the oom kill explicitly 
where it matters or calling either oom_badness() or __oom_badness() and 
remembering what the heck the difference between the two is.

You're patch also wouldn't compile because you've removed the declaration 
of "points" from __oom_badness(), which actually uses it, to 
oom_badness(), which doesn't use it, for no apparent reason.

If this sounds frustrated, then it certainly is.  This patch has stalled 
out three weeks from being in -mm because of your broken patch suggestion.  
My patch fixes an issue that people with very large systems and a high 
CONFIG_NODES_SHIFT will encounter and possibly cause timeouts and we run 
with it internally as part of a fix for a faster oom killer because this 
problem actually manifests itself in real-world situations.

And here you are, just like when you wanted to rework a patch of mine and 
rewrite the changelog so Andrew mistakenly sent it to Linus as a fix for a 
patch that wasn't even in his tree, suggesting broken (and admittedly 
untested) patches as some kind of cleanup that actually just makes the 
code harder to understand if you're reading it.

This patch fixes a real-world issue that has been tested on thousands of 
machines.  Please keep your little untested follow-up changes that you 
think make it look better to yourself so that this patch can get merged 
and help people out.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>