Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

Michal Hocko <mhocko@xxxxxxxxxx> · Fri, 25 Aug 2017 10:14:03 +0200

On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
[...]
> > > Both ways are not ideal, and sum of the processes is not ideal too.
> > > Especially, if you take oom_score_adj into account. Will you respect it?
> > 
> > Yes, and I do not see any reason why we shouldn't.
> 
> It makes things even more complicated.
> Right now task's oom_score can be in (~ -total_memory, ~ +2*total_memory) range,
> and it you're starting summing it, it can be multiplied by number of tasks...
> Weird.

oom_score_adj is just a normalized bias so if tasks inside oom will use
it the whole memcg will get accumulated bias from all such tasks so it
is not completely off. I agree that the more tasks use the bias the more
biased the whole memcg will be. This might or might not be a problem.
As you are trying to reimplement the existing oom killer implementation
I do not think we cannot simply ignore API which people are used to.

If this was a configurable oom policy then I could see how ignoring
oom_score_adj is acceptable because it would be an explicit opt-in.

> It also will be different in case of system and memcg-wide OOM.

Why, we do honor oom_score_adj for the memcg OOM now and in fact the
kernel memcg OOM killer shouldn't be very much different from the global
one except for the tasks scope.

> > > I've started actually with such approach, but then found it weird.
> > > 
> > > > Besides that you have
> > > > to check each task for over-killing anyway. So I do not see any
> > > > performance merits here.
> > > 
> > > It's an implementation detail, and we can hopefully get rid of it at some point.
> > 
> > Well, we might do some estimations and ignore oom scopes but I that
> > sounds really complicated and error prone. Unless we have anything like
> > that then I would start from tasks and build up the necessary to make a
> > decision at the higher level.
> 
> Seriously speaking, do you have an example, when summing per-process
> oom_score will work better?

The primary reason I am pushing for this is to have the common iterator
code path (which we have since Vladimir has unified memcg and global oom
paths) and only parametrize the value calculation and victim selection.

> Especially, if we're talking about customizing oom_score calculation,
> it makes no sence to me. How you will sum process timestamps?

Well, I meant you could sum oom_badness for your particular
implementation. If we need some other policy then this wouldn't work and
that's why I've said that I would like to preserve the current common
code and only parametrize value calculation and victim selection...
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html