On Tue, 17 Jul 2018, Roman Gushchin wrote:

> > > Let me show my proposal on examples. Let's say we have the following
> > > hierarchy, and the biggest process (or the process with the highest
> > > oom_score_adj) is in D.
> > >
> > >        /
> > >        |
> > >        A
> > >        |
> > >        B
> > >       / \
> > >      C   D
> > >
> > > Let's look at different examples and intended behavior:
> > > 1) system-wide OOM
> > >    - default settings: the biggest process is killed
> > >    - D/memory.group_oom=1: all processes in D are killed
> > >    - A/memory.group_oom=1: all processes in A are killed
> > > 2) memcg oom in B
> > >    - default settings: the biggest process is killed
> > >    - A/memory.group_oom=1: the biggest process is killed
> >
> > Huh? Why would you even consider A here when the oom is below it?
> > /me confused
>
> I do not. This is exactly a counter-example: A's memory.group_oom is
> not considered at all in this case, because A is above the ooming
> cgroup.
>

I think the confusion is that this says A/memory.group_oom=1 and then the
biggest process is killed, which doesn't seem to match the description you
want to give memory.group_oom.

> > >    - B/memory.group_oom=1: all processes in B are killed
> > >    - B/memory.group_oom=0 &&
> > >      D/memory.group_oom=1: all processes in D are killed
> >
> > What about?
> >    - B/memory.group_oom=1 && D/memory.group_oom=0
>
> All tasks in B are killed.
>
> Group_oom set to 1 means that the workload can't tolerate killing of a
> random process, so in this case it's better to guarantee consistency
> for B.
>

This example is missing the usecase that I was referring to, i.e. killing
all processes attached to a subtree because the limit on a common ancestor
has been reached.

In your example, I would think that the memory.group_oom settings of /A
and /A/B are meaningless because there are no processes attached to them.

IIUC, your proposal is to select the victim by whatever means, check the
memory.group_oom setting of that victim, and then either kill the victim
or all processes attached to that local mem cgroup, depending on the
setting.

However, if C and D here are only limited by a common ancestor, /A or
/A/B, there is no way to specify that the subtree itself should be oom
killed.  That was where I thought a tristate value would be helpful: you
could define that all processes attached to the subtree should be oom
killed when a mem cgroup has reached its memory.max.

I was purposefully overloading memory.group_oom because the actual value
of memory.group_oom, given your semantics here, is not relevant for /A or
/A/B.  I think an additional memory.group_oom_tree, or whatever it would
be called, would lead to unnecessary confusion, because then we have a
model where one tunable means something based on the value of the other.

Given the no-internal-process constraint of cgroup v2, my suggestion was a
value, "tree", that could specify that a mem cgroup reaching its limit
causes all processes attached to its subtree to be killed.  This is
needed only because of the single unified hierarchy of cgroup v2: we may
want to bind a subset of processes to be controlled separately by another
controller, but still want all processes oom killed when the limit of a
common ancestor is reached.

Thus, the semantic would be: if the oom mem cgroup is "tree", kill all
processes in its subtree; otherwise, it can be "cgroup" or "process" to
determine what is oom killed depending on the victim selection.
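
To make that concrete, here's a rough sketch in pseudo-C of the tristate
dispatch I have in mind; group_oom_mode(), the GROUP_OOM_* values,
select_victim() and the kill_*() helpers are made-up names for
illustration, not anything that exists in the tree:

	static void oom_kill(struct oom_control *oc)
	{
		/*
		 * oc->memcg is the mem cgroup whose limit was hit,
		 * or NULL for a system-wide oom.
		 */
		struct mem_cgroup *oom_memcg = oc->memcg;
		struct task_struct *victim;
		struct mem_cgroup *victim_memcg;

		if (oom_memcg && group_oom_mode(oom_memcg) == GROUP_OOM_TREE) {
			/*
			 * The limit was hit on a cgroup marked "tree":
			 * kill everything attached to its subtree,
			 * regardless of where the victim would have
			 * been selected.
			 */
			kill_all_tasks_in_subtree(oom_memcg);
			return;
		}

		/* Otherwise, select the victim by whatever means ... */
		victim = select_victim(oc);
		victim_memcg = mem_cgroup_from_task(victim);

		/* ... and the victim's own mem cgroup decides the scope. */
		if (group_oom_mode(victim_memcg) == GROUP_OOM_CGROUP)
			kill_all_tasks_in_cgroup(victim_memcg);
		else	/* GROUP_OOM_PROCESS */
			kill_one_task(victim);
	}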
Having the "tree" behavior could definitely be implemented as a separate tunable; but then then value of /A/memory.group_oom and /A/B/memory.group_oom are irrelevant and, to me, seems like it would be more confusing.
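
Applied to the hierarchy above, with memory.max hit on /A/B and the
biggest process sitting in D, the tristate would give, e.g.:

	D/memory.group_oom == "process": the biggest process in D is killed
	D/memory.group_oom == "cgroup":  all processes in D are killed
	B/memory.group_oom == "tree":    all processes attached to B's
	                                 subtree, i.e. in C and D, are killed

and with "tree" set on B, the values on C and D no longer matter.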