On Thu 19-07-18 10:05:47, Roman Gushchin wrote:
> On Thu, Jul 19, 2018 at 09:38:43AM +0200, Michal Hocko wrote:
> > On Wed 18-07-18 08:28:50, Roman Gushchin wrote:
> > > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > > > > [...]
> > > > > > > Let me show my proposal with examples. Let's say we have the
> > > > > > > following hierarchy, and the biggest process (or the process
> > > > > > > with the highest oom_score_adj) is in D.
> > > > > > >
> > > > > > >      /
> > > > > > >      |
> > > > > > >      A
> > > > > > >      |
> > > > > > >      B
> > > > > > >     / \
> > > > > > >    C   D
> > > > > > >
> > > > > > > Let's look at different examples and the intended behavior:
> > > > > > > 1) system-wide OOM
> > > > > > >    - default settings: the biggest process is killed
> > > > > > >    - D/memory.group_oom=1: all processes in D are killed
> > > > > > >    - A/memory.group_oom=1: all processes in A are killed
> > > > > > > 2) memcg oom in B
> > > > > > >    - default settings: the biggest process is killed
> > > > > > >    - A/memory.group_oom=1: the biggest process is killed
> > > > > >
> > > > > > Huh? Why would you even consider A here when the oom is below it?
> > > > > > /me confused
> > > > >
> > > > > I do not. This is exactly a counter-example: A's memory.group_oom
> > > > > is not considered at all in this case, because A is above the
> > > > > ooming cgroup.
> > > >
> > > > OK, it confused me.
> > > >
> > > > > > >    - B/memory.group_oom=1: all processes in B are killed
> > > > > > >    - B/memory.group_oom=0 &&
> > > > > > >    - D/memory.group_oom=1: all processes in D are killed
> > > > > >
> > > > > > What about?
> > > > > >    - B/memory.group_oom=1 && D/memory.group_oom=0
> > > > >
> > > > > All tasks in B are killed.
> > > >
> > > > So essentially: find a task, then traverse the memcg hierarchy from
> > > > the victim's memcg up to the oom root as long as memcg.group_oom = 1?
> > > > If the resulting memcg.group_oom == 1 then kill the whole subtree.
> > > > Right?
> > >
> > > Yes.
> > >
> > > > > Group_oom set to 1 means that the workload can't tolerate the
> > > > > killing of a random process, so in this case it's better to
> > > > > guarantee consistency for B.
> > > >
> > > > OK, but then if D itself is OOM we do not care about consistency
> > > > all of a sudden? I have a hard time thinking of a sensible usecase.
> > >
> > > I mean that if, while traversing the hierarchy up to the oom root, we
> > > meet a memcg with group_oom set to 0, we shouldn't stop traversing.
> >
> > Well, I am still fighting with the semantics of the group, no-group,
> > group configuration. Why does it make any sense? In other words, when
> > can we consider a cgroup to be an indivisible workload in one oom
> > context while it is fine to lose a head or an arm in another?
>
> Hm, so the question is: should we traverse up to the OOMing cgroup,
> or up to the first cgroup with memory.group_oom=0?
>
> I looked at an example, and it *might* be that the latter is better,
> especially if we make the default value inheritable.
>
> Let's say we have a sub-tree with a workload and some control stuff.
> The workload is tolerant of OOMs (we can handle them in userspace, for
> example), but the control stuff is not.
> Then it probably makes no sense to kill the entire sub-tree if a task
> in C has to be killed, but it makes perfect sense if we have to kill
> a task in B.
>
>     /
>     |
>     A, delegated sub-tree, group_oom=1
>    / \
>   B   C, workload, group_oom=0
>   ^
>   some control stuff here, group_oom=1
>
> Does this make sense?

I am not sure. If you are going to delegate then you are basically losing
control of group_oom at the A level. Is this good? What if I _want_ to tear
down the whole thing if it starts misbehaving, because I do not trust it?

The more I think about it, the more I am concluding that we should start
with a more constrained model and require that once the parent has
group_oom == 1, the children have to have it as well. If we ever find a
usecase that requires a different scheme we can weaken it later. We cannot
do that the other way around.

Tejun, Johannes, what do you think about that?
--
Michal Hocko
SUSE Labs
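For illustration, here is a minimal userspace sketch of one of the two
traversal rules debated in the quoted exchange: walk from the victim's memcg
up to the oom root, do not stop at an intermediate group_oom=0 cgroup, and
kill the subtree of the highest ancestor with group_oom=1 (rather than
stopping at the first group_oom=0). The toy struct memcg and the helper
find_group_oom_root() are made up for this sketch and are not kernel APIs;
the knob name memory.group_oom is taken from the discussion above.

/*
 * Toy model of the group_oom traversal rule; not kernel code.
 */
#include <stdio.h>

struct memcg {
	const char *name;
	struct memcg *parent;
	int group_oom;			/* memory.group_oom: 0 or 1 */
};

/*
 * Walk from the victim's memcg up to the memcg that hit its limit
 * (oom_root; NULL for a system-wide OOM) and remember the highest
 * ancestor with group_oom set.  Return that memcg (its whole subtree
 * is to be killed), or NULL to kill only the victim task.
 */
static struct memcg *find_group_oom_root(struct memcg *victim,
					 struct memcg *oom_root)
{
	struct memcg *kill_root = NULL;
	struct memcg *pos;

	for (pos = victim; pos; pos = pos->parent) {
		if (pos->group_oom)
			kill_root = pos;
		if (pos == oom_root)	/* never look above the oom root */
			break;
	}
	return kill_root;
}

int main(void)
{
	/* Hierarchy from the example above: / -> A -> B -> {C, D} */
	struct memcg A = { "A", NULL, 0 };
	struct memcg B = { "B", &A,   1 };	/* B/memory.group_oom=1 */
	struct memcg D = { "D", &B,   0 };	/* D/memory.group_oom=0 */
	struct memcg *r;

	/* memcg oom in B, victim found in D: all tasks in B are killed */
	r = find_group_oom_root(&D, &B);
	printf("kill subtree of %s\n", r ? r->name : "(victim only)");

	return 0;
}

Under this rule the B/group_oom=1 && D/group_oom=0 case prints "kill subtree
of B", matching the behavior stated in the thread; the alternative rule (stop
at the first group_oom=0 cgroup) and the stricter model proposed at the end
(children forced to 1 when the parent is 1) would need a different walk.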