Re: [v8 0/4] cgroup-aware OOM killer

Roman Gushchin <guro@xxxxxx> · Wed, 20 Sep 2017 14:53:41 -0700

On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> > > On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> > > > On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > > > > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > > > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I strongly believe that comparing only leaf memcgs
> > > > > > > is more straightforward and it doesn't lead to unexpected results as
> > > > > > > mentioned before (kill a small memcg which is a part of the larger
> > > > > > > sub-hierarchy).
> > > > > > 
> > > > > > One of two main goals of this patchset is to introduce cgroup-level
> > > > > > fairness: bigger cgroups should be affected more than smaller,
> > > > > > despite the size of tasks inside. I believe the same principle
> > > > > > should be used for cgroups.
> > > > > 
> > > > > Yes bigger cgroups should be preferred but I fail to see why bigger
> > > > > hierarchies should be considered as well if they are not kill-all. And
> > > > > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > > > > me. What would be the usecase?
> > > > 
> > > > We definitely want to support kill-all for non-leaf cgroups.
> > > > A workload can consist of several cgroups and we want to clean up
> > > > the whole thing on OOM.
> > > 
> > > Could you be more specific about such a workload? E.g. how can be such a
> > > hierarchy handled consistently when its sub-tree gets killed due to
> > > internal memory pressure?
> > 
> > Or just system-wide OOM.
> > 
> > > Or do you expect that none of the subtree will
> > > have hard limit configured?
> > 
> > And this can also be a case: the whole workload may have hard limit
> > configured, while internal memcgs have only memory.low set for "soft"
> > prioritization.
> > 
> > > 
> > > But then you just enforce a structural restriction on your configuration
> > > because
> > > 	root
> > >         /  \
> > >        A    D
> > >       /\   
> > >      B  C
> > > 
> > > is a different thing than
> > > 	root
> > >         / | \
> > >        B  C  D
> > >
> > 
> > I actually don't have a strong argument against an approach to select
> > largest leaf or kill-all-set memcg. I think, in practice there will be
> > no much difference.

I've tried to implement this approach, and it's really arguable.
Although your example looks reasonable, the opposite example is also valid:
you might want to compare whole hierarchies, and that's quite a typical use case.

Assume you have several containerized workloads on a machine (probably
each contained in a memcg with memory.max set), with some hierarchy
of cgroups inside. Then, in case of a global memory shortage, we want to reclaim
some memory from the biggest workload, and the selection should not depend
on oom_group settings. It would be really strange if setting oom_group
raised the chances of being killed.

In other words, let's imagine processes as leaf nodes in the memcg tree. We decided
to select the biggest memcg and kill one or more processes inside (depending
on the oom_group setting), but the memcg selection itself doesn't depend on it.
We do not compare processes from different cgroups, nor do we compare cgroups
with processes. The same should apply to cgroups: why would we want to compare
cgroups from different sub-trees?
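
To make the difference concrete, here is a rough userspace sketch in Python.
It is not the patchset code: the Memcg class, the numbers, and the helpers
select_hierarchical(), select_flat() and build_example() are all made up for
illustration. It compares level-by-level selection (only siblings are
compared) with the flat "largest leaf or kill-all memcg" selection.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memcg:
    """Hypothetical toy model of a memory cgroup (not a kernel structure)."""
    name: str
    own_usage: int = 0        # memory charged to tasks directly in this group
    oom_group: bool = False   # kill-all setting
    children: List["Memcg"] = field(default_factory=list)

    def usage(self) -> int:
        # Hierarchical usage: own charges plus everything below.
        return self.own_usage + sum(c.usage() for c in self.children)


def select_hierarchical(root: Memcg) -> Memcg:
    """Compare siblings only; descend until a leaf or an oom_group memcg."""
    node = root
    while node.children:
        node = max(node.children, key=Memcg.usage)
        if node.oom_group:
            break
    return node


def select_flat(root: Memcg) -> Memcg:
    """Compare all leaf memcgs and all oom_group memcgs system-wide."""
    candidates: List[Memcg] = []

    def walk(node: Memcg) -> None:
        if node is not root and (node.oom_group or not node.children):
            candidates.append(node)
            return
        for child in node.children:
            walk(child)

    walk(root)
    return max(candidates, key=Memcg.usage, default=root)


def build_example(w1_oom_group: bool) -> Memcg:
    """Two workloads: W1 (90 units in total) and W2 (50 units in total)."""
    w1 = Memcg("W1", oom_group=w1_oom_group, children=[
        Memcg("a", 35), Memcg("b", 30), Memcg("c", 25)])
    w2 = Memcg("W2", children=[Memcg("e", 40), Memcg("f", 10)])
    return Memcg("root", children=[w1, w2])


if __name__ == "__main__":
    for flag in (False, True):
        tree = build_example(flag)
        print("W1 oom_group =", flag,
              "| hierarchical:", select_hierarchical(tree).name,
              "| flat:", select_flat(tree).name)
```

In this toy setup the hierarchical selection always lands inside W1, the
biggest workload, whether or not oom_group is set on it; the knob only
changes what gets killed. With the flat selection the victim moves from W2
to W1 as soon as oom_group is set on W1.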

While size-based comparison can be implemented with this approach,
the priority-based one is really weird (as David mentioned).
If priorities have no hierarchical meaning at all, we lack the very important
ability to enforce a hierarchical oom_priority. Otherwise we have to invent some
complex rules of oom_priority propagation (e.g. if someone raises
the oom_priority in a parent, should it be applied to children immediately, etc).

The meaning of the oom_group knob also becomes more complex: it affects both
the victim selection and the OOM action. _ANY_ mechanism which allows affecting
OOM victim selection (either priorities or a bpf-based approach) should
not have a global system-wide meaning; that breaks everything.

I do understand your point, but the same is true for other stuff, right?
E.g. cpu time distribution (and io, etc) depends on the hierarchy configuration.
It's a limitation, but it's ok, as the user should create a hierarchy which
reflects some logical relations between processes and groups of processes.
Otherwise we're heading into configuration hell.

In any case, OOM is a last-resort mechanism. The goal is to reclaim some memory
without crashing the system or leaving it in a totally broken state.
Any really complex mm policy in userspace should be applied _before_ OOM happens.
So I don't think we have to support all possible configurations here,
as long as we're able to achieve the main goal (kill some processes and do not
leave broken systems/containers).