Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

Michal Hocko <mhocko@xxxxxxxxxx> · Fri, 26 Jan 2018 18:15:48 +0100

On Thu 25-01-18 15:53:45, David Rientjes wrote:
> The cgroup aware oom killer is needlessly declared for the entire system
> by a mount option.  It's unnecessary to force the system into a single
> oom policy: either cgroup aware, or the traditional process aware.
> 
> This patch introduces a memory.oom_policy tunable for all mem cgroups.
> It is currently a no-op: it can only be set to "none", which is its
> default policy.  It will be expanded in the next patch to define cgroup
> aware oom killer behavior.
> 
> This is an extensible interface that can be used to define cgroup aware
> assessment of mem cgroup subtrees or the traditional process aware
> assessment.
> 

So what is the actual semantic and scope of this policy. Does it apply
only down the hierarchy. Also how do you compare cgroups with different
policies? Let's say you have
          root
         / |  \
        A  B   C
       / \    / \
      D   E  F   G

Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1

Now we have the global OOM killer to choose a victim. From a quick
glance over those patches, it seems that we will be comparing only
tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and C
policies are ignored. Moreover If I select any of B's tasks then I will
happily kill it breaking the expectation that the whole memcg will go
away. Weird, don't you think? Or did I misunderstand?

So let's assume that root: cgroup. Then we are finally comparing
cgroups. D, E, B, C. Of those D, E and F do not have any
policy. Do they inherit their policy from the parent? If they don't then
we should be comparing their tasks separately, no? The code disagrees
because once we are in the cgroup mode, we do not care about separate
tasks.

Let's say we choose C because it has the largest cumulative consumption.
It is not oom_group so it will select a task from F, G. Again you are
breaking oom_group policy of G if you kill a single task. So you would
have to be recursive here. That sounds fixable though. Just be
recursive.

Then you say

> Another benefit of such an approach is that an admin can lock in a
> certain policy for the system or for a mem cgroup subtree and can
> delegate the policy decision to the user to determine if the kill should
> originate from a subcontainer, as indivisible memory consumers
> themselves, or selection should be done per process.

And the code indeed doesn't check oom_policy on each level of the
hierarchy, unless I am missing something. So the subgroup is simply
locked in to the oom_policy parent has chosen. That is not the case for
the tree policy.

So look how we are comparing cumulative groups without policy with
groups with policy with subtrees. Either I have grossly misunderstood
something or this is massively inconsistent and it doesn't make much
sense to me. Root memcg without cgroup policy will simply turn off the
whole thing for the global OOM case. So you really need to enable it
there but then it is not really clear how to configure lower levels.

>From the above it seems that you are more interested in memcg OOMs and
want to give different hierarchies different policies but you quickly
hit the similar inconsistencies there as well.

I am not sure how extensible this is actually. How do we place
priorities on top?

> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html