Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

David Rientjes <rientjes@xxxxxxxxxx> · Fri, 26 Jan 2018 14:52:59 -0800 (PST)

On Fri, 26 Jan 2018, Andrew Morton wrote:

> > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> 
> Approximately.  Let me put it another way: can we modify your patchset
> so that the mount option remains, and continues to have a sufficiently
> same effect?  For backward compatibility.
> 

The mount option would exist solely to set the oom policy of the root mem 
cgroup, it would lose its effect of mandating that policy for any subtree 
since it would become configurable by the user if delegated.

Let me put it another way: if the cgroup aware oom killer is merged for 
4.16 without this patchset, userspace can reasonably infer the oom policy 
from checking how cgroups were mounted.  If my followup patchset were 
merged for 4.17, that's invalid and it becomes dependent on kernel 
version: it could have the "groupoom" mount option but configured through 
the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That inconsistency, to me, is unfair to burden userspace with.

> > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > cgroups, are essential before the feature is merged otherwise it yields 
> > wildly unpredictable (and unexpected, since its interaction with 
> > oom_score_adj isn't documented) results as I already demonstrated where 
> > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > that subtree.
> 
> OK, so Roman's new feature is incomplete: it satisfies some use cases
> but not others.  And we kinda have a plan to address the other use
> cases in the future.
> 

Those use cases are also undocumented such that the user doesn't know the 
behavior they are opting into.  Nowhere in the patchset does it mention 
anything about oom_score_adj other than being oom disabled.  It doesn't 
mention that a per-process tunable now depends strictly on whether it is 
attached to root or not.  It specifies a fair comparison between the root 
mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
implementation itself.  So I'm not sure the user would know which use 
cases it is valid for, which is why I've been trying to make it generally 
purposeful and documented.

> There's nothing wrong with that!  As long as we don't break existing
> setups while evolving the feature.  How do we do that?
> 

We'd break the setups that actually configure their cgroups and processes 
to abide by the current implementation since we'd need to discount 
oom_score_adj from the the root mem cgroup usage to fix it.

There hasn't been any feedback on v2 of my patchset that would suggest 
changes are needed.  I think we all recognize the shortcoming that it is 
addressing.  The only feedback on v1, the need for memory.oom_group as a 
separate tunable, has been addressed in v2.  What are we waiting for?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>