Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

David Rientjes <rientjes@xxxxxxxxxx> · Mon, 29 Jan 2018 14:16:23 -0800 (PST)

On Fri, 26 Jan 2018, Andrew Morton wrote:

> > > > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > > > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> > > 
> > > Approximately.  Let me put it another way: can we modify your patchset
> > > so that the mount option remains, and continues to have a sufficiently
> > > same effect?  For backward compatibility.
> > > 
> > 
> > The mount option would exist solely to set the oom policy of the root mem 
> > cgroup, it would lose its effect of mandating that policy for any subtree 
> > since it would become configurable by the user if delegated.
> 
> Why can't we propagate the mount option into the subtrees?
> 
> If the user then alters that behaviour with new added-by-David tunables
> then fine, that's still backward compatible.
> 

It's not, if you look for the "groupoom" mount option it will specify two 
different things: the entire hierarchy is locked into a single per-cgroup 
usage comparison (Roman's original patchset), and entire hierarchy had an 
initial oom policy set which could have subsequently changed (my 
extension).  With memory.oom_policy you need to query what the effective 
policy is, checking for "groupoom" is entirely irrelevant, it was only the 
initial setting.

Thus, if memory.oom_policy is going to be merged in the future, it 
necessarily obsoletes the mount option.  It would depend on the kernel 
version to determine its meaning.

I'm struggling to see the benefit of simply not reviewing patches that 
build off the original and merging a patchset early.  What are we gaining?

> > Let me put it another way: if the cgroup aware oom killer is merged for 
> > 4.16 without this patchset, userspace can reasonably infer the oom policy 
> > from checking how cgroups were mounted.  If my followup patchset were 
> > merged for 4.17, that's invalid and it becomes dependent on kernel 
> > version: it could have the "groupoom" mount option but configured through 
> > the root mem cgroup's memory.oom_policy to not be cgroup aware at all.
> 
> That concern seems unreasonable to me.  Is an application *really*
> going to peek at the mount options to figure out what its present oom
> policy is?  Well, maybe.  But that's a pretty dopey thing to do and I
> wouldn't lose much sleep over breaking any such application in the very
> unlikely case that such a thing was developed in that two-month window.
> 

It's not dopey, it's the only way that any userspace can determine what 
process is going to be oom killed!  That policy will dictate how the 
cgroup hierarchy is configured without my extension, there's no other way 
to prefer or bias processes.

How can a userspace cgroup manager possibly construct a cgroup v2 
hierarchy with expected oom kill behavior if it is not peeking at the 
mount option?

My concern is that if extended with my patchset the mount option itself 
becomes obsolete and then peeking at it is irrelevant to the runtime 
behavior!

> > > There's nothing wrong with that!  As long as we don't break existing
> > > setups while evolving the feature.  How do we do that?
> > 
> > We'd break the setups that actually configure their cgroups and processes 
> > to abide by the current implementation since we'd need to discount 
> > oom_score_adj from the the root mem cgroup usage to fix it.
> 
> Am having trouble understanding that.  Expand, please?
> 
> Can we address this (and other such) issues in the (interim)
> documentation?
> 

This point isn't a documentation issue at all, this is the fact that 
oom_score_adj is only effective for the root mem cgroup.  If the user is 
fully aware of the implementation, it does not change the fact that he or 
she will construct their cgroup hierarchy and attach processes to it to 
abide by the behavior.  That is the breakage that I am concerned about.

An example: you have a log scraper that is running with 
/proc/pid/oom_score_adj == 999.  It's best effort, it can be killed, we'll 
retry the next time if the system has memory available.  This is partly 
why oom_adj and oom_score_adj exist and is used on production systems.

If you attach that process to an unlimited mem cgroup dedicated to system 
daemons purely for the rich stats that mem cgroup provides, this breaks 
the oom_score_adj setting solely because it's attached to the cgroup.  On 
system-wide oom, it is no longer the killed process merely because it is 
attached to an unlimited child cgroup.  This is not the only such example: 
this occurs for any process attached to a cgroup.
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html