On Fri, 26 Jan 2018, Andrew Morton wrote: > > > > -ECONFUSED. We want to have a mount option that has the sole purpose of > > > > doing echo cgroup > /mnt/cgroup/memory.oom_policy? > > > > > > Approximately. Let me put it another way: can we modify your patchset > > > so that the mount option remains, and continues to have a sufficiently > > > same effect? For backward compatibility. > > > > > > > The mount option would exist solely to set the oom policy of the root mem > > cgroup, it would lose its effect of mandating that policy for any subtree > > since it would become configurable by the user if delegated. > > Why can't we propagate the mount option into the subtrees? > > If the user then alters that behaviour with new added-by-David tunables > then fine, that's still backward compatible. > It's not, if you look for the "groupoom" mount option it will specify two different things: the entire hierarchy is locked into a single per-cgroup usage comparison (Roman's original patchset), and entire hierarchy had an initial oom policy set which could have subsequently changed (my extension). With memory.oom_policy you need to query what the effective policy is, checking for "groupoom" is entirely irrelevant, it was only the initial setting. Thus, if memory.oom_policy is going to be merged in the future, it necessarily obsoletes the mount option. It would depend on the kernel version to determine its meaning. I'm struggling to see the benefit of simply not reviewing patches that build off the original and merging a patchset early. What are we gaining? > > Let me put it another way: if the cgroup aware oom killer is merged for > > 4.16 without this patchset, userspace can reasonably infer the oom policy > > from checking how cgroups were mounted. If my followup patchset were > > merged for 4.17, that's invalid and it becomes dependent on kernel > > version: it could have the "groupoom" mount option but configured through > > the root mem cgroup's memory.oom_policy to not be cgroup aware at all. > > That concern seems unreasonable to me. Is an application *really* > going to peek at the mount options to figure out what its present oom > policy is? Well, maybe. But that's a pretty dopey thing to do and I > wouldn't lose much sleep over breaking any such application in the very > unlikely case that such a thing was developed in that two-month window. > It's not dopey, it's the only way that any userspace can determine what process is going to be oom killed! That policy will dictate how the cgroup hierarchy is configured without my extension, there's no other way to prefer or bias processes. How can a userspace cgroup manager possibly construct a cgroup v2 hierarchy with expected oom kill behavior if it is not peeking at the mount option? My concern is that if extended with my patchset the mount option itself becomes obsolete and then peeking at it is irrelevant to the runtime behavior! > > > There's nothing wrong with that! As long as we don't break existing > > > setups while evolving the feature. How do we do that? > > > > We'd break the setups that actually configure their cgroups and processes > > to abide by the current implementation since we'd need to discount > > oom_score_adj from the the root mem cgroup usage to fix it. > > Am having trouble understanding that. Expand, please? > > Can we address this (and other such) issues in the (interim) > documentation? > This point isn't a documentation issue at all, this is the fact that oom_score_adj is only effective for the root mem cgroup. If the user is fully aware of the implementation, it does not change the fact that he or she will construct their cgroup hierarchy and attach processes to it to abide by the behavior. That is the breakage that I am concerned about. An example: you have a log scraper that is running with /proc/pid/oom_score_adj == 999. It's best effort, it can be killed, we'll retry the next time if the system has memory available. This is partly why oom_adj and oom_score_adj exist and is used on production systems. If you attach that process to an unlimited mem cgroup dedicated to system daemons purely for the rich stats that mem cgroup provides, this breaks the oom_score_adj setting solely because it's attached to the cgroup. On system-wide oom, it is no longer the killed process merely because it is attached to an unlimited child cgroup. This is not the only such example: this occurs for any process attached to a cgroup. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html