On Fri, 26 Jan 2018, Andrew Morton wrote: > > -ECONFUSED. We want to have a mount option that has the sole purpose of > > doing echo cgroup > /mnt/cgroup/memory.oom_policy? > > Approximately. Let me put it another way: can we modify your patchset > so that the mount option remains, and continues to have a sufficiently > same effect? For backward compatibility. > The mount option would exist solely to set the oom policy of the root mem cgroup, it would lose its effect of mandating that policy for any subtree since it would become configurable by the user if delegated. Let me put it another way: if the cgroup aware oom killer is merged for 4.16 without this patchset, userspace can reasonably infer the oom policy from checking how cgroups were mounted. If my followup patchset were merged for 4.17, that's invalid and it becomes dependent on kernel version: it could have the "groupoom" mount option but configured through the root mem cgroup's memory.oom_policy to not be cgroup aware at all. That inconsistency, to me, is unfair to burden userspace with. > > This, and fixes to fairly compare the root mem cgroup with leaf mem > > cgroups, are essential before the feature is merged otherwise it yields > > wildly unpredictable (and unexpected, since its interaction with > > oom_score_adj isn't documented) results as I already demonstrated where > > cgroups with 1GB of usage are killed instead of 6GB workers outside of > > that subtree. > > OK, so Roman's new feature is incomplete: it satisfies some use cases > but not others. And we kinda have a plan to address the other use > cases in the future. > Those use cases are also undocumented such that the user doesn't know the behavior they are opting into. Nowhere in the patchset does it mention anything about oom_score_adj other than being oom disabled. It doesn't mention that a per-process tunable now depends strictly on whether it is attached to root or not. It specifies a fair comparison between the root mem cgroup and leaf mem cgroups, which is obviously incorrect by the implementation itself. So I'm not sure the user would know which use cases it is valid for, which is why I've been trying to make it generally purposeful and documented. > There's nothing wrong with that! As long as we don't break existing > setups while evolving the feature. How do we do that? > We'd break the setups that actually configure their cgroups and processes to abide by the current implementation since we'd need to discount oom_score_adj from the the root mem cgroup usage to fix it. There hasn't been any feedback on v2 of my patchset that would suggest changes are needed. I think we all recognize the shortcoming that it is addressing. The only feedback on v1, the need for memory.oom_group as a separate tunable, has been addressed in v2. What are we waiting for? -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html