On Wed, 14 Mar 2018, Roman Gushchin wrote: > > The cgroup aware oom killer is needlessly enforced for the entire system > > by a mount option. It's unnecessary to force the system into a single > > oom policy: either cgroup aware, or the traditional process aware. > > Can you, please, provide a real-life example, when using per-process > and cgroup-aware OOM killer depending on OOM scope is beneficial? > Hi Roman, This question is only about per-process vs cgroup-aware, not about the need for individual cgroup vs hierarchical subtree, so I'll only focus on that. Three reasons: - Does not lock the entire system into a single methodology. Users working in a subtree can default to what they are used to: per-process oom selection even though their subtree might be targeted by a system policy level decision at the root. This allow them flexibility to organize their subtree intuitively for use with other controllers in a single hierarchy. The real-world example is a user who currently organizes their subtree for this purpose and has defined oom_score_adj appropriately and now regresses if the admin mounts with the needless "groupoom" option. - Allows changing the oom policy at runtime without remounting the entire cgroup fs. Depending on how cgroups are going to be used, per-process vs cgroup-aware may be mandated separately. This is a trait only of the mem cgroup controller, the root level oom policy is no different from the subtree and depends directly on how the subtree is organized. If other controllers are already being used, requiring a remount to change the system-wide oom policy is an unnecessary burden. The real-world example is systems software that either supports user subtrees or strictly subtrees that it maintains itself. While other controllers are used, the mem cgroup oom policy can be changed at runtime rather than requiring a remount and reorganizing other controllers exactly as before. - Can be extended to cgroup v1 if necessary. There is no need for a new cgroup v1 mount option and mem cgroup oom selection is not dependant on any functionality provided by cgroup v2. The policies introduced here work exactly the same if used with cgroup v1. The real-world example is a cgroup configuration that hasn't had the ability to move to cgroup v2 yet and still would like to use cgroup-aware oom selection with a very trivial change to add the memory.oom_policy file to the cgroup v1 filesystem. > It might be quite confusing, depending on configuration. > From inside a container you can have different types of OOMs, > depending on parent's cgroup configuration, which is not even > accessible for reading from inside. > Right, and the oom is the result of the parent's limit that is outside the control of the user. That limit, and now oom policy, is defined by the user controlling the ancestor of the subtree. The user need not be concerned that it was singled out for oom kill: that policy decision is outside hiso or her control. memory.oom_group can certainly be delegated to the user, but the targeting cannot be changed or evaded. However, this patchset also provides them with the ability to define their own oom policy for subcontainers that they create themselves. > Also, it's probably good to have an interface to show which policies > are available. > This comes back to the user interface question. I'm very happy to address any way that the interface can be made better, even though I think what is currently proposed is satisfactory. I think your comment eludes to thp like enabling where we have "[always] madvise never"? I'm speculating that you may be happier with memory.oom_policy becoming "[none] cgroup tree" and extended for additional policies later? Otherwise the user would need to try the write and test the return value, which purely depends on whether the policy is available or not. I'm rather indifferent to either interface, but if you would prefer the "[none] cgroup tree" appearance, I'll change to that.