On Tue, 5 Jun 2018, Michal Hocko wrote:

> 1) comparision root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>

There are users (us) who want to use the feature and not all processes are
attached to leaf mem cgroups.  The functionality can be provided in a
generally useful way that doesn't require any specific hierarchy, and I
implemented this in my patch series at
https://marc.info/?l=linux-mm&m=152175563004458&w=2.  That proposal to fix
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm,
in its entirety, only extends it so that it is generally useful and does
not remove any functionality.  I'm not sure why we are discussing ways of
moving forward when that patchset has been waiting for review for almost
four months and, to date, I haven't seen an objection to it.

I don't know why we cannot agree on making solutions generally useful nor
why that patchset has not been merged into -mm so that the whole feature
can be merged.  It's baffling.  This is the first time I've encountered a
perceived stalemate when there is a patchset sitting, unreviewed, that
fixes all of the concerns that there are about the implementation sitting
in -mm.

This isn't just a criticism of comparing processes attached to root
differently than leaf mem cgroups; it's also about how oom_score_adj
influences that decision.  It's trivial for a very small consumer (not a
"significant" memory consumer, as you put it) to require an oom kill from
root instead of a leaf mem cgroup.
I show in https://marc.info/?l=linux-mm&m=152175564104468&w=2 that, after
changing its oom_score_adj, my bash shell attached to the root mem cgroup
is considered equal to a 95GB leaf mem cgroup with the current
implementation.

> Is this is fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>

I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from
what I remember Roman actually liked it.

> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
>
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can setup group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>

Processes being able to fork to split up memory consumption is also fixed
by https://marc.info/?l=linux-mm&m=152175564104467, just as creating
subcontainers to intentionally or unintentionally subvert the oom policy
is fixed.  It solves both problems.

> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true but this is something that needs a good evaluation to not end
> up in the fiasko we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization so blocking the whole
> feature seems unreasonable.
>

One objection here is how the oom_score_adj of a process means something
or doesn't mean something depending on what cgroup it is attached to.
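
As a rough illustration of that dependence, here is a minimal sketch of
the arithmetic.  It assumes processes in the root mem cgroup are scored
with the per-process heuristic (usage plus oom_score_adj scaled by total
pages / 1000) while a leaf mem cgroup is scored purely by its usage.  The
function names, the 100GiB machine size, the 8MiB shell, and the
oom_score_adj value of 950 are hypothetical and are not taken from the
linked messages; this is not the kernel code.

```python
# Hypothetical sketch, not kernel code: names and numbers are illustrative.
PAGE_SIZE = 4096  # assumed 4 KiB pages
GIB = 1 << 30
MIB = 1 << 20


def process_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Approximate per-process heuristic: usage in pages plus oom_score_adj
    scaled by total_pages / 1000, clamped to a minimum of 1.
    The special case for the minimum oom_score_adj is ignored here."""
    points = rss_pages + swap_pages + pgtable_pages
    points += oom_score_adj * (total_pages // 1000)
    return max(points, 1)


def leaf_cgroup_score(usage_pages):
    """Assumed leaf mem cgroup score: its usage in pages.
    The oom_score_adj of its processes plays no part."""
    return usage_pages


total_pages = 100 * GIB // PAGE_SIZE   # assumed 100 GiB machine
shell_rss_pages = 8 * MIB // PAGE_SIZE  # assumed 8 MiB shell attached to root

# Small shell in the root mem cgroup with a raised oom_score_adj.
root_score = process_badness(shell_rss_pages, 0, 0, 950, total_pages)
# Large leaf mem cgroup using 95 GiB.
leaf_score = leaf_cgroup_score(95 * GIB // PAGE_SIZE)

print(root_score, leaf_score, root_score > leaf_score)
```

With these assumed numbers the small shell in root outscores the 95GiB
leaf mem cgroup, while the same oom_score_adj set on a process inside a
leaf would not change the leaf's score at all.
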
The cgroup-aware oom killer is cgroup aware.  oom_score_adj should play no
part.  I fixed this with https://marc.info/?t=152175564500007&r=1&w=2.

The other objection is that users do have cgroups that shouldn't be oom
killed because they are important, either because they are required to
provide a service for a smaller cgroup or because of business goals.  We
have cgroups that use more than half of system memory and they are allowed
to do so because they are important.  We would love to be able to bias
against selecting that cgroup so that others are preferred, or to prefer
cgroups for oom kill because they are less important.  This was done for
processes with oom_score_adj; we need it for a cgroup-aware oom killer for
the same reason.

But notice that even in
https://marc.info/?l=linux-mm&m=152175563004458&w=2 I said priority or
adjustment can be added on top of the feature after it's merged.  This
itself is not precluding anything from being merged.

> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic and softening it will be possible by a adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.

I've always said that the mechanism and policy in this patchset should be
separated.  I do that exact thing in
https://marc.info/?l=linux-mm&m=152175564304469&w=2.  I suggest that
different subtrees will want (or the admin will require) different
behaviors with regard to the mechanism.

I've stated the problems (and there are others wrt mempolicy selection)
that the current implementation has and given a full solution at
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been
reviewed.  I would love feedback from anybody on this thread on that.
I'm not trying to preclude the cgroup-aware oom killer from being merged;
I'm the only person actively trying to get it merged.

Thanks.