On Thu, 21 Sep 2017, Johannes Weiner wrote: > > The issue is that if you opt-in to the new feature, then you are forced to > > change /proc/pid/oom_score_adj of all processes attached to a cgroup that > > you do not want oom killed based on size to be oom disabled. > > You're assuming that most people would want to influence the oom > behavior in the first place. I think the opposite is the case: most > people don't care as long as the OOM killer takes the intent the user > has expressed wrt runtime containerization/grouping into account. > If you do not want to influence the oom behavior, do not change memory.oom_priority from its default. It's that simple. > > The kernel provides no other remedy without oom priorities since the > > new feature would otherwise disregard oom_score_adj. > > As of v8, it respects this setting and doesn't kill min score tasks. > That's the issue. To protect a memory cgroup from being oom killed in a system oom condition, you need to change oom_score_adj of *all* processes attached to be oom disabled. Then, you have a huge problem in memory cgroup oom conditions because nothing can be killed in that hierarchy itself. > > The patchset compares memory cgroup size relative to sibling cgroups only, > > the same comparison for memory.oom_priority. There is a guarantee > > provided on how cgroup size is compared in select_victim_memcg(), it > > hierarchically accumulates the "size" from leaf nodes up to the root memcg > > and then iterates the tree comparing sizes between sibling cgroups to > > choose a victim memcg. That algorithm could be more elaborately described > > in the documentation, but we simply cannot change the implementation of > > select_victim_memcg() later even without oom priorities since users cannot > > get inconsistent results after opting into a feature between kernel > > versions. I believe the selection criteria should be implemented to be > > deterministic, as select_victim_memcg() does, and the documentation should > > fully describe what the selection criteria is, and then allow the user to > > decide. > > I wholeheartedly disagree. We have changed the behavior multiple times > in the past. In fact, you have arguably done the most drastic changes > to the algorithm since the OOM killer was first introduced. E.g. > > a63d83f427fb oom: badness heuristic rewrite > > And that's completely fine. Because this thing is not a resource > management tool for userspace, it's the kernel saving itself. At best > in a manner that's not too surprising to userspace. > When I did that, I had to add /proc/pid/oom_score_adj to allow userspace to influence selection. We came up with /proc/pid/oom_score_adj when working with kde, openssh, chromium, and udev because they cared about the ability to influence the decisionmaking. I'm perfectly happy with the new heuristic presented in this patchset, I simply want userspace to be able to influence it, if it desires. Requiring userspace to set all processes to be oom disabled to protect a hierarchy is totally and completely broken. It livelocks the memory cgroup if it is oom itself. > To me, your argument behind the NAK still boils down to "this doesn't > support my highly specialized usecase." But since it doesn't prohibit > your usecase - which isn't even supported upstream, btw - this really > doesn't carry much weight. > > I'd say if you want configurability on top of Roman's code, please > submit patches and push the case for these in a separate effort. > Roman implemented memory.oom_priority himself, it has my Tested-by, and it allows users who want to protect high priority memory cgroups from using the size based comparison for all other cgroups that we very much desire. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html