On Fri, 13 Oct 2017, Roman Gushchin wrote:

> > Think about it in a different way: we currently compare per-process usage
> > and userspace has /proc/pid/oom_score_adj to adjust that usage depending
> > on priorities of that process and still oom kill if there's a memory leak.
> > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer
> > after all. We don't need a strict memory.oom_priority that outranks all
> > other sibling cgroups regardless of usage. We need a memory.oom_score_adj
> > to adjust the per-cgroup usage. The decisionmaking in your earlier
> > example would be under the control of C/memory.oom_score_adj and
> > D/memory.oom_score_adj. Problem solved.
> >
> > It also solves the problem of userspace being able to influence oom victim
> > selection so now they can protect important cgroups just like we can
> > protect important processes today.
> >
> > And since this would be hierarchical usage, you can trivially infer root
> > mem cgroup usage by subtraction of top-level mem cgroup usage.
> >
> > This is a powerful solution to the problem and gives userspace the control
> > they need so that it can work in all usecases, not a subset of usecases.
>
> You're right that per-cgroup oom_score_adj may resolve the issue with
> too strict semantics of oom_priorities. But I believe nobody likes
> the existing per-process oom_score_adj interface, and there are reasons behind.

The previous heuristic, before I rewrote the oom killer, used
/proc/pid/oom_adj, which acted as a bitshift on mm->total_vm. That was a
much more difficult interface to use, as I'm sure you can imagine. People
ended up only using it to polarize selection: -17 to oom disable a
process, -16 to bias against selecting it, or 15 to prefer it. Nobody used
anything in between, and I worked with openssh, udev, kde, and chromium to
get a consensus on the oom_score_adj semantics.
People do use it to protect against memory leaks and to prevent oom
killing important processes when something else can be sacrificed, unless
there's a leak.

> Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj
> will work is not trivial.

I suggest defining it in the terms used for previous iterations of the
patchset: do hierarchical scoring so that each level of the hierarchy has
usage information for each subtree. You can get root mem cgroup usage with
complete fairness by subtraction with this method. When comparing usage at
each level of the hierarchy, you can propagate the eligibility of
processes in that subtree much like you do today.

I agree with your change to make the oom killer a no-op if selection races
with the actual killing rather than falling back to the old heuristic. I'm
happy to help add a Tested-by once we settle the other issues with that
change.

At each level, I would state that memory.oom_score_adj has the exact same
semantics as /proc/pid/oom_score_adj. In this case, it would simply be
defined as a proportion of the parent's limit. If the hierarchy is
iterated starting at the root mem cgroup for system ooms and at the root
of the oom memcg for memcg ooms, this should lead to the exact same oom
killing behavior, which is desired.

This solution would address the three concerns that I had:

 - it allows the root mem cgroup to be compared fairly with leaf mem
   cgroups (with the bonus of not requiring root mem cgroup accounting
   thanks to your heuristic using global vmstats),

 - it allows userspace to influence the decisionmaking so that users can
   protect cgroups that use 50% of memory because they are important, and

 - it completely avoids users being able to change victim selection simply
   by creating child mem cgroups.

This would be a very powerful patchset.
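
To make the proposed semantics concrete, here is a rough userspace model of the selection walk, written in Python for brevity rather than as kernel code. It is only a sketch under stated assumptions. The Cgroup class, score() and select_victim() are hypothetical names. memory.oom_score_adj is the interface proposed in this thread, not an existing file. The adjustment is taken to be in units of 1/1000 of the parent's limit, mirroring how /proc/pid/oom_score_adj scales against total pages, with -1000 disabling selection. A cgroup's own (non-child) usage is obtained by subtraction, as suggested above for the root mem cgroup.

```python
from dataclasses import dataclass, field
from typing import List

OOM_SCORE_ADJ_MIN = -1000  # same lower bound as /proc/pid/oom_score_adj


@dataclass
class Cgroup:
    """Hypothetical model of a mem cgroup; all values are in pages."""
    name: str
    usage: int              # hierarchical usage (includes all descendants)
    limit: int              # this cgroup's limit (total RAM for the root)
    oom_score_adj: int = 0  # proposed memory.oom_score_adj, -1000..1000
    children: List["Cgroup"] = field(default_factory=list)


def score(cg: Cgroup, parent_limit: int) -> int:
    """Usage adjusted by a proportion of the parent's limit (assumed rule)."""
    if cg.oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0  # oom disabled: never preferred over anything
    points = cg.usage + cg.oom_score_adj * parent_limit // 1000
    return max(points, 1)  # stay eligible even if adjusted below zero


def select_victim(cg: Cgroup) -> Cgroup:
    """Walk down from cg, taking the highest-scoring entry at each level."""
    while True:
        # Usage charged to cg itself, inferred by subtracting child usage.
        own = cg.usage - sum(c.usage for c in cg.children)
        best, best_points = None, own  # None means "cg's own processes"
        for child in cg.children:
            points = score(child, cg.limit)
            if points > best_points:
                best, best_points = child, points
        if best is None:
            return cg
        cg = best


# An important cgroup A uses 50% of memory but is protected by its adj.
a = Cgroup("A", usage=500, limit=1000, oom_score_adj=-500)
b = Cgroup("B", usage=300, limit=1000)
root = Cgroup("/", usage=900, limit=1000, children=[a, b])
print(select_victim(root).name)
```

Because comparison at each level uses hierarchical usage, splitting B into child cgroups does not lower B's score against its siblings. The children only matter once the walk has already descended into B.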