On Thu, 21 Sep 2017, Johannes Weiner wrote: > That's a ridiculous nak. > > The fact that this patch series doesn't solve your particular problem > is not a technical argument to *reject* somebody else's work to solve > a different problem. It's not a regression when behavior is completely > unchanged unless you explicitly opt into a new functionality. > > So let's stay reasonable here. > The issue is that if you opt-in to the new feature, then you are forced to change /proc/pid/oom_score_adj of all processes attached to a cgroup that you do not want oom killed based on size to be oom disabled. The kernel provides no other remedy without oom priorities since the new feature would otherwise disregard oom_score_adj. In that case, userspace is racing in two ways: (1) attach of process to a memcg you want to protect from oom kill (first class, vital, large memory hog job) to set to oom disable and (2) adjustment of other cgroups to make them eligible after first oom kill. It doesn't have anything to do with my particular usecase, but rather the ability of userspace to influence the decisions of the kernel. Previous to this patchset, when selection is done based on process size, userspace has full control over selection. After this patchset, userspace has no control other than setting all processes to be oom disabled if the largest memory consumer is to be protected. Roman's memory.oom_priority provides a perfect solution for userspace to be able to influence this decision making and causes no change in behavior for users who choose not to tune memory.oom_priority. The nack originates from the general need for userspace influence over oom victim selection and to avoid userspace needing to take the rather drastic measure of setting all processes to be oom disabled to prevent oom kill in kernels before oom priorities are introduced. > The patch series has merit as it currently stands. It makes OOM > killing in a cgrouped system fairer and less surprising. Whether you > have the ability to influence this in a new way is an entirely > separate discussion. It's one that involves ABI and user guarantees. > > Right now Roman's patches make no guarantees on how the cgroup tree is > descended. But once we define an interface for prioritization, it > locks the victim algorithm into place to a certain extent. > The patchset compares memory cgroup size relative to sibling cgroups only, the same comparison for memory.oom_priority. There is a guarantee provided on how cgroup size is compared in select_victim_memcg(), it hierarchically accumulates the "size" from leaf nodes up to the root memcg and then iterates the tree comparing sizes between sibling cgroups to choose a victim memcg. That algorithm could be more elaborately described in the documentation, but we simply cannot change the implementation of select_victim_memcg() later even without oom priorities since users cannot get inconsistent results after opting into a feature between kernel versions. I believe the selection criteria should be implemented to be deterministic, as select_victim_memcg() does, and the documentation should fully describe what the selection criteria is, and then allow the user to decide. > It also involves a discussion about how much control userspace should > have over OOM killing in the first place. It's a last-minute effort to > save the kernel from deadlocking on memory. Whether that is the time > and place to have userspace make clever resource management decisions > is an entirely different thing than what Roman is doing. > > But this patch series doesn't prevent any such future discussion and > implementations, and it's not useless without it. So let's not > conflate these two things, and hold the priority patch for now. > Roman is planning on introducing memory.oom_priority back into the patchset per https://marc.info/?l=linux-kernel&m=150574701126877 and I agree with the very clear semantic that it introduces: to have the size-based comparison use the same rules as the userspace priority comparison. It's very powerful and I'm happy to ack the final version that he plans on posting. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html