On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > [...]
> > > > I'm not against this model, as I've said before. It feels logical,
> > > > and will work fine in most cases.
> > > >
> > > > In this case we can drop any mount/boot options, because it preserves
> > > > the existing behavior in the default configuration. A big advantage.
> > >
> > > I am not sure about this. We still need an opt-in, regardless, because
> > > selecting the largest process from the largest memcg != selecting the
> > > largest task (just consider memcgs with many processes example).
> >
> > As I understand Johannes, he suggested to compare individual processes with
> > group_oom mem cgroups. In other words, always select a killable entity with
> > the biggest memory footprint.
> >
> > This is slightly different from my v8 approach, where I treat leaf memcgs
> > as indivisible memory consumers independent on group_oom setting, so
> > by default I'm selecting the biggest task in the biggest memcg.
>
> My reading is that he is actually proposing the same thing I've been
> mentioning. Simply select the biggest killable entity (leaf memcg or
> group_oom hierarchy) and either kill the largest task in that entity
> (for !group_oom) or the whole memcg/hierarchy otherwise.

He wrote the following:
"So I'm leaning toward the second model: compare all oomgroups and
standalone tasks in the system with each other, independent of the
failed hierarchical control structure. Then kill the biggest of them."
>
> > While the approach suggested by Johannes looks clear and reasonable,
> > I'm slightly concerned about possible implementation issues,
> > which I've described below:
> >
> > >
> > > > The only thing, I'm slightly concerned, that due to the way how we calculate
> > > > the memory footprint for tasks and memory cgroups, we will have a number
> > > > of weird edge cases. For instance, when putting a single process into
> > > > the group_oom memcg will alter the oom_score significantly and result
> > > > in significantly different chances to be killed. An obvious example will
> > > > be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> > > > value, but it can also happen in case of constrained alloc, for instance.
> > >
> > > I am not sure I understand. Are you talking about root memcg comparing
> > > to other memcgs?
> >
> > Not only, but root memcg in this case will be another complication. We can
> > also use the same trick for all memcg (define memcg oom_score as maximum oom_score
> > of the belonging tasks), it will turn group_oom into pure container cleanup
> > solution, without changing victim selection algorithm
>
> I fail to see the problem to be honest. Simply evaluate the memcg_score
> you have so far with one minor detail. You only check memcgs which have
> tasks (rather than check for leaf node check) or it is group_oom. An
> intermediate memcg will get a cumulative size of the whole subhierarchy
> and then you know you can skip the subtree because any subtree can be larger.
>
> > But, again, I'm not against approach suggested by Johannes. I think that overall
> > it's the best possible semantics, if we're not taking some implementation details
> > into account.
>
> I do not see those implementation details issues and let me repeat do
> not develop a semantic based on implementation details.
There are no problems in "select the biggest leaf or group_oom memcg, then
kill the biggest task or all tasks depending on group_oom" approach,
which you're describing. Comparing tasks and memcgs (what Johannes is suggesting)
may have some issues.

Thanks!
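
The two victim-selection models discussed in this thread can be contrasted with a small user-space sketch. This is only an illustration under loud assumptions, not kernel code: the `Task` and `Memcg` classes, the `rss` field, and both function names are hypothetical, the hierarchy is flat (leaf memcgs only), and oom_score_adj, the root memcg and constrained allocations are ignored entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Task:
    name: str
    rss: int  # hypothetical memory footprint, e.g. in pages


@dataclass
class Memcg:
    name: str
    group_oom: bool = False
    tasks: List[Task] = field(default_factory=list)

    def footprint(self) -> int:
        # Assumption: a memcg's size is simply the sum of its tasks.
        return sum(t.rss for t in self.tasks)


def victims_memcg_first(memcgs: List[Memcg]) -> List[Task]:
    """Select the biggest memcg that has tasks, then kill either its
    biggest task (!group_oom) or all of its tasks (group_oom)."""
    candidates = [m for m in memcgs if m.tasks]
    if not candidates:
        return []
    biggest = max(candidates, key=Memcg.footprint)
    if biggest.group_oom:
        return list(biggest.tasks)
    return [max(biggest.tasks, key=lambda t: t.rss)]


def victims_flat_compare(memcgs: List[Memcg]) -> List[Task]:
    """Compare group_oom memcgs and standalone tasks with each other
    and kill the single biggest entity."""
    entities: List[Tuple[int, List[Task]]] = []
    for m in memcgs:
        if not m.tasks:
            continue
        if m.group_oom:
            # The whole group is one indivisible killable entity.
            entities.append((m.footprint(), list(m.tasks)))
        else:
            # Each task competes on its own.
            entities.extend((t.rss, [t]) for t in m.tasks)
    if not entities:
        return []
    return max(entities, key=lambda e: e[0])[1]


if __name__ == "__main__":
    many = Memcg("many", tasks=[Task(f"w{i}", 100) for i in range(10)])
    single = Memcg("single", tasks=[Task("big", 500)])
    print([t.name for t in victims_memcg_first([many, single])])
    print([t.name for t in victims_flat_compare([many, single])])
```

With ten 100-page tasks in one non-group_oom memcg and a single 500-page task in another, the first function picks a task from the larger memcg while the second picks the 500-page task, which is the "memcgs with many processes" divergence mentioned above. Once group_oom is set on the larger memcg, both functions return all of its tasks.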