On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > - Why we need a per memcg oom_score_adj setting ?
> > > This is easy to deploy and very convenient for container.
> > > When we use container, we always treat memcg as a whole, if we have a per
> > > memcg oom_score_adj setting we don't need to set it process by process.
> >
> > Why cannot an initial process in the cgroup set the oom_score_adj and
> > other processes just inherit it from there? This sounds trivial to do
> > with a startup script.
> >
>
> That is what we used to do before.
> But it can't apply to the running containers.
>
>
> > > It will make the user exhausted to set it to all processes in a memcg.
> >
> > Then let's have scripts to set it as they are less prone to exhaustion
> > ;)
>
> That is not easy to deploy it to the production environment.

What is hard about a simple loop over the task list exported by the
cgroup, applying a value to oom_score_adj?

[...]

> > Besides that. What is the hierarchical semantic? Say you have hierarchy
> > A (oom_score_adj = 1000)
> >  \
> >   B (oom_score_adj = 500)
> >    \
> >     C (oom_score_adj = -1000)
> >
> > put the above summing up aside for now and just focus on the memcg
> > adjusting?
>
> I think that there's no conflict between children's oom_score_adj,
> that is different with memory.max.
> So it is not necessary to consider the parent's oom_score_adj.

Each exported cgroup tuning _has_ to be hierarchical so that an admin
can override children setting in order to safely delegate the
configuration.

Last but not least, oom_score_adj has proven to be a terrible interface
that is essentially close to unusable to anything outside of extreme
values (-1000 and very arguably 1000). Making it cgroup aware without
changing oom victim selection to consider cgroup as a whole will also be
a pain so I am afraid that this is a dead end path.

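
For illustration, the loop over the cgroup's task list mentioned above
could be sketched roughly as follows. This is only a minimal Python
sketch and not something posted in this thread: it assumes the cgroup
exports its processes through a cgroup.procs file, and the function
name and the proc_root parameter are hypothetical, added so the sketch
can be tried against a scratch directory.

```python
import os


def set_cgroup_oom_score_adj(cgroup_dir, value, proc_root="/proc"):
    """Write `value` to oom_score_adj of every process in cgroup.procs.

    Hypothetical helper for illustration. Returns the PIDs updated.
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")

    # One PID per line; read the whole list up front.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as procs:
        pids = [int(line) for line in procs if line.strip()]

    updated = []
    for pid in pids:
        path = os.path.join(proc_root, str(pid), "oom_score_adj")
        try:
            with open(path, "w") as adj:
                adj.write(str(value))
        except (FileNotFoundError, ProcessLookupError):
            # The process exited between reading cgroup.procs and the write.
            continue
        updated.append(pid)
    return updated
```

A call such as set_cgroup_oom_score_adj("/sys/fs/cgroup/mycontainer", 500)
(the path is made up) would cover an already running container. Children
inherit the value on fork, but a task forked by a not yet updated parent
while the loop runs can be missed, so running the loop a second time is a
cheap safeguard.
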
We can discuss cgroup aware oom victim selection for sure and there are
certainly reasonable usecases to back that functionality. Please refer
to the discussion from 2017/2018 (dubbed "cgroup-aware OOM killer"). But
be warned this is a tricky area and there was a fundamental disagreement
on how things should be classified without a clear way to reach
consensus. What we have right now is the only agreement we could reach.
It is quite possible that any more clever cgroup aware oom selection
has to be implemented in userspace with an understanding of the
specific workload.
-- 
Michal Hocko
SUSE Labs