[Sorry for the late reply, but I was mostly offline for the last 2 weeks]

On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
[...]
> >Your cover letter mentions that then "all processes in the cgroup as a
> >whole". That to me reads as oom.group oom killer policy. But a brief
> >look into the patch suggests you are still looking at specific tasks and
> >this has been a concern in the previous version of the patch because
> >memcg accounting and per-process accounting are detached.
>
> I think the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive, similar to active page cache, which
> also increases the probability of OOM-kill. In the new patch, all the
> shared memory will also consume the oom_protect quota of the memcg,
> and the process's oom_protect quota of the memcg will decrease.

I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?

[...]

> >> In the final discussion of patch v2, we discussed that although the adjustment range
> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new
> >> indicator oom_kill_inherit, which counts the number of times the local and child
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
>
> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> with memory.oom.protect, so we have a ruler to measure the optimal value of
> memory.oom.protect.

An example might help here.

> >> about the semantics of non-leaf memcgs protection,
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with the oom.group.
>
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of
> sub-groups oom_protect quota, the oom_protect quota of each sub-group will
> be proportionally reduced
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum
> of sub-groups oom_protect quota, the oom_protect quota of each sub-group
> will be proportionally increased
> The purpose of doing so is that users can set oom_protect quota according to
> their own needs, and the system management process can set appropriate
> oom_protect quota on the parent non-leaf memcg as the final cover, so that
> the system management process can indirectly manage all user processes.

I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine, well, in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly but it might be a suitable target to kill a multi
memcg based workload (e.g. a full container).

> >All that being said and with the usecase described more specifically. I
> >can see that memcg based oom victim selection makes some sense. That
> >means that it is always a memcg selected and all tasks within killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is to how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously. Essentially
> >we would need to enforce the oom selection to be memcg based for all
> >memcgs. Maybe a mount knob? What do you think?
>
> There is a function in the patch to determine whether the oom_protect
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> <is_root_oom_protect> returns 0 by default.

How can an admin determine what the current oom detection logic is?
-- 
Michal Hocko
SUSE Labs
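
For reference, the proportional scaling described in the quoted text above (a sub-group's quota is reduced or increased so that the sub-groups together match the parent's quota) can be sketched roughly as follows. This is only an illustration of my reading of that description, not the code from the patch: the function name `effective_oom_protect`, the integer units and the rounding are all assumptions.

```python
def effective_oom_protect(parent_quota, child_quotas):
    """Hypothetical sketch: scale each sub-group's requested oom_protect
    quota so that the effective quotas add up to the parent's quota.

    parent_quota -- quota set on the non-leaf memcg (arbitrary units)
    child_quotas -- dict mapping sub-group name to its requested quota
    """
    total = sum(child_quotas.values())
    if total == 0:
        # Nothing requested below the parent: no sub-group is protected.
        return {name: 0 for name in child_quotas}
    # Each sub-group keeps its share of the total, rescaled to the parent.
    # Integer division is an assumption; the patch may round differently.
    return {
        name: parent_quota * quota // total
        for name, quota in child_quotas.items()
    }


# Parent quota smaller than the sum of requests: sub-groups are scaled down.
scaled_down = effective_oom_protect(100, {"a": 150, "b": 50})

# Parent quota larger than the sum of requests: sub-groups are scaled up.
scaled_up = effective_oom_protect(400, {"a": 150, "b": 50})
```

Note that this only covers how quota is distributed down the hierarchy. It says nothing about when a non-leaf memcg itself would be selected as the victim, which is the question above.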