At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>>
>> Here is a more detailed comparison and introduction of the old oom_score_adj
>> mechanism and the new oom_protect mechanism,
>> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
>> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
>> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process and uniformly set the
>> oom_score_adj of some important processes, but it is not a simple matter to make
>> multi-level settings for tens of thousands of processes on the physical machine
>> through the lightweight parent processes. We may need a huge table to record the
>> value of oom_score_adj maintained by all lightweight parent processes, and the
>> user process limited by the parent process has no ability to change its own
>> oom_score_adj, because it does not know the details of the huge table. The new
>> patch adopts the cgroup mechanism.
>> It does not need any parent process to manage
>> oom_score_adj. The settings between each memcg are independent of each other,
>> making it easier to plan the OOM order of all processes. Due to the unique nature
>> of memory resources, current Service cloud vendors are not oversold in memory
>> planning. I would like to use the new patch to try to achieve the possibility of
>> oversold memory resources.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extent that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?

Yes, that is correct.

>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.

I think memcg accounting may be more reasonable, as its memory statistics
are more comprehensive; for example, they include active page cache, which
also increases the probability of an OOM kill. In the new patch, all shared
memory also consumes the oom_protect quota of the memcg, which reduces the
oom_protect quota available to the processes in that memcg.

>> 3. I conducted a test and deployed an excessive number of containers on a physical
>> machine, By setting the oom_score_adj value of all processes in the container to
>> a positive number through dockerinit, even processes that occupy very little memory
>> in the container are easily killed, resulting in a large number of invalid kill behaviors.
>> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the
>> container will rebuild, resulting in more severe memory oscillations. The new patch
>> abandons the behavior of adding an equal amount of oom_score_adj to each process
>> in the container and adopts a shared oom_protect quota for all processes in the container.
>> If a process in the container is killed, the remaining other processes will receive more
>> oom_protect quota, making it more difficult for the remaining processes to be killed.
>> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
>> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only
>> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits
>> downwards, and user can only change the kill order of its own memcg oom, but the
>> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we discussed that although the adjustment range
>> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
>> clumsy at best. In order to solve this problem in the new patch, I introduced a new
>> indicator oom_kill_inherit, which counts the number of times the local and child
>> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?

The new indicator oom_kill_inherit is negatively correlated with
memory.oom.protect, so it gives us a ruler for measuring the optimal
value of memory.oom.protect.
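
For reference, the coarse step size of oom_score_adj mentioned in point 1
and in the v2 discussion quoted above follows from how oom_badness() in
mm/oom_kill.c scales the value: it is multiplied by totalpages / 1000. Below
is a rough userspace sketch of that arithmetic. The helper name is made up
for illustration, and swap is ignored (the kernel's totalpages also counts it).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helper, not kernel code: how many bytes of "badness" one
 * unit of oom_score_adj is worth on a machine with total_bytes of RAM.
 * Mirrors the kernel's integer scaling by totalpages / 1000; swap ignored.
 */
static uint64_t oom_score_adj_step_bytes(uint64_t total_bytes, uint64_t page_size)
{
	assert(page_size != 0);

	uint64_t total_pages = total_bytes / page_size;

	/* Integer division, as in the kernel: the remainder is dropped. */
	return (total_pages / 1000) * page_size;
}
```

With 512 GiB of RAM and 4 KiB pages, oom_score_adj_step_bytes(512ULL << 30, 4096)
gives 549752832 bytes, i.e. about 524 MiB per unit of oom_score_adj, which is
the "512M" order of magnitude quoted in point 1. A page-based quota such as
oom_protect has a step of one page (4 KiB) instead.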
>> about the semantics of non-leaf memcgs protection,
>> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
>> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.

If the oom_protect quota of the parent non-leaf memcg is less than the sum of
the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group is
proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum of
the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group is
proportionally increased.
The purpose is that users can set the oom_protect quota according to their own
needs, while the system management process sets an appropriate oom_protect quota
on the parent non-leaf memcg as a final backstop. In this way the system
management process can indirectly manage all user processes.

>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>means that it is always a memcg selected and all tasks within killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is to how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?

There is a function in the patch that determines whether the oom_protect
mechanism is enabled. All memory.oom.protect nodes default to 0, so
is_root_oom_protect() returns 0 by default. The oom_protect mechanism only
takes effect when "root_mem_cgroup->memory.children_oom_protect_usage" is
not 0, and it only applies to memcgs whose memory.oom.protect node has been set.

+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}

Is there a problem with my understanding?

--
Thanks for your comment!
chengkaitao
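
P.S. The proportional rule described above for non-leaf memcgs can be
illustrated with a small userspace model. This is only a sketch of the rule
as stated in this mail, not the code in the patch: the function and parameter
names are hypothetical, and it assumes each sub-group's effective quota is
simply its own quota scaled by parent quota / sum of sub-group quotas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model, names not taken from the patch.
 * Scales each child's requested oom_protect quota so that the effective
 * quotas add up to (at most) the parent's quota:
 *   parent < sum(children)  ->  every child is reduced proportionally
 *   parent > sum(children)  ->  every child is increased proportionally
 * Quotas are in pages, so the 64-bit product does not overflow on
 * realistically sized machines.
 */
static void scale_child_protect(uint64_t parent_quota,
				const uint64_t *child_quota,
				uint64_t *effective, size_t n)
{
	uint64_t sum = 0;

	assert(n == 0 || (child_quota != NULL && effective != NULL));

	for (size_t i = 0; i < n; i++)
		sum += child_quota[i];

	for (size_t i = 0; i < n; i++) {
		/* No child asked for protection: nothing to distribute. */
		effective[i] = sum ? child_quota[i] * parent_quota / sum : 0;
	}
}
```

For example, with two sub-groups asking for 300 and 100 pages, a parent quota
of 200 pages yields effective quotas of 150 and 50, while a parent quota of
800 pages yields 600 and 200. The ratio between the sub-groups stays the same
in both cases; only the parent decides the total.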