Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

程垲涛 Chengkaitao Cheng <chengkaitao@xxxxxxxxxxxxxx> · Tue, 9 May 2023 06:50:59 +0000

At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>> 
>> Here is a more detailed comparison of the old oom_score_adj mechanism
>> and the new oom_protect mechanism:
>> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
>> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
>> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process that uniformly sets
>> the oom_score_adj of some important processes, but it is not simple to make
>> multi-level settings for tens of thousands of processes on a physical machine
>> through lightweight parent processes. We may need a huge table to record the
>> oom_score_adj values maintained by all the lightweight parent processes, and a
>> user process limited by its parent process cannot change its own
>> oom_score_adj, because it does not know the details of that huge table. The new
>> patch adopts the cgroup mechanism. It does not need any parent process to manage
>> oom_score_adj. The settings of each memcg are independent of each other,
>> which makes it easier to plan the OOM order of all processes. Because of the
>> unique nature of memory resources, cloud service vendors currently do not
>> oversell memory in their capacity planning. I would like to use the new patch
>> to make overselling memory resources possible.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extent that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?

It's correct.
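
To make that definition concrete, here is a minimal userspace sketch. It is
not code from the patch: struct memcg_sample, oom_excess() and pick_victim()
are hypothetical names used only for illustration, and it ranks whole memcgs
purely by how far their usage exceeds the protection threshold.

```c
#include <stddef.h>

/* Hypothetical per-memcg snapshot; both fields are counted in pages. */
struct memcg_sample {
	unsigned long usage;       /* memory currently charged to the memcg */
	unsigned long oom_protect; /* user-defined protection threshold */
};

/* Excess above the threshold; a memcg within its quota has excess 0. */
static unsigned long oom_excess(const struct memcg_sample *m)
{
	return m->usage > m->oom_protect ? m->usage - m->oom_protect : 0;
}

/* Index of the memcg with the largest excess, or -1 if the array is empty. */
static int pick_victim(const struct memcg_sample *v, size_t n)
{
	int best = -1;
	unsigned long best_excess = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long e = oom_excess(&v[i]);

		if (best < 0 || e > best_excess) {
			best = (int)i;
			best_excess = e;
		}
	}
	return best;
}
```

With this ranking, a small memcg that has no protection can be chosen ahead
of a much larger memcg that stays within its quota.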

>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.

I think memcg accounting is more reasonable because its memory statistics
are more comprehensive. For example, it covers active page cache, which
also increases the probability of an OOM kill. In the new patch, all shared
memory also consumes the memcg's oom_protect quota, so the part of the quota
left for the processes in that memcg decreases.

>> 3. I ran a test in which I deployed an excessive number of containers on a
>> physical machine. When dockerinit set the oom_score_adj of all processes in a
>> container to a positive number, even processes that occupied very little memory
>> in the container were easily killed, resulting in a large number of invalid kills.
>> If dockerinit itself is unfortunately killed, it triggers container self-healing and
>> the container is rebuilt, resulting in more severe memory oscillations. The new patch
>> abandons the approach of adding an equal amount of oom_score_adj to each process
>> in the container and instead adopts an oom_protect quota shared by all processes in
>> the container. If a process in the container is killed, the remaining processes
>> receive more of the oom_protect quota, which makes them harder to kill.
>> In my test case, the new patch reduced the number of invalid kills by 70%.
>> 4. oom_score_adj is a global configuration, so it cannot express a kill order that
>> only affects a certain memcg-oom-killer. The oom_protect mechanism, however,
>> inherits downwards: a user can only change the kill order of their own memcg OOM,
>> and the kill order of the parent memcg-oom-killer or the global-oom-killer is not
>> affected.
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we discussed that although the adjustment range
>> of oom_score_adj is [-1000,1000], it essentially only supports two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
>> clumsy at best. To solve this problem in the new patch, I introduced a new
>> indicator, oom_kill_inherit, which counts the number of times the local and child
>> cgroups have been selected by the OOM killer of an ancestor cgroup. By observing
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?

The new indicator oom_kill_inherit is negatively correlated with
memory.oom.protect, so it gives us a ruler for finding the optimal value of
memory.oom.protect.

>> About the semantics of non-leaf memcg protection:
>> If a non-leaf memcg's oom_protect quota is set, its leaf memcgs calculate their
>> effective oom_protect quota in proportion to the non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.

If the oom_protect quota of the parent non-leaf memcg is less than the sum of
the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
is proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum
of the sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
is proportionally increased.
The purpose of this is that users can set oom_protect quotas according to
their own needs, while the system management process sets an appropriate
oom_protect quota on the parent non-leaf memcg as the final backstop, so that
the system management process can indirectly manage all user processes.
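
The proportional adjustment described above can be sketched in userspace C as
follows. This is only an illustration under my own assumptions, not the code
in the patch: distribute_protect() is a hypothetical helper, quotas are plain
integers, and the case where no sub-group has set a quota simply yields 0.

```c
#include <stddef.h>

/*
 * Scale each sub-group's configured quota by parent / sum(configured), so
 * that the effective quotas add up to the parent's quota (minus integer
 * rounding). Assumes parent * configured[i] fits in 64 bits.
 */
static void distribute_protect(unsigned long long parent,
			       const unsigned long long *configured,
			       unsigned long long *effective, size_t n)
{
	unsigned long long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += configured[i];

	for (size_t i = 0; i < n; i++)
		effective[i] = sum ? parent * configured[i] / sum : 0;
}
```

For example, with sub-group quotas of 100 and 300, a parent quota of 100
gives effective quotas of 25 and 75 (reduced), while a parent quota of 800
gives 200 and 600 (increased).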

>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>means that it is always a memcg selected and all tasks within killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?

There is a function in the patch that determines whether the oom_protect
mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
is_root_oom_protect() returns 0 by default. The oom_protect mechanism only
takes effect when "root_mem_cgroup->memory.children_oom_protect_usage"
is not 0, and it only affects memcgs whose memory.oom.protect node is set.

+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}

Is there anything wrong with my understanding?

-- 
Thanks for your comment!
chengkaitao