At 2022-12-01 00:27:54, "Michal Hocko" <mhocko@xxxxxxxx> wrote: >On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote: >> On 2022-11-30 21:15:06, "Michal Hocko" <mhocko@xxxxxxxx> wrote: >> > On Wed 30-11-22 15:01:58, chengkaitao wrote: >> > > From: chengkaitao <pilgrimtao@xxxxxxxxx> >> > > >> > > We created a new interface <memory.oom.protect> for memory, If there is >> > > the OOM killer under parent memory cgroup, and the memory usage of a >> > > child cgroup is within its effective oom.protect boundary, the cgroup's >> > > tasks won't be OOM killed unless there is no unprotected tasks in other >> > > children cgroups. It draws on the logic of <memory.min/low> in the >> > > inheritance relationship. >> > >> > Could you be more specific about usecases? > >This is a very important question to answer. usecases 1: users say that they want to protect an important process with high memory consumption from being killed by the oom in case of docker container failure, so as to retain more critical on-site information or a self recovery mechanism. At this time, they suggest setting the score_adj of this process to -1000, but I don't agree with it, because the docker container is not important to other docker containers of the same physical machine. If score_adj of the process is set to -1000, the probability of oom in other container processes will increase. usecases 2: There are many business processes and agent processes mixed together on a physical machine, and they need to be classified and protected. However, some agents are the parents of business processes, and some business processes are the parents of agent processes, It will be troublesome to set different score_adj for them. Business processes and agents cannot determine which level their score_adj should be at, If we create another agent to set all processes's score_adj, we have to cycle through all the processes on the physical machine regularly, which looks stupid. >> > How do you tune oom.protect >> > wrt to other tunables? How does this interact with the oom_score_adj >> > tunining (e.g. a first hand oom victim with the score_adj 1000 sitting >> > in a oom protected memcg)? >> >> We prefer users to use score_adj and oom.protect independently. Score_adj is >> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup. >> When the physical machine's memory size is particularly large, the score_adj >> granularity is also very large. However, oom.protect can achieve more fine-grained >> adjustment. > >Let me clarify a bit. I am not trying to defend oom_score_adj. It has >it's well known limitations and it is is essentially unusable for many >situations other than - hide or auto-select potential oom victim. > >> When the score_adj of the processes are the same, I list the following cases >> for explanation, >> >> root >> | >> cgroup A >> / \ >> cgroup B cgroup C >> (task m,n) (task x,y) >> >> score_adj(all task) = 0; >> oom.protect(cgroup A) = 0; >> oom.protect(cgroup B) = 0; >> oom.protect(cgroup C) = 3G; > >How can you enforce protection at C level without any protection at A >level? The basic idea of this scheme is that all processes in the same cgroup are equally important. If some processes need extra protection, a new cgroup needs to be created for unified settings. I don't think it is necessary to implement protection in cgroup C, because task x and task y are equally important. Only the four processes (task m, n, x and y) in cgroup A, have important and secondary differences. > This would easily allow arbitrary cgroup to hide from the oom > killer and spill over to other cgroups. I don't think this will happen, because eoom.protect only works on parent cgroup. If "oom.protect(parent cgroup) = 0", from perspective of grandpa cgroup, task x and y will not be specially protected. >> usage(task m) = 1G >> usage(task n) = 2G >> usage(task x) = 1G >> usage(task y) = 2G >> >> oom killer order of cgroup A: n > m > y > x >> oom killer order of host: y = n > x = m >> >> If cgroup A is a directory maintained by users, users can use oom.protect >> to protect relatively important tasks x and y. >> >> However, when score_adj and oom.protect are used at the same time, we >> will also consider the impact of both, as expressed in the following formula. >> but I have to admit that it is an unstable result. >> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage > >I hope I am not misreading but this has some rather unexpected >properties. First off, bigger memory consumers in a protected memcg are >protected more. Since cgroup needs to reasonably distribute the protection quota to all processes in the cgroup, I think that processes consuming more memory should get more quota. It is fair to processes consuming less memory too, even if processes consuming more memory get more quota, its oom_score is still higher than the processes consuming less memory. When the oom killer appears in local cgroup, the order of oom killer remains unchanged >Also I would expect the protection discount would >be capped by the actual usage otherwise excessive protection >configuration could skew the results considerably. In the calculation, we will select the minimum value of memcg_usage and oom.protect >> > I haven't really read through the whole patch but this struck me odd. >> >> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, >> > > unsigned long totalpages = totalram_pages() + total_swap_pages; >> > > unsigned long points = 0; >> > > long badness; >> > > +#ifdef CONFIG_MEMCG >> > > + struct mem_cgroup *memcg; >> > > >> > > - badness = oom_badness(task, totalpages); >> > > + rcu_read_lock(); >> > > + memcg = mem_cgroup_from_task(task); >> > > + if (memcg && !css_tryget(&memcg->css)) >> > > + memcg = NULL; >> > > + rcu_read_unlock(); >> > > + >> > > + update_parent_oom_protection(root_mem_cgroup, memcg); >> > > + css_put(&memcg->css); >> > > +#endif >> > > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT); >> > >> > the badness means different thing depending on which memcg hierarchy >> > subtree you look at. Scaling based on the global oom could get really >> > misleading. >> >> I also took it into consideration. I planned to change "/proc/pid/oom_score" >> to a writable node. When writing to different cgroup paths, different values >> will be output. The default output is root cgroup. Do you think this idea is >> feasible? > >I do not follow. Care to elaborate? Take two example, cmd: cat /proc/pid/oom_score output: Scaling based on the global oom cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score output: Scaling based on the cgroupB oom (If the task is not in the cgroupB's hierarchy subtree, output: invalid parameter) Thanks for your comment! -- Chengkaitao Best wishes