On 2022-11-30 21:15:06, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
> On Wed 30-11-22 15:01:58, chengkaitao wrote:
> > From: chengkaitao <pilgrimtao@xxxxxxxxx>
> >
> > We created a new interface <memory.oom.protect> for memory, If there is
> > the OOM killer under parent memory cgroup, and the memory usage of a
> > child cgroup is within its effective oom.protect boundary, the cgroup's
> > tasks won't be OOM killed unless there is no unprotected tasks in other
> > children cgroups. It draws on the logic of <memory.min/low> in the
> > inheritance relationship.
>
> Could you be more specific about usecases? How do you tune oom.protect
> wrt to other tunables? How does this interact with the oom_score_adj
> tunining (e.g. a first hand oom victim with the score_adj 1000 sitting
> in a oom protected memcg)?

We would prefer that users use score_adj and oom.protect independently.
score_adj is a host-wide parameter, while oom.protect is a per-cgroup
parameter. On a machine with a particularly large amount of physical
memory, the granularity of score_adj is correspondingly coarse, whereas
oom.protect allows more fine-grained adjustment.

For the case where all processes have the same score_adj, consider the
following example:

          root
           |
        cgroup A
       /        \
 cgroup B      cgroup C
(task m,n)    (task x,y)

score_adj(all task) = 0;
oom.protect(cgroup A) = 0;
oom.protect(cgroup B) = 0;
oom.protect(cgroup C) = 3G;
usage(task m) = 1G
usage(task n) = 2G
usage(task x) = 1G
usage(task y) = 2G

oom killer order of cgroup A: n > m > y > x
oom killer order of host:     y = n > x = m

If cgroup A is a directory maintained by users, they can use oom.protect
to protect the relatively important tasks x and y.

However, when score_adj and oom.protect are used at the same time, both
are taken into account, as expressed in the following formula, although
I have to admit that the result is unstable:

    score = task_usage + score_adj * totalpage
            - eoom.protect * task_usage / local_memcg_usage

> I haven't really read through the whole patch but this struck me odd.
>
> > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> >          unsigned long totalpages = totalram_pages() + total_swap_pages;
> >          unsigned long points = 0;
> >          long badness;
> > +#ifdef CONFIG_MEMCG
> > +        struct mem_cgroup *memcg;
> >
> > -        badness = oom_badness(task, totalpages);
> > +        rcu_read_lock();
> > +        memcg = mem_cgroup_from_task(task);
> > +        if (memcg && !css_tryget(&memcg->css))
> > +                memcg = NULL;
> > +        rcu_read_unlock();
> > +
> > +        update_parent_oom_protection(root_mem_cgroup, memcg);
> > +        css_put(&memcg->css);
> > +#endif
> > +        badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>
> the badness means different thing depending on which memcg hierarchy
> subtree you look at. Scaling based on the global oom could get really
> misleading.

I have taken this into consideration as well. I plan to make
"/proc/pid/oom_score" a writable node: after a cgroup path is written
to it, reading it returns the score relative to that cgroup. By default,
the score is reported relative to the root cgroup. Do you think this
idea is feasible?

-- 
Chengkaitao
Best wishes
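
P.S. To make the formula and the four-task example above easier to
check, here is a minimal user-space sketch in Python. It is only a
literal reading of the formula, not the kernel implementation. The
function and variable names, the totalpages value, and the choice of
GiB as the unit are all assumptions made for illustration. It also
assumes that the effective protection of cgroup C is 3G when the OOM
is scoped to cgroup A, and 0 at host level (because oom.protect of
cgroup A is 0).

```python
def oom_score(task_usage, score_adj, totalpages, eoom_protect, local_memcg_usage):
    """Literal reading of the formula in this mail (all sizes in one unit).

    How score_adj is scaled is an assumption: it is used here exactly as
    the formula is written, as a plain coefficient of totalpages.
    """
    if local_memcg_usage > 0:
        # The task's share of its memcg's effective protection.
        protected = eoom_protect * task_usage / local_memcg_usage
    else:
        protected = 0
    return task_usage + score_adj * totalpages - protected


# Example from this mail: sizes in GiB, score_adj = 0 for every task.
TASKS = {"m": ("B", 1), "n": ("B", 2), "x": ("C", 1), "y": ("C", 2)}
MEMCG_USAGE = {"B": 3, "C": 3}  # sum of the task usages in each cgroup


def scores(eoom_protect, totalpages=64):
    """Score every task; totalpages=64 is a hypothetical machine size."""
    return {
        name: oom_score(usage, 0, totalpages, eoom_protect[cg], MEMCG_USAGE[cg])
        for name, (cg, usage) in TASKS.items()
    }


# OOM scoped to cgroup A: C's effective protection is assumed to be 3G.
under_cgroup_a = scores({"B": 0, "C": 3})
# Host-level OOM: no effective protection is assumed to reach B or C.
at_host = scores({"B": 0, "C": 0})

print("cgroup A:", under_cgroup_a)
print("host:    ", at_host)
```

Under this reading, n and m keep their full usage as score when the
OOM is scoped to cgroup A, while x and y are both fully covered by the
protection, so they rank last. The sketch scores x and y equally; the
order y > x given above does not follow from the formula alone. At
host level the scores equal the plain usage, which matches
y = n > x = m.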