On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> Traditionally, the OOM killer operates on the process level.
> Under OOM conditions, it finds the process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running containers well:
>
> 1) There is no fairness between containers. A small container with
>    a few large processes will be chosen over a large one with a huge
>    number of small processes.
>
> 2) Containers often do not expect that some random process inside
>    will be killed. In many cases a much safer behavior is to kill
>    all tasks in the container. Traditionally, this was implemented
>    in userspace, but doing it in the kernel has some advantages,
>    especially in the case of a system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it looks for the biggest memory consumer:
> a leaf memory cgroup or a memory cgroup with the memory.oom_group
> option set. Then it kills either the task with the biggest memory
> footprint, or all belonging tasks if memory.oom_group is set.
> If a cgroup has memory.oom_group set, all descendant cgroups
> implicitly inherit the memory.oom_group setting.

I think it would be better to separate oom_group into its own patch.
This patch would then just add the cgroup awareness, and oom_group
would build on top of that.

Wrt. the implicit inheritance you brought up in a separate email
thread [1]. Let me quote:

: after some additional thinking I don't think anymore that implicit
: propagation of oom_group is a good idea. Let me explain: assume we
: have memcg A with memory.max and memory.oom_group set, and nested
: memcg A/B with memory.max set. Let's imagine we have an OOM event in
: A/B. What is the expected system behavior?
: We have OOM scoped to A/B, and any action should also be scoped to A/B.
: We really shouldn't touch processes which do not belong to A/B.
: That means we should either kill the biggest process in A/B, or all
: processes in A/B. It's natural to make A/B/memory.oom_group responsible
: for this decision. It's strange to make it depend on A/memory.oom_group,
: IMO. It really makes no sense, and makes the oom_group knob really hard
: to describe.
:
: Also, after some off-list discussion, we've realized that the
: memory.oom_group knob should be delegatable. The workload should have
: control over it to express dependencies between processes.

OK, I have asked about this already but I am not sure the answer was
very explicit. So let me ask again. When exactly would a subtree
disagree with the parent on oom_group? In other words, when do we want
a different cleanup depending on the OOM root? I am not saying this is
wrong, I am just curious about a practical example.

> Tasks with oom_score_adj set to -1000 are considered unkillable.
>
> The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with other leaf and oom_group memory cgroups.
> The oom_group option is not supported for the root cgroup.
> Due to the memcg statistics implementation, a special algorithm
> is used for estimating the root cgroup oom_score: we define it
> as the maximum oom_score of the belonging tasks.

[1] http://lkml.kernel.org/r/20171002124712.GA17638@xxxxxxxxxxxxxxxxxxxxxxxxxxx

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +                              const nodemask_t *nodemask,
> +                              unsigned long totalpages)
> +{
> +        long points = 0;
> +        int nid;
> +        pg_data_t *pgdat;
> +
> +        /*
> +         * We don't have necessary stats for the root memcg,
> +         * so we define its oom_score as the maximum oom_score
> +         * of the belonging tasks.
> +         */

Why not a sum of all tasks, which would more closely resemble what we
do for other memcgs? Sure, this would require ignoring oom_score_adj,
so oom_badness would have to be tweaked a bit: basically, split it
into __oom_badness, which calculates the value without the bias, and
oom_badness on top, adding the bias to the scaled value.
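To be more specific, here is an untested sketch of that split against
the current oom_badness() in mm/oom_kill.c. I am keeping the nodemask
parameter only to mirror the existing signature, and the
MMF_OOM_SKIP/vfork corner cases of the current code are elided:

        /*
         * Raw memory footprint of a task, without the oom_score_adj
         * bias, so it can be safely summed over all tasks, e.g. for
         * the root memcg evaluation.
         */
        static unsigned long __oom_badness(struct task_struct *p,
                                           const nodemask_t *nodemask)
        {
                struct task_struct *tsk;
                unsigned long points;

                tsk = find_lock_task_mm(p);
                if (!tsk)
                        return 0;

                points = get_mm_rss(tsk->mm) +
                         get_mm_counter(tsk->mm, MM_SWAPENTS) +
                         atomic_long_read(&tsk->mm->nr_ptes) +
                         mm_nr_pmds(tsk->mm);
                task_unlock(tsk);

                return points;
        }

        unsigned long oom_badness(struct task_struct *p,
                                  struct mem_cgroup *memcg,
                                  const nodemask_t *nodemask,
                                  unsigned long totalpages)
        {
                long points, adj;

                if (oom_unkillable_task(p, memcg, nodemask))
                        return 0;

                adj = (long)p->signal->oom_score_adj;
                if (adj == OOM_SCORE_ADJ_MIN)
                        return 0;

                points = __oom_badness(p, nodemask);

                /* scale the bias by the available memory, as we do now */
                adj *= totalpages / 1000;
                points += adj;

                /* never return 0 for an eligible task */
                return points > 0 ? points : 1;
        }

The root memcg path below could then sum __oom_badness() over all
tasks instead of taking the maximum of the biased values.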
> +        if (memcg == root_mem_cgroup) {
> +                struct css_task_iter it;
> +                struct task_struct *task;
> +                long score, max_score = 0;
> +
> +                css_task_iter_start(&memcg->css, 0, &it);
> +                while ((task = css_task_iter_next(&it))) {
> +                        score = oom_badness(task, memcg, nodemask,
> +                                            totalpages);
> +                        if (score > max_score)
> +                                max_score = score;
> +                }
> +                css_task_iter_end(&it);
> +
> +                return max_score;
> +        }
> +
> +        for_each_node_state(nid, N_MEMORY) {
> +                if (nodemask && !node_isset(nid, *nodemask))
> +                        continue;
> +
> +                points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +                                LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +                pgdat = NODE_DATA(nid);
> +                points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +                                            NR_SLAB_UNRECLAIMABLE);
> +        }
> +
> +        points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +                        (PAGE_SIZE / 1024);
> +        points += memcg_page_state(memcg, MEMCG_SOCK);
> +        points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +        return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + *        -1: there are inflight OOM victim tasks, belonging to the memcg
> + *         0: memcg is not eligible, e.g. all belonging tasks are protected
> + *            by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *        >0: memcg is eligible, and the returned value is an estimation
> + *            of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +                               const nodemask_t *nodemask,
> +                               unsigned long totalpages)
> +{
> +        struct css_task_iter it;
> +        struct task_struct *task;
> +        int eligible = 0;
> +
> +        /*
> +         * Memcg is OOM eligible if there are OOM killable tasks inside.
> +         *
> +         * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +         * as unkillable.
> +         *
> +         * If there are inflight OOM victim tasks inside the memcg,
> +         * we return -1.
> +         */
> +        css_task_iter_start(&memcg->css, 0, &it);
> +        while ((task = css_task_iter_next(&it))) {
> +                if (!eligible &&
> +                    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +                        eligible = 1;
> +
> +                if (tsk_is_oom_victim(task) &&
> +                    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +                        eligible = -1;
> +                        break;
> +                }
> +        }
> +        css_task_iter_end(&it);
> +
> +        if (eligible <= 0)
> +                return eligible;
> +
> +        return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +        struct mem_cgroup *iter, *parent;
> +
> +        /*
> +         * If OOM is memcg-wide, and the memcg or its ancestor has
> +         * the oom_group flag, simply select the memcg as a victim.
> +         */
> +        if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> +                oc->chosen_memcg = oc->memcg;
> +                css_get(&oc->chosen_memcg->css);
> +                oc->chosen_points = oc->memcg->oom_score;
> +                return;
> +        }
> +
> +        oc->chosen_memcg = NULL;
> +
> +        /*
> +         * The oom_score is calculated for leaf memcgs and propagated upwards
> +         * by the tree.
> +         *
> +         * for_each_mem_cgroup_tree() walks the tree in pre-order,
> +         * so we simply reset oom_score for non-leaf cgroups before
> +         * starting to accumulate an actual value from the underlying sub-tree.
> +         *
> +         * Root memcg is treated as a leaf memcg.
> +         */
> +        rcu_read_lock();
> +        for_each_mem_cgroup_tree(iter, root) {
> +                if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> +                        iter->oom_score = 0;
> +                        continue;
> +                }
> +
> +                iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> +                                                     oc->totalpages);
> +
> +                /*
> +                 * Ignore empty and non-eligible memory cgroups.
> +                 */
> +                if (iter->oom_score == 0)
> +                        continue;
> +
> +                /*
> +                 * If there are inflight OOM victims, we don't need to look
> +                 * further for new victims.
> +                 */
> +                if (iter->oom_score == -1) {
> +                        oc->chosen_memcg = INFLIGHT_VICTIM;
> +                        mem_cgroup_iter_break(root, iter);
> +                        break;
> +                }
> +
> +                if (iter->oom_score > oc->chosen_points) {
> +                        oc->chosen_memcg = iter;
> +                        oc->chosen_points = iter->oom_score;
> +                }
> +
> +                for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +                     parent = parent_mem_cgroup(parent)) {
> +                        parent->oom_score += iter->oom_score;
> +
> +                        if (mem_cgroup_oom_group(parent) &&
> +                            parent->oom_score > oc->chosen_points) {
> +                                oc->chosen_memcg = parent;
> +                                oc->chosen_points = parent->oom_score;
> +                        }
> +                }
> +        }
> +
> +        if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +                css_get(&oc->chosen_memcg->css);
> +
> +        rcu_read_unlock();
> +}

As I've written in a private email, things will get much easier if you
get rid of memcg->oom_score and simply do the recursive oom_score
evaluation for eligible intermediate nodes. You would basically do

        for_each_mem_cgroup_tree(iter, root) {
                if (!memcg_oom_eligible(iter))
                        continue;

                oom_score = oom_evaluate_memcg(iter, mask);
                if (oom_score == -1) {
                        oc->chosen_memcg = INFLIGHT_VICTIM;
                        mem_cgroup_iter_break(root, iter);
                        break;
                }

                if (oom_score > oc->chosen_points)
                        mark_new_oom_memcg(oc, iter, oom_score);

                /*
                 * potential optimization: skip the whole subtree if
                 * iter is not a leaf
                 */
        }

where

        bool memcg_oom_eligible(struct mem_cgroup *memcg)
        {
                if (cgroup_has_tasks(memcg->css.cgroup))
                        return true;
                if (mem_cgroup_oom_group(memcg))
                        return true;
                return false;
        }

        unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
        {
                /* check eligible tasks - oom victims, OOM_SCORE_ADJ_MIN */
                /* calculate badness */
        }

        unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
        {
                unsigned long score = 0;

                if (memcg == root_mem_cgroup) {
                        for_each_task()
                                score += __oom_badness(task, mask);
                        return score;
                }

                for_each_mem_cgroup_tree(iter, memcg) {
                        unsigned long memcg_score;

                        memcg_score = __oom_evaluate_memcg(iter, mask);
                        if (memcg_score == -1) {
                                mem_cgroup_iter_break(memcg, iter);
                                return -1;
                        }
                        score += memcg_score;
                }

                return score;
        }

This should also be simple to split out for oom_group in a separate
patch while keeping the overall code structure.

Does this make any sense to you?
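mark_new_oom_memcg() above is of course just a placeholder name I made
up. A minimal sketch, reusing the chosen_memcg and chosen_points
fields your patch already has in oom_control (taking the css reference
would stay with the caller, as in your select_victim_memcg()):

        /* Remember the highest-scored memcg candidate seen so far. */
        static void mark_new_oom_memcg(struct oom_control *oc,
                                       struct mem_cgroup *memcg,
                                       long score)
        {
                oc->chosen_memcg = memcg;
                oc->chosen_points = score;
        }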
[...]

> @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	__oom_kill_process(victim);
>  }
>
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> +        if (!tsk_is_oom_victim(task)) {

How can this happen?

> +                get_task_struct(task);
> +                __oom_kill_process(task);
> +        }
> +        return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +        static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +                                      DEFAULT_RATELIMIT_BURST);
> +
> +        if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +                return oc->chosen_memcg;
> +
> +        /* Always begin with the task with the biggest memory footprint */
> +        oc->chosen_points = 0;
> +        oc->chosen_task = NULL;
> +        mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +        if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +                goto out;
> +
> +        if (__ratelimit(&oom_rs))
> +                dump_header(oc, oc->chosen_task);

Hmm, does the full dump_header really apply to the new heuristic?
E.g. does it make sense to dump_tasks()? Would it make sense to print
stats of all eligible memcgs instead?

> +
> +        __oom_kill_process(oc->chosen_task);
> +
> +        /* If the oom_group flag is set, kill all belonging tasks */
> +        if (mem_cgroup_oom_group(oc->chosen_memcg))
> +                mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> +                                      NULL);
> +
> +        schedule_timeout_killable(1);

I would prefer if we had this timeout at a single place in
out_of_memory().

Other than that, the semantics (sans oom_group, which needs more
clarification) make sense to me.
-- 
Michal Hocko
SUSE Labs