On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> > Traditionally, the OOM killer operates on the process level.
> > Under OOM conditions, it finds the process with the highest oom score
> > and kills it.
> >
> > This behavior doesn't suit well systems with many running
> > containers:
> >
> > 1) There is no fairness between containers. A small container with
> >    a few large processes will be chosen over a large one with a huge
> >    number of small processes.
> >
> > 2) Containers often do not expect that some random process inside
> >    will be killed. In many cases a much safer behavior is to kill
> >    all tasks in the container. Traditionally, this was implemented
> >    in userspace, but doing it in the kernel has some advantages,
> >    especially in the case of a system-wide OOM.
> >
> > To address these issues, the cgroup-aware OOM killer is introduced.
> >
> > Under OOM conditions, it looks for the biggest memory consumer:
> > a leaf memory cgroup or a memory cgroup with the memory.oom_group
> > option set. Then it kills either the task with the biggest memory
> > footprint or, if memory.oom_group is set, all belonging tasks.
> > If a cgroup has memory.oom_group set, all descendant cgroups
> > implicitly inherit the memory.oom_group setting.
>
> I think it would be better to separate oom_group into its own patch.
> So this patch would just add the cgroup awareness and oom_group will
> build on top of that.

Sure, will do.

> Wrt. the implicit inheritance you brought up in a separate email
> thread [1]. Let me quote:
> : after some additional thinking I don't think anymore that implicit
> : propagation of oom_group is a good idea. Let me explain: assume we
> : have memcg A with memory.max and memory.oom_group set, and nested
> : memcg A/B with memory.max set. Let's imagine we have an OOM event in
> : A/B. What is the expected system behavior?
> : We have an OOM scoped to A/B, and any action should also be scoped to A/B.
> : We really shouldn't touch processes which do not belong to A/B.
> : That means we should either kill the biggest process in A/B or all
> : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> : for this decision. It's strange to make it depend on A/memory.oom_group, IMO.
> : It really makes no sense, and makes the oom_group knob really hard to describe.
> :
> : Also, after some off-list discussion, we've realized that the memory.oom_group
> : knob should be delegatable. The workload should have control over it to express
> : dependencies between processes.
>
> OK, I have asked about this already but I am not sure the answer was
> very explicit. So let me ask again. When exactly would a subtree
> disagree with the parent on oom_group? In other words, when do we want a
> different cleanup based on the OOM root? I am not saying this is wrong,
> I am just curious about a practical example.

Well, I do not have a practical example right now, but implicit inheritance
is against the logic. Any OOM event has a scope, and the oom_group knob is
applied to OOM events scoped to the cgroup or any of its ancestors
(including the system as a whole). So applying it implicitly to OOMs
scoped to descendant cgroups makes no sense. It's a strange configuration
limitation, and I do not see any benefits: it doesn't provide any new
functionality or guarantees. Even if we don't have practical examples,
we should build something less surprising for a user, and I don't
understand why oom_group should be inherited.
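To illustrate, here is a minimal sketch of the lookup I have in mind
(oom_group_applies() is a made-up helper; mem_cgroup_oom_group() and
parent_mem_cgroup() are the existing accessors):

	/*
	 * For an OOM event scoped to @scope, only oom_group settings on
	 * the path from the chosen victim memcg up to @scope are
	 * consulted; ancestors above the scope are never looked at.
	 */
	static bool oom_group_applies(struct mem_cgroup *victim,
				      struct mem_cgroup *scope)
	{
		struct mem_cgroup *iter;

		for (iter = victim; iter; iter = parent_mem_cgroup(iter)) {
			if (mem_cgroup_oom_group(iter))
				return true;
			if (iter == scope)
				break;
		}
		return false;
	}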
> > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> >
> > The root cgroup is treated as a leaf memory cgroup, so its score
> > is compared with other leaf and oom_group memory cgroups.
> > The oom_group option is not supported for the root cgroup.
> > Due to the memcg statistics implementation, a special algorithm
> > is used for estimating the root cgroup oom_score: we define it
> > as the maximum oom_score of the belonging tasks.
>
> [1] http://lkml.kernel.org/r/20171002124712.GA17638@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> [...]
>
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask,
> > +			      unsigned long totalpages)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	pg_data_t *pgdat;
> > +
> > +	/*
> > +	 * We don't have necessary stats for the root memcg,
> > +	 * so we define its oom_score as the maximum oom_score
> > +	 * of the belonging tasks.
> > +	 */
>
> Why not a sum of all tasks which would more resemble what we do for
> other memcgs? Sure this would require ignoring oom_score_adj so
> oom_badness would have to be tweaked a bit (basically split it into
> __oom_badness which calculates the value without the bias and
> oom_badness on top adding the bias on top of the scaled value).

We've discussed it already: calculating the sum is tricky, as tasks can
share memory (and the mm struct). As I remember, you suggested using the
maximum to solve exactly this problem, and I think it's a good
approximation. Assuming that tasks in the root cgroup likely have nothing
in common, and that we don't support oom_group for it, looking at the
biggest task makes perfect sense: we're exactly comparing killable entities.
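For reference, a rough sketch of the split you describe (hypothetical
signatures, not from the patch; the find_lock_task_mm() handling and
locking of the real oom_badness() are omitted for brevity):

	/* Raw memory footprint of a task, without the oom_score_adj bias. */
	static unsigned long __oom_badness(struct task_struct *p)
	{
		struct mm_struct *mm = p->mm;

		if (!mm)
			return 0;

		/* anon + file rss plus swap entries, in pages */
		return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS);
	}

	/* Bias applied on top of the raw value, as in oom_badness(). */
	static long oom_badness_biased(struct task_struct *p,
				       unsigned long totalpages)
	{
		long points = __oom_badness(p);

		points += (long)p->signal->oom_score_adj * (long)totalpages / 1000;
		return points > 0 ? points : 0;
	}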
> > +	if (memcg == root_mem_cgroup) {
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		long score, max_score = 0;
> > +
> > +		css_task_iter_start(&memcg->css, 0, &it);
> > +		while ((task = css_task_iter_next(&it))) {
> > +			score = oom_badness(task, memcg, nodemask,
> > +					    totalpages);
> > +			if (score > max_score)
> > +				max_score = score;
> > +		}
> > +		css_task_iter_end(&it);
> > +
> > +		return max_score;
> > +	}
> > +
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		if (nodemask && !node_isset(nid, *nodemask))
> > +			continue;
> > +
> > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +
> > +		pgdat = NODE_DATA(nid);
> > +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > +					    NR_SLAB_UNRECLAIMABLE);
> > +	}
> > +
> > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +			(PAGE_SIZE / 1024);
> > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > +	return points;
> > +}
> > +
> > +/*
> > + * Checks if the given memcg is a valid OOM victim and returns a number,
> > + * which means the following:
> > + * -1: there are inflight OOM victim tasks, belonging to the memcg
> > + *  0: memcg is not eligible, e.g. all belonging tasks are protected
> > + *     by oom_score_adj set to OOM_SCORE_ADJ_MIN
> > + * >0: memcg is eligible, and the returned value is an estimation
> > + *     of the memory footprint
> > + */
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +			       const nodemask_t *nodemask,
> > +			       unsigned long totalpages)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +	int eligible = 0;
> > +
> > +	/*
> > +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> > +	 *
> > +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> > +	 * as unkillable.
> > +	 *
> > +	 * If there are inflight OOM victim tasks inside the memcg,
> > +	 * we return -1.
> > +	 */
> > +	css_task_iter_start(&memcg->css, 0, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		if (!eligible &&
> > +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > +			eligible = 1;
> > +
> > +		if (tsk_is_oom_victim(task) &&
> > +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > +			eligible = -1;
> > +			break;
> > +		}
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	if (eligible <= 0)
> > +		return eligible;
> > +
> > +	return memcg_oom_badness(memcg, nodemask, totalpages);
> > +}
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter, *parent;
> > +
> > +	/*
> > +	 * If the OOM is memcg-wide, and the memcg or its ancestor has
> > +	 * the oom_group flag, simply select the memcg as a victim.
> > +	 */
> > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > +		oc->chosen_memcg = oc->memcg;
> > +		css_get(&oc->chosen_memcg->css);
> > +		oc->chosen_points = oc->memcg->oom_score;
> > +		return;
> > +	}
> > +
> > +	oc->chosen_memcg = NULL;
> > +
> > +	/*
> > +	 * The oom_score is calculated for leaf memcgs and propagated
> > +	 * up the tree.
> > +	 *
> > +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> > +	 * so we simply reset oom_score for non-leaf cgroups before
> > +	 * accumulating the actual value from the underlying sub-tree.
> > +	 *
> > +	 * The root memcg is treated as a leaf memcg.
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_mem_cgroup_tree(iter, root) {
> > +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> > +			iter->oom_score = 0;
> > +			continue;
> > +		}
> > +
> > +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> > +						     oc->totalpages);
> > +
> > +		/*
> > +		 * Ignore empty and non-eligible memory cgroups.
> > +		 */
> > +		if (iter->oom_score == 0)
> > +			continue;
> > +
> > +		/*
> > +		 * If there are inflight OOM victims, we don't need to look
> > +		 * further for new victims.
> > +		 */
> > +		if (iter->oom_score == -1) {
> > +			oc->chosen_memcg = INFLIGHT_VICTIM;
> > +			mem_cgroup_iter_break(root, iter);
> > +			break;
> > +		}
> > +
> > +		if (iter->oom_score > oc->chosen_points) {
> > +			oc->chosen_memcg = iter;
> > +			oc->chosen_points = iter->oom_score;
> > +		}
> > +
> > +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > +		     parent = parent_mem_cgroup(parent)) {
> > +			parent->oom_score += iter->oom_score;
> > +
> > +			if (mem_cgroup_oom_group(parent) &&
> > +			    parent->oom_score > oc->chosen_points) {
> > +				oc->chosen_memcg = parent;
> > +				oc->chosen_points = parent->oom_score;
> > +			}
> > +		}
> > +	}
> > +
> > +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> > +		css_get(&oc->chosen_memcg->css);
> > +
> > +	rcu_read_unlock();
> > +}
>
> As I've written in a private email, things will get much easier if you
> get rid of memcg->oom_score and simply do the recursive oom_score
> evaluation of eligible inner nodes.
> You would basically do
>	for_each_mem_cgroup_tree(iter, root) {
>		if (!memcg_oom_eligible(iter))
>			continue;
>
>		oom_score = oom_evaluate_memcg(iter, mask);
>		if (oom_score == -1) {
>			oc->chosen_memcg = INFLIGHT_VICTIM;
>			mem_cgroup_iter_break(root, iter);
>			break;
>		}
>		if (oom_score > oc->chosen_points)
>			mark_new_oom_memcg(iter);
>
>		/*
>		 * potential optimization to skip the whole subtree if
>		 * iter is not a leaf
>		 */
>	}
>
> where
>	bool memcg_oom_eligible(struct mem_cgroup *memcg)
>	{
>		if (cgroup_has_tasks(memcg->css.cgroup))
>			return true;
>		if (mem_cgroup_oom_group(memcg))
>			return true;
>		return false;
>	}
>
>	unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
>	{
>		/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
>		/* calculate badness */
>	}
>
>	unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
>	{
>		unsigned long score = 0;
>
>		if (memcg == root_mem_cgroup) {
>			for_each_task()
>				score += __oom_badness(task, mask);
>			return score;
>		}
>
>		for_each_mem_cgroup_tree(iter, memcg) {
>			unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
>			if (memcg_score == -1) {
>				mem_cgroup_iter_break(memcg, iter);
>				return -1;
>			}
>			score += memcg_score;
>		}
>
>		return score;
>	}
>
> This should be also simple to split for oom_group in a separate patch
> while keeping the overall code structure.
> Does this make any sense to you?

I totally agree that getting rid of memcg->oom_score is possible and a good
idea, as is separating oom_group into a separate patch.
As for the rest, let me check what I can do here.

> [...]
> > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	__oom_kill_process(victim);
> >  }
> >
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > +	if (!tsk_is_oom_victim(task)) {
>
> How can this happen?

We start by killing the largest process and then iterate over all tasks in
the cgroup. So this check is required to avoid killing tasks that are
already being terminated.

> > +		get_task_struct(task);
> > +		__oom_kill_process(task);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> > +
> > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > +		return oc->chosen_memcg;
> > +
> > +	/* Always begin with the task with the biggest memory footprint */
> > +	oc->chosen_points = 0;
> > +	oc->chosen_task = NULL;
> > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > +
> > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > +		goto out;
> > +
> > +	if (__ratelimit(&oom_rs))
> > +		dump_header(oc, oc->chosen_task);
>
> Hmm, does the full dump_header really apply for the new heuristic? E.g.
> does it make sense to dump_tasks()? Would it make sense to print stats
> of all eligible memcgs instead?

Hm, this is a tricky part: the dmesg output is at some point a part of the
ABI, but it is also closely connected with the implementation. So I would
suggest postponing this until we get more usage examples and better
understand what information we need.
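If we do go that way later, a very rough sketch of such a dump could look
like this (dump_memcg_scores() is hypothetical; it reuses
oom_evaluate_memcg() and for_each_mem_cgroup_tree() from this patch):

	static void dump_memcg_scores(struct oom_control *oc,
				      struct mem_cgroup *root)
	{
		struct mem_cgroup *iter;

		pr_info("Eligible memory cgroups and their oom_score:\n");
		for_each_mem_cgroup_tree(iter, root) {
			long score = oom_evaluate_memcg(iter, oc->nodemask,
							oc->totalpages);

			/* skip empty, ineligible and inflight-victim memcgs */
			if (score <= 0)
				continue;

			pr_info("score %ld: ", score);
			pr_cont_cgroup_path(iter->css.cgroup);
			pr_cont("\n");
		}
	}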
> > +
> > +	__oom_kill_process(oc->chosen_task);
> > +
> > +	/* If the oom_group flag is set, kill all belonging tasks */
> > +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > +				      NULL);
> > +
> > +	schedule_timeout_killable(1);
>
> I would prefer if we had this timeout in a single place in
> out_of_memory().

Ok, will do.

> Other than that the semantics (sans oom_group, which needs more
> clarification) make sense to me.

Cool! Glad to hear this.

Thanks!