On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> Hi Michal!
> 
> On Thu, Aug 24, 2017 at 01:47:06PM +0200, Michal Hocko wrote:
> > This doesn't apply on top of mmotm cleanly. You are missing
> > http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@xxxxxxxxxx
> 
> I'll rebase. Thanks!
> 
> > On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> > > Traditionally, the OOM killer operates on the process level.
> > > Under OOM conditions, it finds the process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit a system with many running containers
> > > well:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > a few large processes will be chosen over a large one with a huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. In many cases a much safer behavior is to kill
> > > all tasks in the container. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in the case of a system-wide OOM.
> > > 
> > > 3) Per-process oom_score_adj affects the global OOM, so it's a breach
> > > in the isolation.
> > 
> > Please explain more. I guess you mean that an untrusted memcg could hide
> > itself from the global OOM killer by reducing the oom scores? Well, you
> > need CAP_SYS_RESOURCE to reduce the current oom_score{_adj}, as David has
> > already pointed out. I also agree that we absolutely must not kill an
> > oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
> > as a protection from an untrusted SIGKILL and the inconsistent state that
> > results. Those applications simply shouldn't behave differently in the
> > global and container contexts.
> 
> The main point of the kill_all option is to clean up the victim cgroup
> _completely_. If some tasks can survive, that means userspace should
> take care of them, look at the cgroup after oom, and kill the survivors
> manually.
> 
> If you want to rely on OOM_SCORE_ADJ_MIN, don't set kill_all.

I really don't get the usecase for this "kill all, except this and that".
OOM_SCORE_ADJ_MIN has become a de-facto contract. You cannot simply
expect that somebody would alter a specific workload for a container
just to be safe against an unexpected SIGKILL. kill-all might be set
higher up the memcg hierarchy, which is out of your control.

> Also, it's really confusing to respect the -1000 value and completely
> ignore -999.
> 
> I believe that any complex userspace OOM handling should use memory.high
> and handle memory shortage before an actual OOM.
> 
> > If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.
> > 
> > > To address these issues, a cgroup-aware OOM killer is introduced.
> > > 
> > > Under OOM conditions, it tries to find the biggest memory consumer,
> > > and free memory by killing the corresponding task(s). The difference
> > > from the "traditional" OOM killer is that it can treat memory cgroups
> > > as memory consumers as well as single processes.
> > > 
> > > By default, it will look for the biggest leaf cgroup, and kill
> > > the largest task inside.
> > 
> > Why? I believe that the semantic should be as simple as kill the largest
> > oom killable entity. And the entity is either a process or a memcg which
> > is marked that way.
> 
> So, you still need to compare memcgroups and processes.
> 
> In my case, it's more like an exception (only processes from the root
> memcg, and only if there are no eligible cgroups with lower oom_priority).
> You suggest to rely on this comparison.
> 
> > Why should we mix things and select a memcg to kill a process inside
> > it? More on that below.
> 
> To have some sort of "fairness" in a containerized environment.
> Say, 1 cgroup with 1 big task, another cgroup with many smaller tasks.
> It's not necessarily true that the first one is a better victim.

There is nothing like a "better victim". We are pretty much in a
catastrophic situation when we try to survive by killing userspace.
We try to kill the largest because we assume that it returns the most
memory to us. Now I do understand that you want to treat the memcg as a
single killable entity, but I find it really questionable to use a
per-memcg metric and then not treat it like that and kill only a single
task. Just imagine a single memcg with zillions of tasks, each very
small: you select it as the largest, while killing a single small task
doesn't get us out of the OOM at all.

> > > But a user can change this behavior by enabling the per-cgroup
> > > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > > the whole cgroup as an indivisible memory consumer. In case it's
> > > selected as an OOM victim, all belonging tasks will be killed.
> > > 
> > > Tasks in the root cgroup are treated as independent memory consumers,
> > > and are compared with other memory consumers (e.g. leaf cgroups).
> > > The root cgroup doesn't support the oom_kill_all_tasks feature.
> > 
> > If anything you wouldn't have to treat the root memcg specially. It
> > will be like any other memcg which doesn't have oom_kill_all_tasks...
> > 
> > [...]
> > 
> > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > +                              const nodemask_t *nodemask)
> > > +{
> > > +        long points = 0;
> > > +        int nid;
> > > +        pg_data_t *pgdat;
> > > +
> > > +        for_each_node_state(nid, N_MEMORY) {
> > > +                if (nodemask && !node_isset(nid, *nodemask))
> > > +                        continue;
> > > +
> > > +                points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > > +                                LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > > +
> > > +                pgdat = NODE_DATA(nid);
> > > +                points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > > +                                            NR_SLAB_UNRECLAIMABLE);
> > > +        }
> > > +
> > > +        points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > > +                        (PAGE_SIZE / 1024);
> > > +        points += memcg_page_state(memcg, MEMCG_SOCK);
> > > +        points += memcg_page_state(memcg, MEMCG_SWAP);
> > > +
> > > +        return points;
> > 
> > I guess I have asked already and we haven't reached any consensus. I do
> > not like how you treat memcgs and tasks differently. Why cannot we have
> > a memcg score be a sum of all its tasks?
> 
> It sounds like a more expensive way to get almost the same result with
> less accuracy. Why is it better?

Because then you are comparing apples to apples. Besides that, you have
to check each task for over-killing anyway, so I do not see any
performance merits here.
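
To be more concrete about what a "sum of all its tasks" score could look
like, here is a completely untested sketch, not a counter-patch; the
helper names (sum_badness_arg, sum_task_badness, memcg_sum_badness) are
made up for illustration, and it only reuses oom_badness() and
mem_cgroup_scan_tasks() that we already have:

struct sum_badness_arg {
        const nodemask_t *nodemask;
        unsigned long totalpages;
        unsigned long points;
};

static int sum_task_badness(struct task_struct *task, void *arg)
{
        struct sum_badness_arg *a = arg;

        /*
         * oom_badness() already returns 0 for oom-unkillable and
         * OOM_SCORE_ADJ_MIN tasks, so those are skipped here as well.
         */
        a->points += oom_badness(task, NULL, a->nodemask, a->totalpages);

        return 0;       /* keep iterating */
}

static unsigned long memcg_sum_badness(struct mem_cgroup *memcg,
                                       const nodemask_t *nodemask,
                                       unsigned long totalpages)
{
        struct sum_badness_arg a = {
                .nodemask = nodemask,
                .totalpages = totalpages,
        };

        /*
         * Walks all tasks in the memcg subtree. Not meant for the root
         * memcg, whose tasks would be scored individually anyway.
         */
        mem_cgroup_scan_tasks(memcg, sum_task_badness, &a);

        return a.points;
}

This way a memcg and a task in the root cgroup end up on the same scale
(pages), and OOM_SCORE_ADJ_MIN is honored consistently with the global
OOM killer, because oom_badness() stays the single place doing the
per-task scoring.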

> > How do you want to compare a memcg score with a task score?
> 
> I have to do it for tasks in the root cgroup, but it shouldn't be a
> common case.

How come? I can easily imagine a setup where only some memcgs really
need the kill-all semantic while all the others can live with a single
task being killed perfectly fine.
--
Michal Hocko
SUSE Labs