On Fri 23-06-17 19:39:46, Roman Gushchin wrote: > On Fri, Jun 23, 2017 at 03:43:24PM +0200, Michal Hocko wrote: > > On Thu 22-06-17 18:10:03, Roman Gushchin wrote: > > > Hi, Michal! > > > > > > Thank you very much for the review. I've tried to address your > > > comments in v3 (sent yesterday), so that is why it took some time to reply. > > > > I will try to look at it sometimes next week hopefully > > Thanks! > > > > > - You seem to completely ignore per task oom_score_adj and override it > > > > by the memcg value. This makes some sense but it can lead to an > > > > unexpected behavior when somebody relies on the original behavior. > > > > E.g. a workload that would corrupt data when killed unexpectedly and > > > > so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will > > > > break when running inside a container. I do not have a good answer > > > > what is the desirable behavior and maybe there is no universal answer. > > > > Maybe you just do not to kill those tasks? But then you have to be > > > > careful when selecting a memcg victim. Hairy... > > > > > > I do not ignore it completely, but it matters only for root cgroup tasks > > > and inside a cgroup when oom_kill_all_tasks is off. > > > > > > I believe, that cgroup v2 requirement is a good enough. I mean you can't > > > move from v1 to v2 without changing cgroup settings, and if we will provide > > > per-cgroup oom_score_adj, it will be enough to reproduce the old behavior. > > > > > > Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware > > > oom killer off completely and provide compatibility mode. > > > We can't really save the old system-wide behavior of per-process oom_score_adj, > > > it makes no sense in the containerized environment. > > > > So what you are going to do with those applications that simply cannot > > be killed and which set OOM_SCORE_ADJ_MIN explicitly. Are they > > unsupported? How does a user find out? One way around this could be to > > simply to not kill tasks with OOM_SCORE_ADJ_MIN. > > They won't be killed by cgroup OOM, but under some circumstances can be killed > by the global OOM (e.g. there are no other tasks in the selected cgroup, > cgroup v2 is used, and per-cgroup oom score adjustment is not set). Hmm, mem_cgroup_select_oom_victim will happily select a memcg which contains OOM_SCORE_ADJ_MIN tasks because it ignores per-task score adj. So memcg OOM killer can kill those tasks AFAICS. But that is not all that important. Becasuse... > I believe, that per-process oom_score_adj should not play any role outside > of the containing cgroup, it's violation of isolation. > > Right now if tasks with oom_score_adj=-1000 eating all memory in a cgroup, > they will be looping forever, OOM killer can't fix this. ... Yes and that is a price we have to pay for the hard requirement that oom killer never kills OOM_SCORE_ADJ_MIN task. It is hard to change that without breaking any existing userspace which relies on the configuration to protect from an unexpected SIGKILL. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>