Killing a user process as a result of hitting memcg limits is a serious decision that is unfortunately needed only when no forward progress in reclaiming memory can be made. Deciding the appropriate oom victim can take a sufficient amount of time that allows another process that is exiting to actually uncharge to the same memcg hierarchy and prevent unnecessarily killing user processes. An example is to prevent *multiple* unnecessary oom kills on a system with two cores where the oom kill occurs when there is an abundance of free memory available: Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0 <immediately after> repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130 Call Trace: dump_stack+0x78/0xb6 dump_header+0x55/0x240 oom_kill_process+0xc5/0x170 out_of_memory+0x305/0x4a0 try_charge+0x77b/0xac0 mem_cgroup_try_charge+0x10a/0x220 mem_cgroup_try_charge_delay+0x1e/0x40 handle_mm_fault+0xdf2/0x15f0 do_user_addr_fault+0x21f/0x420 async_page_fault+0x2f/0x40 memory: usage 61336kB, limit 102400kB, failcnt 74 Notice the second memcg oom kill shows usage is >40MB below its limit of 100MB but a process is still unnecessarily killed because the decision has already been made to oom kill by calling out_of_memory() before the initial victim had a chance to uncharge its memory. Make a last minute check to determine if an oom kill is really needed to prevent unnecessary oom killing. Cc: Vlastimil Babka <vbabka@xxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxxxx> Cc: stable@xxxxxxxxxxxxxxx Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx> --- include/linux/memcontrol.h | 7 +++++++ mm/memcontrol.c | 2 +- mm/oom_kill.c | 16 +++++++++++++--- 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); int mem_cgroup_scan_tasks(struct mem_cgroup *, int (*)(struct task_struct *, void *), void *); +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg); + static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) { if (mem_cgroup_disabled()) @@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, return 0; } +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) +{ + return 0; +} + static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) { return 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, * Returns the maximum amount of memory @mem can be charged with, in * pages. */ -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) { unsigned long margin = 0; unsigned long count; diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -972,9 +972,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } task_unlock(victim); - if (__ratelimit(&oom_rs)) - dump_header(oc, victim); - /* * Do we need to kill the entire memory cgroup? * Or even one of the ancestor memory cgroups? @@ -982,6 +979,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message) */ oom_group = mem_cgroup_get_oom_group(victim, oc->memcg); + if (is_memcg_oom(oc)) { + cond_resched(); + + /* One last check: do we *really* need to kill? */ + if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) { + put_task_struct(victim); + return; + } + } + + if (__ratelimit(&oom_rs)) + dump_header(oc, victim); + __oom_kill_process(victim, message); /*