On Thu, May 18, 2017 at 11:00:39AM +0200, Michal Hocko wrote:
> On Thu 18-05-17 10:47:29, Michal Hocko wrote:
> >
> > Hmm, I guess you are right. I haven't realized that pagefault_out_of_memory
> > can race and pick up another victim. For some reason I thought that the
> > page fault would break out on fatal signal pending but we don't do that (we
> > used to in the past). Now that I think about that more we should
> > probably remove out_of_memory out of pagefault_out_of_memory completely.
> > It is racy and it basically doesn't have any allocation context so we
> > might kill a task from a different domain. So can we do this instead?
> > There is a slight risk that somebody might have returned VM_FAULT_OOM
> > without doing an allocation but from my quick look nobody does that
> > currently.
>
> If this is considered too risky then we can do what Roman was proposing
> and check tsk_is_oom_victim in pagefault_out_of_memory and bail out.

Hi, Michal!

If we consider this approach, I've prepared a separate patch for this problem
(stripped all oom reaper list stuff).

Thanks!

From 317fad44a0fe79fb76e8e4fd6bd81c52ae1712e9 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@xxxxxx>
Date: Tue, 16 May 2017 21:19:56 +0100
Subject: [PATCH] mm,oom: prevent OOM double kill from a pagefault handling path

During the debugging of some OOM-related stuff, I've noticed
that sometimes OOM kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel:

[ 25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 25.725658] allocate cpuset=/ mems_allowed=0
[ 25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[ 25.729215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 25.729598] Call Trace:
[ 25.729598] dump_stack+0x63/0x82
[ 25.729598] dump_header+0x97/0x21a
[ 25.729598] ? do_try_to_free_pages+0x2d7/0x360
[ 25.729598] ? security_capable_noaudit+0x45/0x60
[ 25.729598] oom_kill_process+0x219/0x3e0
[ 25.729598] out_of_memory+0x11d/0x480
[ 25.729598] __alloc_pages_slowpath+0xc84/0xd40
[ 25.729598] __alloc_pages_nodemask+0x245/0x260
[ 25.729598] alloc_pages_vma+0xa2/0x270
[ 25.729598] __handle_mm_fault+0xca9/0x10c0
[ 25.729598] handle_mm_fault+0xf3/0x210
[ 25.729598] __do_page_fault+0x240/0x4e0
[ 25.729598] trace_do_page_fault+0x37/0xe0
[ 25.729598] do_async_page_fault+0x19/0x70
[ 25.729598] async_page_fault+0x28/0x30
< cut >
[ 25.810868] oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
< cut >
[ 25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
[ 25.818821] allocate cpuset=/ mems_allowed=0
[ 25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[ 25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 25.820549] Call Trace:
[ 25.820733] dump_stack+0x63/0x82
[ 25.820961] dump_header+0x97/0x21a
[ 25.820961] ? security_capable_noaudit+0x45/0x60
[ 25.820961] oom_kill_process+0x219/0x3e0
[ 25.820961] out_of_memory+0x11d/0x480
[ 25.820961] pagefault_out_of_memory+0x68/0x80
[ 25.820961] mm_fault_error+0x8f/0x190
[ 25.820961] ? handle_mm_fault+0xf3/0x210
[ 25.820961] __do_page_fault+0x4b2/0x4e0
[ 25.820961] trace_do_page_fault+0x37/0xe0
[ 25.820961] do_async_page_fault+0x19/0x70
[ 25.820961] async_page_fault+0x28/0x30
< cut >
[ 25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[ 25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

This actually happens if pagefault_out_of_memory() is called after
the calling process has already been selected as an OOM victim and killed.
There is a race with the oom reaper: if the process is reaped before
it enters out_of_memory(), the MMF_OOM_SKIP flag is set, and out_of_memory()
will not consider the process as an eligible victim. That means that
another victim will be selected and killed.

Tetsuo Handa has noticed that this is a side effect of
commit 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the
allocator slowpath").

To avoid this, out_of_memory() shouldn't be called from
pagefault_out_of_memory() if the current task has already been chosen
as an oom victim.

v2: dropped changes related to the oom_reaper synchronization,
    as it looks like a separate and minor issue;
    rebased on new mm;
    renamed, updated commit message.

Signed-off-by: Roman Gushchin <guro@xxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
Cc: kernel-team@xxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..9c643a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1068,6 +1068,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (tsk_is_oom_victim(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.7.4
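
For anyone who wants to look at the race in isolation, here is a minimal userspace sketch in C of the sequence described in the commit message. It is not kernel code. The struct, the model_* helpers and count_kills() are hypothetical names made up for illustration, victim selection is reduced to "largest RSS that has not been reaped", and the 900000 kB figure for the allocating task is an assumed value (only the firewalld anon-rss is taken from the log above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct task {
	const char *name;
	long rss_kb;      /* stand-in for the badness score */
	bool oom_victim;  /* models tsk_is_oom_victim() */
	bool oom_skip;    /* models MMF_OOM_SKIP set by the oom reaper */
	bool killed;
};

/* Simplified victim selection: largest RSS, skipping reaped tasks. */
static struct task *select_victim(struct task *tasks, size_t n)
{
	struct task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].oom_skip)
			continue;
		if (!best || tasks[i].rss_kb > best->rss_kb)
			best = &tasks[i];
	}
	return best;
}

/* Models out_of_memory(): returns the number of tasks killed (0 or 1). */
static int model_out_of_memory(struct task *tasks, size_t n)
{
	struct task *victim = select_victim(tasks, n);

	if (!victim)
		return 0;
	victim->oom_victim = true;
	victim->killed = true;
	return 1;
}

/* Models the oom reaper: frees the victim's memory and marks it skipped. */
static void model_oom_reaper(struct task *t)
{
	t->rss_kb = 0;
	t->oom_skip = true;
}

/* Models pagefault_out_of_memory(), with or without the proposed check. */
static int model_pagefault_oom(struct task *tasks, size_t n,
			       struct task *current, bool check_victim)
{
	if (check_victim && current->oom_victim)
		return 0;
	return model_out_of_memory(tasks, n);
}

/* Replays the sequence from the log and returns the total number of kills. */
int count_kills(bool check_victim)
{
	struct task tasks[] = {
		{ "allocate",  900000, false, false, false }, /* assumed size */
		{ "firewalld",  20956, false, false, false },
	};
	size_t n = sizeof(tasks) / sizeof(tasks[0]);
	struct task *current = &tasks[0];
	int kills = 0;

	/* 1. The allocation slowpath invokes the OOM killer on "allocate". */
	kills += model_out_of_memory(tasks, n);
	assert(current->killed);

	/* 2. The oom reaper wins the race and reaps the victim. */
	model_oom_reaper(current);

	/* 3. The victim's page fault returns VM_FAULT_OOM afterwards. */
	kills += model_pagefault_oom(tasks, n, current, check_victim);
	return kills;
}
```

In this model count_kills(false) follows the unpatched path and returns 2 (allocate, then firewalld), while count_kills(true) follows the path with the tsk_is_oom_victim() check and returns 1.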