Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 25 Jul 2017 20:26:19 +0200

On Tue, Jul 25, 2017 at 05:45:14PM +0200, Michal Hocko wrote:
> That problem is real though as reported by David.

I'm not against fixing it, I just think it's not a major concern, and
the solution doesn't seem optimal as measured by Kirill.

I'm just skeptical it's the best to solve that tiny race, 99.9% of the
time such down_write is unnecessary.

> it is not only about exit_mmap. __mmput calls into exit_aio and that can
> wait for completion and there is no way to guarantee this will finish in
> finite time.

exit_aio blocking is actually the only good point for wanting this
concurrency where exit_mmap->unmap_vmas and
oom_reap_task->unmap_page_range have to run concurrently on the same
mm.

exit_mmap would have no issue, if there was enough time in the
lifetime CPU to allocate the memory, sure the memory will also be
freed in finite amount of time by exit_mmap.

In fact you mentioned multiple OOM in the NUMA case, exit_mmap may not
solve that, so depending on the runtime it may have been better not to
wait all memory of the process to be freed before moving to the next
task, but only a couple of seconds before the OOM reaper moves to a
new candidate. Again this is only a tradeoff between solving the OOM
faster vs risk of false positives OOM.

If it wasn't because of exit_aio (which may have to wait I/O
completion), changing the OOM reaper to return "false" if
mmget_not_zero returns zero and MMF_OOM_SKIP is not set yet, would
have been enough (and depending on the runtime it may have solved OOM
faster in NUMA) and there would be absolutely no need to run OOM
reaper and exit_mmap concurrently on the same mm. However there's such
exit_aio..

Raw I/O mempools never require memory allocations, although aio if it
involves a filesystem to complete may run into filesystem or buffering
locks which are known to loop forever or depend on other tasks stuck
in kernel allocations, so I didn't go down that chain too long.

So the simplest is to use a similar trick to what ksm_exit uses, this
is untested just to show the idea it may require further adjustment as
the bit isn't used only for the test_and_set_bit locking, but I didn't
see much issues with other set_bit/test_bit.

>From f414244480fdc1f771b3148feb3fac77ec813e4c Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Tue, 25 Jul 2017 20:02:27 +0200
Subject: [PATCH 1/1] mm: oom: let oom_reap_task and exit_mmap to run
 concurrently

This is purely required because exit_aio() may block and exit_mmap() may
never start, if the oom_reap_task cannot start running on a mm with
mm_users == 0.

At the same time if the OOM reaper doesn't wait at all for the memory
of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
would generate a spurious OOM kill.

If it wasn't because of the exit_aio it would be enough to change the
oom_reap_task in the case it finds mm_users == 0, to wait for a
timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's
not just exit_mmap the problem here so the concurrency of exit_mmap
and oom_reap_task is apparently warranted.

It's a non standard runtime, exit_mmap runs without mmap_sem, and
oom_reap_task runs with the mmap_sem for reading as usual (kind of
MADV_DONTNEED).

The race between the two is solved with pair of test_and_set_bit and a
dummy down_write/up_write cycle on the same lines of the ksm_exit
method.

If the OOM reaper sets the bit first, exit_mmap will wait it to finish
in down_write (before taking down mm structures that would make the
oom_reap_task fail with use after free).

If exit_mmap comes first, oom reaper will skip the mm as it's already
all freed.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 kernel/fork.c |  1 -
 mm/mmap.c     |  5 +++++
 mm/oom_kill.c | 15 ++++++++++++---
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9ec98b0c4675..ed412d85a596 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -910,7 +910,6 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
-	set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmdrop(mm);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f19efcf75418..615133762b99 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2993,6 +2993,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+		/* wait the OOM reaper to stop working on this mm */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..2a7000995784 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -471,6 +471,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	bool ret = true;
+	bool mmgot = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
@@ -500,9 +501,16 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * and delayed __mmput doesn't matter that much
 	 */
 	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		/*
+		 * MMF_OOM_SKIP is set by exit_mmap when the OOM
+		 * reaper can't work on the mm anymore.
+		 */
+		if (test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+			up_read(&mm->mmap_sem);
+			goto unlock_oom;
+		}
+		mmgot = false;
 	}
 
 	trace_start_task_reaping(tsk->pid);
@@ -547,7 +555,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * different context because we shouldn't risk we get stuck there and
 	 * put the oom_reaper out of the way.
 	 */
-	mmput_async(mm);
+	if (mmgot)
+		mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
 unlock_oom:
 	mutex_unlock(&oom_lock);

This will perform the same as then the set_bit in __mmput can be
removed and test_and_set_bit doesn't cost more (at least on x86, on
other archs it requires an smp_mb too, but it's marginal difference),
still an atomic op that is.

> > 4) how is it safe to overwrite a VM_FAULT_RETRY that returns without
> >    mmap_sem and then the arch code will release the mmap_sem despite
> >    it was already released by handle_mm_fault? Anonymous memory faults
> >    aren't common to return VM_FAULT_RETRY but an userfault
> >    can. Shouldn't there be a block that prevents overwriting if
> >    VM_FAULT_RETRY is set below? (not only VM_FAULT_ERROR)
> > 
> > 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
> > 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
> > 		ret = VM_FAULT_SIGBUS;
> 
> I am not sure I understand what you mean and how this is related to the
> patch?

It's not related to the patch but it involves the OOM reaper as it
only happens when MMF_UNSTABLE is set which is set only by the OOM
reaper. I was simply reading the OOM reaper code and following up what
MMF_UNSTABLE does and it ringed a bell.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>