On Fri 20-04-18 10:23:49, Michal Hocko wrote:
> On Thu 19-04-18 12:34:53, David Rientjes wrote:
[...]
> > I understand the concern, but it's the difference between the victim
> > getting stuck in exit_mmap() and actually taking a long time to free its
> > memory in exit_mmap(). I don't have evidence of the former.
> 
> I do not really want to repeat myself. The primary purpose of the oom
> reaper is to provide a _guarantee_ of forward progress. I do not care
> whether there is any evidence. All I know is that lock_page has a
> plethora of different dependencies and we cannot clearly state this is
> safe, so we _must not_ depend on it when setting MMF_OOM_SKIP.
> 
> The way the oom path was fragile and lockup prone based on optimistic
> assumptions shouldn't be repeated.
> 
> That being said, I haven't heard any actual technical argument about why
> locking the munmap path is a wrong thing to do, while the MMF_OOM_SKIP
> dependency on the page_lock really concerns me, so
> 
> Nacked-by: Michal Hocko <mhocko@xxxxxxxx>
> 
> If you want to keep the current locking protocol then you really have to
> make sure that the oom reaper will set MMF_OOM_SKIP when racing with
> exit_mmap.

So here is my suggestion for the fix.

From 37ab85d619f508ceaf829e57648a3d986c6d8bfc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Fri, 20 Apr 2018 13:53:08 +0200
Subject: [PATCH] mm, oom: fix concurrent munlock and oom reaper unmap

David has noticed that our current assumption that the oom reaper can
safely race with exit_mmap is not correct. Because
munlock_vma_pages_all() depends on clearing VM_LOCKED from vm_flags
before actually doing the munlock to determine if any other vmas are
locking the same memory, the check for VM_LOCKED in the oom reaper is
racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
is zapped by the oom reaper during follow_page_mask() after the check
for pmd_none() is bypassed, this ends up dereferencing a NULL ptl.

Fix this by taking the exclusive mmap_sem in exit_mmap() while tearing
down the address space. Once that is done, MMF_OOM_SKIP is set so that
__oom_reap_task_mm() can back off if it finally manages to take the
lock for reading.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/mmap.c     | 36 ++++++++++++++++++------------------
 mm/oom_kill.c |  2 +-
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..216efa6d9f61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	/*
+	 * The mm is not accessible to anybody except for the oom reaper,
+	 * which cannot race with munlocking, so make sure we exclude the
+	 * two.
+	 */
+	if (unlikely(mm_is_oom_victim(mm))) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	vma = mm->mmap;
 	if (!vma)	/* Can happen if dup_mmap() received an OOM */
-		return;
+		goto out_unlock;
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+out_unlock:
+	if (unlikely(locked)) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		up_write(&mm->mmap_sem);
+	}
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfd370526909..94d26ef2c3c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
 	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * exclusive mmap_sem taken in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
-- 
2.16.3

-- 
Michal Hocko
SUSE Labs
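
P.S. For illustration only, not part of the patch above: the locking
protocol the patch establishes between exit_mmap() and the oom reaper
can be sketched in userspace roughly as below, with a pthread rwlock
standing in for mmap_sem and an atomic flag for MMF_OOM_SKIP. All
function and variable names in the sketch are made up for the example.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool mmf_oom_skip;	/* stands in for MMF_OOM_SKIP */

static void *exit_mmap_path(void *arg)
{
	(void)arg;
	/* mm_is_oom_victim() is assumed true here: exclude the reaper. */
	pthread_rwlock_wrlock(&mmap_sem);

	/* unmap_vmas(), free_pgtables(), ... run here, exclusively. */
	puts("exit: tearing down the address space under the write lock");

	/* Publish "nothing left to reap" before letting readers back in. */
	atomic_store(&mmf_oom_skip, 1);
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

static void *oom_reaper_path(void *arg)
{
	(void)arg;
	/* The real reaper uses down_read_trylock() and retries later;
	 * blocking here just keeps the example short. */
	pthread_rwlock_rdlock(&mmap_sem);
	if (atomic_load(&mmf_oom_skip))
		puts("reaper: MMF_OOM_SKIP already set, backing off");
	else
		puts("reaper: reaping under the read lock");
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t exiter, reaper;

	pthread_create(&exiter, NULL, exit_mmap_path, NULL);
	pthread_create(&reaper, NULL, oom_reaper_path, NULL);
	pthread_join(exiter, NULL);
	pthread_join(reaper, NULL);
	return 0;
}

The ordering is the whole point: the skip flag is only set after the
teardown is complete and before the write lock is dropped, so a reaper
that eventually gets the lock for reading either sees the flag and
backs off, or ran before exit_mmap() took the lock and therefore only
saw a fully populated address space.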