The patch titled
     Subject: mm, oom: allow oom reaper to race with exit_mmap
has been added to the -mm tree.  Its filename is
     mm-oom-allow-oom-reaper-to-race-with-exit_mmap.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-oom-allow-oom-reaper-to-race-with-exit_mmap.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-oom-allow-oom-reaper-to-race-with-exit_mmap.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, oom: allow oom reaper to race with exit_mmap

David has noticed that the oom killer might kill additional tasks while
the exiting oom victim hasn't terminated yet, because the oom_reaper marks
the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
to 0.  The race is as follows:

oom_reap_task				do_exit
					  exit_mm
  __oom_reap_task_mm
					    mmput
					      __mmput
    mmget_not_zero # fails
					      exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)

The victim is still visible to the OOM killer until it is unhashed.

Currently we try to reduce the risk of this race by taking oom_lock and
by making out_of_memory sleep while holding the lock, to give the victim
some time to exit.  This is quite a suboptimal approach because there is
no guarantee that the victim (especially a large one) will manage to
unmap its address space and free enough memory for the particular oom
domain which needs memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race.  __oom_reap_task_mm is basically equivalent to MADV_DONTNEED and
that is allowed to run in parallel with other unmappers (hence the
mmap_sem for read).  The only tricky part is to exclude page table
teardown and all operations which modify the address space in the
__mmput path.  exit_mmap doesn't expect any other users, so it doesn't
use any locking.  Nothing really forbids us from using mmap_sem for
write, though.  In fact we are already relying on this lock earlier in
the __mmput path to synchronize with ksm and khugepaged.  Take the
exclusive mmap_sem when calling free_pgtables and destroying vmas to
synchronize with __oom_reap_task_mm, which takes the lock for read.  All
other operations can safely race with the parallel unmap.

The previous version of the patch was posted here [1].  The original
patch took mmap_sem in exit_mmap unconditionally, but Kirill was worried
this could have a performance impact (we should exercise the fast path
most of the time because nobody should be holding the lock at that
stage).  An artificial testcase [2] has shown a ~3% difference, but the
numbers are quite noisy [3], so the effect is not all that clear.
Anyway, I have made the lock conditional for oom victims.

Andrea has proposed an alternative solution [4] which should be
functionally equivalent, similar to {ksm,khugepaged}_exit.  I have to
confess I really don't like that approach, but I can live with it if
that is the preferred way (to be honest I would like to drop the empty
down_write();up_write() from the other two callers as well).  In fact I
have asked Andrea to post his patch [5], but that hasn't happened.  I do
not think we should wait much longer before finally merging some fix.
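For reference, the synchronization scheme described above can be sketched
as a small userspace program (a minimal, illustrative analogy only:
pthread_rwlock_t stands in for mmap_sem, and the mmap_torn_down flag
stands in for mm->mmap being cleared; none of these names appear in the
kernel patch itself):

/* Userspace analogy of the mmap_sem dance introduced by this patch. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static bool mmap_torn_down;	/* stands in for mm->mmap == NULL */

/* exit_mmap(): only the destructive tail runs under the exclusive lock */
static void *exit_mmap_thread(void *arg)
{
	/* the unmap_vmas() equivalent would run here, without any lock */
	pthread_rwlock_wrlock(&mmap_sem);	/* excludes the reaper */
	mmap_torn_down = true;			/* free_pgtables(), remove_vma() */
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

/* __oom_reap_task_mm(): read trylock, back off if teardown holds it */
static void *oom_reaper_thread(void *arg)
{
	if (pthread_rwlock_tryrdlock(&mmap_sem) != 0)
		return NULL;			/* would be retried later */
	if (!mmap_torn_down)
		puts("reaping: address space still intact");
	else
		puts("nothing to reap: page tables already freed");
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t exiter, reaper;

	pthread_create(&exiter, NULL, exit_mmap_thread, NULL);
	pthread_create(&reaper, NULL, oom_reaper_thread, NULL);
	pthread_join(exiter, NULL);
	pthread_join(reaper, NULL);
	return 0;
}

Whichever thread wins the race, the reaper either sees a consistent
address space under the read lock or backs off; it can never observe the
page table teardown in progress.  The exclusive lock covers only the
destructive tail of exit_mmap, so the common exit path stays lock-free
unless the task is an oom victim.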
[1] http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@xxxxxxxxxx
[2] http://lkml.kernel.org/r/20170725142626.GJ26723@xxxxxxxxxxxxxx
[3] http://lkml.kernel.org/r/20170725160359.GO26723@xxxxxxxxxxxxxx
[4] http://lkml.kernel.org/r/20170726162912.GA29716@xxxxxxxxxx
[5] http://lkml.kernel.org/r/20170728062345.GA2274@xxxxxxxxxxxxxx

Link: http://lkml.kernel.org/r/20170810081632.31265-1-mhocko@xxxxxxxxxx
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Andrea Arcangeli <andrea@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: "Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/mmap.c     |   16 ++++++++++++++++
 mm/oom_kill.c |   47 ++++++++---------------------------------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff -puN mm/mmap.c~mm-oom-allow-oom-reaper-to-race-with-exit_mmap mm/mmap.c
--- a/mm/mmap.c~mm-oom-allow-oom-reaper-to-race-with-exit_mmap
+++ a/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>

 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2975,6 +2976,7 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;

 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -3001,6 +3003,17 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);

+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 * Please note that mark_oom_victim is always called under task_lock
+	 * with tsk->mm != NULL checked on !current tasks which synchronizes
+	 * with exit_mm and so we cannot race here.
+	 */
+	if (tsk_is_oom_victim(current)) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);

@@ -3013,7 +3026,10 @@ void exit_mmap(struct mm_struct *mm)
 		nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
+	if (locked)
+		up_write(&mm->mmap_sem);
 }

 /* Insert vm structure into process list sorted by address
diff -puN mm/oom_kill.c~mm-oom-allow-oom-reaper-to-race-with-exit_mmap mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-allow-oom-reaper-to-race-with-exit_mmap
+++ a/mm/oom_kill.c
@@ -470,40 +470,15 @@ static bool __oom_reap_task_mm(struct ta
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);

 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}

-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
+	/* There is nothing to reap so bail out without signs in the log */
+	if (!mm->mmap)
+		goto unlock;

 	trace_start_task_reaping(tsk->pid);

@@ -540,18 +515,12 @@ static bool __oom_reap_task_mm(struct ta
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);

-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+unlock:
+	up_read(&mm->mmap_sem);
+
+	return true;
 }

 #define MAX_OOM_REAP_RETRIES 10
_

Patches currently in -mm which might be from mhocko@xxxxxxxx are

mm-memory_hotplug-display-allowed-zones-in-the-preferred-ordering.patch
mm-memory_hotplug-remove-zone-restrictions.patch
mm-page_alloc-rip-out-zonelist_order_zone.patch
mm-page_alloc-remove-boot-pageset-initialization-from-memory-hotplug.patch
mm-page_alloc-do-not-set_cpu_numa_mem-on-empty-nodes-initialization.patch
mm-memory_hotplug-drop-zone-from-build_all_zonelists.patch
mm-memory_hotplug-remove-explicit-build_all_zonelists-from-try_online_node.patch
mm-page_alloc-simplify-zonelist-initialization.patch
mm-page_alloc-remove-stop_machine-from-build_all_zonelists.patch
mm-memory_hotplug-get-rid-of-zonelists_mutex.patch
mm-sparse-page_ext-drop-ugly-n_high_memory-branches-for-allocations.patch
mm-vmscan-do-not-loop-on-too_many_isolated-for-ever.patch
mm-vmscan-do-not-loop-on-too_many_isolated-for-ever-fix.patch
treewide-remove-gfp_temporary-allocation-flag.patch
mm-rename-global_page_state-to-global_zone_page_state.patch
mm-hugetlb-do-not-allocate-non-migrateable-gigantic-pages-from-movable-zones.patch
mm-oom-allow-oom-reaper-to-race-with-exit_mmap.patch
fs-proc-remove-priv-argument-from-is_stack.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html