Re: [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Tetsuo Handa wrote:
> > No, NACK.  You cannot prohibit an exiting process from gaining access to 
> > memory reserves and randomly killing another process without additional 
> > chances of a livelock.  The goal is for an exiting or killed process to 
> > be able to exit so it can free its memory, not kill additional processes.
> 
> I know what these shortcuts are trying to do. I'm pointing out that these
> shortcuts have a chance of silent OOM livelock. If we preserve these shortcuts,
> we had better not to wait forever. We need to kill additional processes if
> exiting or killed process seems to got stuck.
> 
> Same with http://lkml.kernel.org/r/20160217143917.GP29196@xxxxxxxxxxxxxx .
> 
Or, can we accept something like below?
--------------------------------------------------------------------------------
>From 3a9231486624ad34bbf84f9798523c05f5d401d5 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Tue, 23 Feb 2016 22:04:49 +0900
Subject: [PATCH] mm,oom: check mmap_sem lockability when using shortcuts for
 SIGKILL and PF_EXITING cases

Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there is
a thread which is exiting. But it is possible that that thread is blocked
at down_read(&mm->mmap_sem) in exit_mm() called from do_exit() whereas
one of threads sharing that memory is doing a GFP_KERNEL allocation
between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
(e.g. mmap()).

----------
T1                  T2
                    Calls mmap()
Calls _exit(0)
                    Arrives at vm_mmap_pgoff()
Arrives at do_exit()
Gets PF_EXITING via exit_signals()
                    Calls down_write(&mm->mmap_sem)
                    Calls do_mmap_pgoff()
Calls down_read(&mm->mmap_sem) from exit_mm()
                    Calls out of memory via a GFP_KERNEL allocation but
                    oom_scan_process_thread(T1) returns OOM_SCAN_ABORT
----------

down_read(&mm->mmap_sem) by T1 is waiting for up_write(&mm->mmap_sem) by
T2 while oom_scan_process_thread() by T2 is waiting for T1 to set
T1->mm = NULL. Under such situation, the OOM killer does not choose
a victim, which results in silent OOM livelock problem.

Also, sending SIGKILL to all user processes sharing the same memory is
omitted by three locations.

About setting TIF_MEMIE on current->mm && fatal_signal_pending(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(!CLONE_SIGHAND && CLONE_VM) and one independent thread group
  named P3. A sequence shown below is possible.

  ----------
  P1             P2             P3
  Do something that invokes a __GFP_FS memory allocation (e.g. page fault).
                 Calls mmap().
                                Calls kill(P1, SIGKILL).
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
                                Sends SIGKILL to P1.
  fatal_signal_pending(P1) becomes true.
                 Calls do_mmap_pgoff().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Arrives at do_exit().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on current->mm && task_will_free_mem(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(CLONE_VM). A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls tty_audit_exit().
  Do a GFP_KERNEL allocation from tty_audit_log().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on p->mm && task_will_free_mem(p)
at oom_kill_process():

  There are two thread groups named P1 and P2 that are created using
  clone(CLONE_VM). A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 Calls do_mmap_pgoff().
                 Calls out_of_memory().
                 select_bad_process() returns P1.
                 oom_kill_process() sets TIF_MEMDIE on P1.
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on fatal_signal_pending(current)
at mem_cgroup_out_of_memory():

  mem_cgroup_out_of_memory() is called from a lockless context via
  mem_cgroup_oom_synchronize() called from pagefault_out_of_memory()
  is talking about only current thread. If global OOM condition follows
  before memcg OOM condition is solved, the same problem will occur.

About setting TIF_MEMIE on task_will_free_mem(current)
at mem_cgroup_out_of_memory():

  I don't know whether it is possible to call mem_cgroup_out_of_memory()
  between getting PF_EXITING and doing current->mm = NULL. But if it is
  possible to call, then the same problem will occur.

This patch checks whether the mm of a thread which the caller of
out_of_memory() is trying to wait for termination can be locked for read,
in order to avoid silent OOM livelock.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
 include/linux/oom.h |  1 +
 mm/memcontrol.c     |  3 ++-
 mm/oom_kill.c       | 39 ++++++++++++++++++++++++++++++---------
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..efd7aa5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -86,6 +86,7 @@ extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint,
 			       struct mem_cgroup *memcg);
 
+extern bool task_can_read_lock_mm(struct task_struct *tsk);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		struct task_struct *task, unsigned long totalpages);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae8b81c..c0dca1b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1258,7 +1258,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    task_can_read_lock_mm(current)) {
 		mark_oom_victim(current);
 		goto unlock;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7bb9c1..8a27967 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,22 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+bool task_can_read_lock_mm(struct task_struct *tsk)
+{
+	struct mm_struct *mm;
+	bool ret = false;
+
+	task_lock(tsk);
+	mm = tsk->mm;
+	if (mm && down_read_trylock(&mm->mmap_sem)) {
+		up_read(&mm->mmap_sem);
+		ret = true;
+	}
+	task_unlock(tsk);
+	return ret;
+}
+
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -278,7 +294,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+	if (test_tsk_thread_flag(task, TIF_MEMDIE) &&
+	    task_can_read_lock_mm(task)) {
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
 	}
@@ -292,7 +309,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
+	if (task_will_free_mem(task) && !is_sysrq_oom(oc) &&
+	    task_can_read_lock_mm(task))
 		return OOM_SCAN_ABORT;
 
 	return OOM_SCAN_OK;
@@ -688,14 +706,16 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
+	if (task_can_read_lock_mm(p)) {
+		task_lock(p);
+		if (p->mm && task_will_free_mem(p)) {
+			mark_oom_victim(p);
+			task_unlock(p);
+			put_task_struct(p);
+			return;
+		}
 		task_unlock(p);
-		put_task_struct(p);
-		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -868,7 +888,8 @@ bool out_of_memory(struct oom_control *oc)
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	    (fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    task_can_read_lock_mm(current)) {
 		mark_oom_victim(current);
 		return true;
 	}
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]