Re: How to handle TIF_MEMDIE stalls?

On Mon, Feb 16, 2015 at 08:23:16PM +0900, Tetsuo Handa wrote:
>   (2) Implement TIF_MEMDIE timeout.

How about something like this?  This should solve the deadlock problem
in the page allocator, but it would also simplify the memcg OOM killer
and allow its use by in-kernel faults again.

--
From bb7a661bf4802e00197b08068cf2a0c33db89e85 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Mon, 16 Feb 2015 08:00:16 -0500
Subject: [patch] mm: oom_kill: fix allocation deadlock by ensuring OOM
 progress

The page allocator loops indefinitely for certain allocations, such as
orders below a certain threshold and __GFP_NOFAIL contexts.  However,
in an OOM situation, the OOM killer picks one victim and also waits
indefinitely for that victim to exit.  If the victim now gets stuck on
a state (e.g. a lock) held by the allocating task, the two deadlock.

The allocator keeps looping because __GFP_NOFAIL callers cannot
handle failure at all.  But even without that, we generally prefer
killing userspace over failing low-order allocations.

The OOM killer, on the other hand, waits for a single victim because
a) multiple racing allocations shouldn't trigger a storm of kills
when a single kill would have sufficed and b) it hands out access to
memory reserves to the dying task and doesn't want to risk depleting
them across multiple tasks, which could itself result in, ahem, an
OOM deadlock.

Implement a 5 second time-out for the OOM victim.  If it hasn't exited
by then, revoke its access to memory reserves and pick the second-best
match from the process list.  It might have already dipped into the
reserves at this point, but the risk of depleting them before somebody
exits seems lower than the chance that the current victim will exit.

This can theoretically exhaust the entire list of possible
candidates, at which point the machine would panic.  But that should
be rare, and still preferable to a silent hang.

Not-yet-signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 include/linux/oom.h   | 17 ++++++++++++++++-
 include/linux/sched.h |  3 +++
 kernel/exit.c         |  3 +--
 mm/oom_kill.c         | 34 +++++++++++++++++++++++++++++-----
 mm/page_alloc.c       |  2 +-
 5 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index d5771bed59c9..d6a0af54adfe 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,8 +48,23 @@ static inline bool oom_task_origin(const struct task_struct *p)
 }
 
 extern void mark_tsk_oom_victim(struct task_struct *tsk);
+extern void unmark_tsk_oom_victim(struct task_struct *tsk);
 
-extern void unmark_oom_victim(void);
+static inline bool task_oom_victim(struct task_struct *task)
+{
+	if (!test_tsk_thread_flag(task, TIF_MEMDIE))
+		return false;
+	smp_rmb(); /* order against setting expiry -> flag */
+	if (time_after(jiffies, task->memdie_expiry))
+		return false;
+	return true;
+}
+
+static inline void exit_task_oom(struct task_struct *task)
+{
+	if (test_tsk_thread_flag(task, TIF_MEMDIE))
+		unmark_tsk_oom_victim(task);
+}
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 048b91b983ed..75066b9070f8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1550,6 +1550,9 @@ struct task_struct {
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
 	struct task_io_accounting ioac;
+
+	unsigned long memdie_expiry;
+
 #if defined(CONFIG_TASK_XACCT)
 	u64 acct_rss_mem1;	/* accumulated rss usage */
 	u64 acct_vm_mem1;	/* accumulated virtual memory usage */
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10bbb307..76b1fa925d59 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,8 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
-		unmark_oom_victim();
+	exit_task_oom(current);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 642f38cb175a..112d3e8e9862 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -262,13 +262,32 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		return OOM_SCAN_CONTINUE;
 
 	/*
-	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * If the task was previously selected by the OOM killer, give
+	 * it a chance to exit so that racing allocations don't kill a
+	 * myriad of tasks when the memory of one would have sufficed.
+	 *
+	 * However, it's a sad fact of life that some page allocations
+	 * can't fail and loop in the allocator indefinitely, and the
+	 * first choice of OOM victim might well get stuck behind some
+	 * state (e.g. a lock) that the allocating task is holding.
+	 *
+	 * So give each victim some time before eventually moving on
+	 * to the next best pick.  Until we run out of eligible tasks
+	 * and panic, but that is quite unlikely and still preferable
+	 * to a quiet deadlock.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+		smp_rmb(); /* order against setting expiry -> flag */
+		if (time_after(jiffies, task->memdie_expiry)) {
+			pr_err("OOM victim process %d (%s) appears stuck:\n",
+			       task_pid_nr(task), task->comm);
+			show_stack(task, NULL);
+			return OOM_SCAN_CONTINUE;
+		}
 		if (!force_kill)
 			return OOM_SCAN_ABORT;
 	}
+
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
 
@@ -417,9 +436,13 @@ static DECLARE_RWSEM(oom_sem);
 void mark_tsk_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
+
+	tsk->memdie_expiry = jiffies + 5*HZ;
+	smp_wmb(); /* order against testing flag -> expiry */
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -431,13 +454,14 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
 }
 
 /**
- * unmark_oom_victim - unmarks the current task as OOM victim.
+ * unmark_tsk_oom_victim - unmarks the given task as OOM victim.
+ * @tsk: task to unmark
  *
  * Wakes up all waiters in oom_killer_disable()
  */
-void unmark_oom_victim(void)
+void unmark_tsk_oom_victim(struct task_struct *tsk)
 {
-	if (!test_and_clear_thread_flag(TIF_MEMDIE))
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
 	down_read(&oom_sem);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb4758263f6b..db7c644c267a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2572,7 +2572,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+				 unlikely(task_oom_victim(current))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
-- 
2.3.0
