[PATCH] mm,oom: Set TIF_MEMDIE on all OOM-killed threads.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Mon, 21 Mar 2016 12:07:14 +0900

If the page allocator was able to declare OOM, and that was an allocation
request which can call the OOM killer, the OOM killer is called. The OOM
killer selects an OOM victim thread and sets TIF_MEMDIE only on that
thread based on assumption that threads with TIF_MEMDIE can terminate
shortly.

As we already know it is possible that TIF_MEMDIE threads cannot make
forward progress because they are blocked at unkillable locks waiting for
somebody else doing !__GFP_FS allocation requests in order to perform fs
writeback or storage I/O operations.

Since we use the same watermark for GFP_NOIO / GFP_NOFS / GFP_KERNEL
allocation requests, allocation requests for fs writeback (GFP_NOFS)
and storage I/O (GFP_NOIO) stop making forward progress when TIF_MEMDIE
threads started waiting for GFP_NOFS or GFP_NOIO allocation requests.

If the OOM reaper clears TIF_MEMDIE and mark that thread group as
OOM-unkillable (or does equivalent things) in order to let the OOM killer
select next OOM victim, allocation requests for fs writeback or storage
I/O operations will eventually succeed.

But current OOM reaper patch happily disables further reaping because it
clears TIF_MEMDIE of only one thread and marks only one thread group as
OOM-unkillable when more than one thread groups are sharing the victim's
memory. This results in OOM-livelock situation. (Though, this behavior is
really helpful for reproducing almost OOM situation for testing purpose.)

If the OOM reaper marks all thread groups as OOM-unkillable, such
situation can be avoided. But, it is a bad assumption that the OOM reaper
will mark all thread groups as OOM-unkillable because there is no
guarantee that the OOM reaper is always woken up. We can fail to avoid
such situation unless we wake up the OOM reaper whenever TIF_MEMDIE is
set regardless of whether the memory used by the victim is reapable (i.e.
let the OOM reaper check whether it is safe to reap the victim's memory)
and clear TIF_MEMDIE and mark as OOM-unkillable (i.e. let the OOM reaper
act as a timer).

Rather, it is more simpler that we do not depend on the OOM reaper for
marking as OOM-unkillable. The purpose of marking as OOM-unkillable is
to prevent the OOM killer from selecting the same thread group forever.
We can mark as OOM-unkillable as of setting TIF_MEMDIE.

Therefore, this patch sets TIF_MEMDIE on all threads of all thread groups
that are killed by oom_kill_process() and marks all thread groups that are
killed by oom_kill_process() as OOM-unkillable when oom_kill_process() is
called.

This patch also solves SysRq-f problem that it continues selecting the
same thread group forever when the OOM reaper was not woken up (likely
because somebody sharing the victim's mm was marked as OOM-unkillable) or
failed to reap that memory (likely because somebody sharing the victim's
mm was holding mm->mmap_sem for write) and we have to use SysRq-f.
(Though, it is really useless for production systems to require SysRq-f.
I really want to allow making forward progress if the OOM reaper was not
woken up or failed to reap that memory.)

This patch also changes sysctl_oom_kill_allocating_task behavior (i.e. all
children will be killed by oom_kill_process() when children are unable to
terminate before current thread calls out_of_memory() again). But given
that the purpose of sysctl_oom_kill_allocating_task is to kill quickly,
it is not a good thing to continue spamming the kernel buffer with OOM
messages (including the list of OOM-killable processes) forever when
the killed child cannot terminate soon.

Setting TIF_MEMDIE on all threads of all thread groups that are killed
by the OOM killer unlikely increases possibility of depletion of memory
reserves because TIF_MEMDIE helps only if that thread is doing memory
allocation. Instead, compared to setting TIF_MEMDIE on randomly chosen one
thread with an assumption that that thread is doing memory allocation that
can call out_of_memory() shortly while it is possible that thread is doing
!__GFP_FS && !__GFP_NOFAIL allocation or is blocked by some unkillable
lock, this change can increase possibility of leaving memory allocating
loop by allowing any victim thread doing memory allocation to access
memory reserves when one of threads which is not selected by current
approach can make forward progress if TIF_MEMDIE is set.

If we set TIF_MEMDIE on all threads of all thread groups that are killed
by oom_kill_process(), we no longer need shortcuts that set TIF_MEMDIE on
only one thread when the victim has pending SIGKILL or is exiting. This
increases the possibility of successfully waking up the OOM reaper by not
hitting the shortcuts.

As a summary, this patch does the following things.

  (1) Set TIF_MEMDIE on all threads which should be terminated
      when oom_kill_process() was called.

  (2) Update "struct signal_struct"->oom_score_adj to OOM_SCORE_ADJ_MIN
      when oom_kill_process() was called.

  (3) Call wake_oom_reaper() for all threads which should be terminated
      when oom_kill_process() was called, for oom_reap_task() is
      responsible for clearing TIF_MEMDIE from all threads after reaping
      the victim's memory completed.

  (4) Allow SysRq-f to select a different thread group when
      oom_reap_task() was not called or failed to reap the victim's
      memory.

  (5) Eliminate shortcuts which set TIF_MEMDIE on only one thread.

If we add a flag to "struct signal_struct" for remembering whether
this thread group was selected for OOM victim, we can stop playing with
"struct signal_struct"->oom_score_adj value. Also, if we add a flag to
"struct signal_struct" or "struct mm_struct" for remembering whether
oom_reap_task() completed reaping this memory, we can stop clearing
TIF_MEMDIE from remote threads. But these changes are outside of this
patch's scope.

Maybe we should make sure that sysctl_oom_kill_allocating_task case
selects the current thread and oom_task_origin() case selects that
thread. But these changes are outside of this patch's scope.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>

---
 include/linux/oom.h | 11 -------
 mm/memcontrol.c     | 10 ------
 mm/oom_kill.c       | 93 ++++++++++++++++++++++-------------------------------
 3 files changed, 38 insertions(+), 76 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..9e2b524 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -102,17 +102,6 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	return (task->flags & PF_EXITING) &&
-		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
-}
-
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36db05f..c8433a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1249,16 +1249,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	mutex_lock(&oom_lock);
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		goto unlock;
-	}
-
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23b8b06..bc6874b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -428,28 +428,17 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		return true;
-
-	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
-		task_unlock(p);
+	task_lock(tsk);
+	mm = tsk->mm;
+	if (!mm || !atomic_inc_not_zero(&mm->mm_users)) {
+		task_unlock(tsk);
 		return true;
 	}
-
-	task_unlock(p);
+	task_unlock(tsk);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -493,7 +482,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * improvements. This also means that selecting this task doesn't
 	 * make any sense.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
 	exit_oom_victim(tsk);
 out:
 	mmput(mm);
@@ -679,19 +667,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
 
@@ -743,13 +718,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Make sure we do not try to oom reap the mm multiple times */
 	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
 
-	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
-	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -770,8 +738,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
-		if (same_thread_group(p, victim))
-			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			/*
@@ -782,12 +748,43 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
+		/*
+		 * We should send SIGKILL before setting TIF_MEMDIE in order to
+		 * prevent the OOM victim from depleting the memory reserves
+		 * from the user space under its control.
+		 */
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
+		/*
+		 * Mark this process as no longer OOM-killable in order to
+		 * prevent the OOM killer (including SysRq-f) from selecting
+		 * the same process forever.
+		 */
+		task_lock(p);
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		task_unlock(p);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	/*
+	 * If we can call the OOM reaper, queue all TIF_MEMDIE threads because
+	 * the OOM reaper is responsible for clearing TIF_MEMDIE in order to
+	 * allow the OOM killer to select next victim if the OOM situation
+	 * still remains after reaping the mm completed.
+	 */
+	if (can_oom_reap) {
+		rcu_read_lock();
+		for_each_process_thread(p, t) {
+			if (t->mm == mm)
+				wake_oom_reaper(t);
+		}
+		rcu_read_unlock();
+	}
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -859,20 +856,6 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
-	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		return true;
-	}
-
-	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>