Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task


 



Michal Hocko wrote:
> On Tue 15-03-16 20:15:24, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 23b8b06..0464727 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -543,7 +543,7 @@ static int oom_reaper(void *unused)
> >  
> >  static void wake_oom_reaper(struct task_struct *tsk)
> >  {
> > -	if (!oom_reaper_th)
> > +	if (!oom_reaper_th || tsk->oom_reaper_list)
> >  		return;
> >  
> >  	get_task_struct(tsk);
> 
> OK, this is indeed much simpler. Back then when the list_head was used
> this would be harder and I didn't realize moving away from the list_head
> would simplify this as well. Care to send a patch to replace the
> oom-oom_reaper-disable-oom_reaper-for-oom_kill_allocating_task.patch in
> the mmotm tree?

Sent as 1458124527-5441-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx .

> > Two thread groups sharing the same mm can disable the OOM reaper
> > when all threads in the former thread group (which will be chosen
> > as an OOM victim by the OOM killer) can immediately call exit_mm()
> > via do_exit() (e.g. simply sleeping in killable state when the OOM
> > killer chooses that thread group) and some thread in the latter thread
> > group is contended on unkillable locks (e.g. inode mutex), due to
> >
> > 	p = find_lock_task_mm(tsk);
> > 	if (!p)
> > 		return true;
> >
> > in __oom_reap_task() and
> >
> > 	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> >
> > in oom_kill_process(). The OOM reaper is woken up in order to reap
> > the former thread group's memory, but it does nothing on the latter
> > thread group's memory because the former thread group can clear its mm
> > before the OOM reaper locks its mm. Even if subsequent out_of_memory()
> > call chose the latter thread group, the OOM reaper will not be woken up.
> > No memory is reaped. We need to queue all thread groups sharing that
> > memory if that memory should be reaped.
>
> Why it wouldn't be enough to wake the oom reaper only for the oom
> victims? If the oom reaper races with the victims exit path then
> the next round of the out_of_memory will select a different thread
> sharing the same mm.

If we use MMF_OOM_KILLED, we can hit the sequence below.

(1) first round of out_of_memory() sends SIGKILL to both p0 and p1,
    sets TIF_MEMDIE on p0, and sets MMF_OOM_KILLED on p0's mm.

(2) p0 releases its mm.

(3) oom_reap_task() does nothing on p1 because p0 released its mm.

(4) second round of out_of_memory() sends SIGKILL to p1 and sets
    TIF_MEMDIE on p1.

(5) oom_reap_task() is not called because p1's mm already has MMF_OOM_KILLED.

(6) oom_reap_task() won't clear TIF_MEMDIE on p1 even if p1 gets stuck
    on an unkillable lock.

If we get rid of MMF_OOM_KILLED, oom_reap_task() will be called upon the
next round of out_of_memory() unless that round hits the shortcuts.
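
As a side note, here is a minimal sketch (my paraphrase around the snippet
quoted above, not code from any posted patch) of how the MMF_OOM_KILLED
approach suppresses the wakeup in step (5):

        /*
         * Sketch only: the first round (p0) finds the bit clear on the
         * shared mm and wakes the reaper; the second round (p1) finds
         * the bit already set, so can_oom_reap is false and
         * wake_oom_reaper() is never called, even though p1 may be
         * stuck on an unkillable lock.
         */
        can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
        /* ... */
        if (can_oom_reap)
                wake_oom_reaper(victim);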

>
> And just to prevent from a confusion. I mean waking up also when
> fatal_signal_pending and we do not really go down to selecting an oom
> victim. Which would be worth a separate patch on top of course.

I couldn't understand this part. The shortcut

        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

is not used for !__GFP_FS && !__GFP_NOFAIL allocation requests. I think
we might still go down to selecting an OOM victim via out_of_memory()
calls made by not-yet-killed processes.
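
For reference, such requests never reach this shortcut because the page
allocator skips calling out_of_memory() for them and keeps looping
instead; roughly (my paraphrase of __alloc_pages_may_oom(), not verbatim):

        /*
         * Paraphrased sketch: requests without __GFP_FS and without
         * __GFP_NOFAIL bail out before out_of_memory() is ever called,
         * so the fatal_signal_pending() shortcut above cannot help them.
         */
        if (!(gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NOFAIL)) {
                *did_some_progress = 1; /* keep looping, no OOM killer */
                goto out;
        }
        if (out_of_memory(&oc))
                *did_some_progress = 1;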



Anyway, why do we want to let out_of_memory() try to choose p1 without
giving oom_reap_task() a chance to reap p1's mm?

Here is a pathological situation:

  There are 100 thread groups (p[0] to p[99]) sharing the same memory.
  They all have p->signal->oom_score_adj == OOM_SCORE_ADJ_MAX.
  Each p[n+1] is waiting for p[n] to call mutex_unlock() in the

      mutex_lock(&lock);
      ptr = kmalloc(sizeof(*ptr), GFP_NOFS);
      if (ptr)
          do_something(ptr);
      mutex_unlock(&lock);

  sequence, and p[0] is doing kmalloc() before mutex_unlock().

  First round of out_of_memory() selects p[0].
  p[0..99] get SIGKILL and p[0] gets TIF_MEMDIE.
  p[0] releases the unkillable lock, terminates, and sets p[0]->mm = NULL.
  oom_reap_task() does nothing because p[0]->mm == NULL.
  Next round of out_of_memory() selects p[1].
  p[1..99] get SIGKILL and p[1] gets TIF_MEMDIE.
  p[1] releases the unkillable lock, terminates, and sets p[1]->mm = NULL.
  oom_reap_task() does nothing because p[1]->mm == NULL.
  Next round of out_of_memory() selects p[2].
  p[2..99] get SIGKILL and p[2] gets TIF_MEMDIE.
  p[2] releases the unkillable lock, terminates, and sets p[2]->mm = NULL.
  oom_reap_task() does nothing because p[2]->mm == NULL.
  (...snipped...)
  Next round of out_of_memory() selects p[99].
  p[99] gets SIGKILL and p[99] gets TIF_MEMDIE.
  p[99] releases the unkillable lock, terminates, and sets p[99]->mm = NULL.
  oom_reap_task() does nothing because p[99]->mm == NULL.
  p's memory is released, and out_of_memory() is no longer called if the
  OOM situation was resolved.

If we give p[0..99] a chance to terminate as soon as possible using
TIF_MEMDIE when the first round of out_of_memory() selects p[0], we can
save a lot of out_of_memory() warning messages. I think that setting
TIF_MEMDIE on 100 threads does not increase the possibility of depleting
the memory reserves, because TIF_MEMDIE helps only while that thread is
doing memory allocation (see the sketch below). If 100 threads were
concurrently doing __GFP_FS or __GFP_NOFAIL memory allocation, they would
get TIF_MEMDIE anyway. (We could reduce the possibility of depleting the
memory reserves further if __GFP_KILLABLE were available.) Also, by
queuing p[0..99] when p[0] is selected, oom_reap_task() will likely be
able to reap p's memory before 100 failed oom_reap_task() attempts
complete.
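
The sketch mentioned above: TIF_MEMDIE only matters inside the allocation
slowpath; roughly (my paraphrase of gfp_to_alloc_flags(), not verbatim):

        /*
         * Paraphrased sketch: a TIF_MEMDIE thread gains access to the
         * memory reserves via ALLOC_NO_WATERMARKS only when it actually
         * enters the allocator, so marking threads that are merely
         * sleeping or waiting on a lock does not by itself consume any
         * reserves.
         */
        if (!in_interrupt() &&
            ((current->flags & PF_MEMALLOC) ||
             unlikely(test_thread_flag(TIF_MEMDIE))))
                alloc_flags |= ALLOC_NO_WATERMARKS;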

Therefore, I think we can do something like the patch below (a partial
revert of "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" with sysctl_oom_kill_allocating_task / oom_task_origin() /
SysRq-f fixes included).

----------
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..03e6257 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8fb187f..99b60ca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -786,6 +786,18 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	/*
+	 * OOM-killer should ignore this process when selecting candidates
+	 * because this process was already OOM-killed. OOM-killer can wait
+	 * for existing TIF_MEMDIE threads unless SysRq-f is requested.
+	 */
+	bool oom_killed;
+	/*
+	 * OOM-killer should not wait for TIF_MEMDIE thread when selecting
+	 * candidates because OOM-reaper already reaped memory used by this
+	 * process.
+	 */
+	bool oom_reap_done;
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195..953d1a1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 31bbdc6..3261d4a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -149,6 +149,12 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
+#ifdef CONFIG_MMU
+	/* See __oom_reap_task(). */
+	if (p->signal->oom_reap_done)
+		return true;
+#endif
+
 	return false;
 }
 
@@ -167,7 +173,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	long points;
 	long adj;
 
-	if (oom_unkillable_task(p, memcg, nodemask))
+	if (oom_unkillable_task(p, memcg, nodemask) || p->signal->oom_killed)
 		return 0;
 
 	p = find_lock_task_mm(p);
@@ -487,14 +493,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * This task shouldn't be sitting on a reasonably reclaimable memory
+	 * anymore. OOM killer should consider selecting other victim if
+	 * unmapping hasn't led to any improvements.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	tsk->signal->oom_reap_done = true;
 out:
 	mmput(mm);
 	return ret;
@@ -521,8 +524,6 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -598,10 +599,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -662,6 +662,142 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
+#ifdef CONFIG_MMU
+/*
+ * We cannot use oom_reaper if the memory is shared by OOM-unkillable process
+ * because OOM-unkillable process wouldn't get killed and so the memory might
+ * be still used.
+ */
+static bool oom_can_reap_mm(struct mm_struct *mm)
+{
+	bool can_oom_reap = true;
+	struct task_struct *p;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			can_oom_reap = false;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return can_oom_reap;
+}
+#else
+static bool oom_can_reap_mm(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
+/*
+ * Kill all OOM-killable user threads sharing this victim->mm.
+ *
+ * This mitigates mm->mmap_sem livelock caused by one thread being
+ * unable to release its mm due to being blocked at
+ * down_read(&mm->mmap_sem) in exit_mm() while another thread is doing
+ * __GFP_FS allocation with mm->mmap_sem held for write.
+ *
+ * They all get access to memory reserves.
+ *
+ * This prevents the OOM killer from choosing next OOM victim as soon
+ * as current victim thread released its mm. This also mitigates kernel
+ * log buffer being spammed by OOM killer messages due to choosing next
+ * OOM victim thread sharing the current OOM victim's memory.
+ *
+ * This mitigates the problem that a thread doing __GFP_FS allocation
+ * with mmap_sem held for write cannot call out_of_memory() for
+ * unpredictable duration due to oom_lock contention and/or scheduling
+ * priority, for the OOM reaper will not wait forever until such thread
+ * leaves memory allocating loop by calling out_of_memory(). This also
+ * mitigates the problem that a thread doing !__GFP_FS && !__GFP_NOFAIL
+ * allocation cannot leave memory allocating loop because it cannot
+ * call out_of_memory() even after it is killed.
+ *
+ * They all get excluded from OOM victim selection.
+ *
+ * This mitigates the problem that SysRq-f continues choosing the same
+ * process. We still need SysRq-f because it is possible that victim
+ * threads are blocked at unkillable locks inside memory allocating
+ * loop (e.g. fs writeback from direct reclaim) even after they got
+ * SIGKILL and TIF_MEMDIE.
+ */
+static void oom_kill_mm_users(struct mm_struct *mm)
+{
+	const bool can_oom_reap = oom_can_reap_mm(mm);
+	struct task_struct *p;
+	struct task_struct *t;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+		/*
+		 * We should send SIGKILL before setting TIF_MEMDIE in order to
+		 * prevent the OOM victim from depleting the memory reserves
+		 * from the user space under its control.
+		 */
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		p->signal->oom_killed = true;
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
+		if (can_oom_reap)
+			wake_oom_reaper(p);
+	}
+	rcu_read_unlock();
+}
+
+/*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+static struct task_struct *oom_scan_children(struct task_struct *victim,
+					     struct oom_control *oc,
+					     unsigned long totalpages,
+					     struct mem_cgroup *memcg)
+{
+	struct task_struct *p = victim;
+	struct mm_struct *mm = p->mm;
+	struct task_struct *t;
+	struct task_struct *child;
+	unsigned int victim_points = 0;
+
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child, memcg, oc->nodemask,
+						   totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+	return victim;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -671,13 +807,16 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		      struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	const bool kill_current = !p;
+
+	if (kill_current) {
+		p = current;
+		victim = p;
+		get_task_struct(p);
+	}
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -698,33 +837,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, oc->nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
+	if (!kill_current && !oom_task_origin(victim))
+		victim = oom_scan_children(victim, oc, totalpages, memcg);
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -739,54 +853,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
-	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
-	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
 		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
 	task_unlock(victim);
+	put_task_struct(victim);
 
-	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (same_thread_group(p, victim))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
-			continue;
-		}
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-	}
-	rcu_read_unlock();
-
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	oom_kill_mm_users(mm);
 
 	mmdrop(mm);
-	put_task_struct(victim);
 }
 #undef K
 
@@ -880,8 +957,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
+		oom_kill_process(oc, NULL, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
----------

If we can tolerate losing the process name and pid when reporting
success/failure (or we pass them via mm_struct, walk the process list, or
whatever else), I think we can do something like the patch below (mostly
a revert of "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space").

----------
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 944b2b3..a52ada1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -509,6 +509,14 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+#ifdef CONFIG_MMU
+	/*
+	 * If this field refers self, OOM-killer should not wait for TIF_MEMDIE
+	 * thread using this mm_struct when selecting candidates, for
+	 * OOM-reaper already reaped memory used by this mm_struct.
+	 */
+	struct mm_struct *oom_reaper_list;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f30cae..d13163a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,12 +792,6 @@ struct signal_struct {
 	 * for existing TIF_MEMDIE threads unless SysRq-f is requested.
 	 */
 	bool oom_killed;
-	/*
-	 * OOM-killer should not wait for TIF_MEMDIE thread when selecting
-	 * candidates because OOM-reaper already reaped memory used by this
-	 * process.
-	 */
-	bool oom_reap_done;
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
@@ -1861,9 +1855,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3261d4a..2199c71 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -150,9 +150,15 @@ static bool oom_unkillable_task(struct task_struct *p,
 		return true;
 
 #ifdef CONFIG_MMU
-	/* See __oom_reap_task(). */
-	if (p->signal->oom_reap_done)
-		return true;
+	/* p's memory might be already reclaimed by the OOM reaper. */
+	p = find_lock_task_mm(p);
+	if (p) {
+		struct mm_struct *mm = p->mm;
+		const bool unkillable = (mm->oom_reaper_list == mm);
+
+		task_unlock(p);
+		return unkillable;
+	}
 #endif
 
 	return false;
@@ -425,37 +431,21 @@ bool oom_killer_disabled __read_mostly;
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static struct mm_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_vmas(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		return true;
-
-	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
-		task_unlock(p);
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
 		return true;
-	}
-
-	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -485,73 +475,75 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
-			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+	pr_info("oom_reaper: now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * This task shouldn't be sitting on a reasonably reclaimable memory
-	 * anymore. OOM killer should consider selecting other victim if
+	 * This mm no longer has reasonably reclaimable memory.
+	 * OOM killer should consider selecting other victim if
 	 * unmapping hasn't led to any improvements.
 	 */
-	tsk->signal->oom_reap_done = true;
+	mm->oom_reaper_list = mm;
 out:
 	mmput(mm);
 	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_vmas(struct mm_struct *mm)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vmas(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+		pr_info("oom_reaper: unable to reap memory\n");
 		debug_show_all_locks();
 	}
 
 	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+	mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct mm_struct *mm = NULL;
 
 		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
 		spin_lock(&oom_reaper_lock);
 		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+			mm = oom_reaper_list;
+			oom_reaper_list = mm->oom_reaper_list;
 		}
 		spin_unlock(&oom_reaper_lock);
 
-		if (tsk)
-			oom_reap_task(tsk);
+		if (mm)
+			oom_reap_vmas(mm);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct mm_struct *mm)
 {
-	if (!oom_reaper_th || tsk->oom_reaper_list)
+	if (!oom_reaper_th || mm->oom_reaper_list)
 		return;
 
-	get_task_struct(tsk);
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	mm->oom_reaper_list = oom_reaper_list;
+	oom_reaper_list = mm;
 	spin_unlock(&oom_reaper_lock);
 	wake_up(&oom_reaper_wait);
 }
@@ -568,7 +560,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct mm_struct *mm)
 {
 }
 #endif
@@ -662,37 +654,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#ifdef CONFIG_MMU
-/*
- * We cannot use oom_reaper if the memory is shared by OOM-unkillable process
- * because OOM-unkillable process wouldn't get killed and so the memory might
- * be still used.
- */
-static bool oom_can_reap_mm(struct mm_struct *mm)
-{
-	bool can_oom_reap = true;
-	struct task_struct *p;
-
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			can_oom_reap = false;
-			break;
-		}
-	}
-	rcu_read_unlock();
-	return can_oom_reap;
-}
-#else
-static bool oom_can_reap_mm(struct mm_struct *mm)
-{
-	return false;
-}
-#endif
-
 /*
  * Kill all OOM-killable user threads sharing this victim->mm.
  *
@@ -727,7 +688,7 @@ static bool oom_can_reap_mm(struct mm_struct *mm)
  */
 static void oom_kill_mm_users(struct mm_struct *mm)
 {
-	const bool can_oom_reap = oom_can_reap_mm(mm);
+	bool can_oom_reap = true;
 	struct task_struct *p;
 	struct task_struct *t;
 
@@ -736,8 +697,16 @@ static void oom_kill_mm_users(struct mm_struct *mm)
 		if (!process_shares_mm(p, mm))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper if the memory is shared by
+			 * OOM-unkillable process because OOM-unkillable
+			 * process wouldn't get killed and so the memory might
+			 * be still used.
+			 */
+			can_oom_reap = false;
 			continue;
+		}
 		/*
 		 * We should send SIGKILL before setting TIF_MEMDIE in order to
 		 * prevent the OOM victim from depleting the memory reserves
@@ -751,10 +720,10 @@ static void oom_kill_mm_users(struct mm_struct *mm)
 				mark_oom_victim(t);
 			task_unlock(t);
 		}
-		if (can_oom_reap)
-			wake_oom_reaper(p);
 	}
 	rcu_read_unlock();
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
 }
 
 /*
----------

Well, I wanted to split oom_kill_process() into several small functions
before we started applying the OOM reaper patches...
