[PATCH 1/2] signal: Move stopping for the coredump from do_exit into get_signal

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Wed, 20 Jul 2022 11:50:48 -0500

Stopping to participate in a coredump from a kernel oops makes no
sense and is actively dangerous because the kernel is known to be
broken.  Considering to stop in a coredump from a kernel thread exit
is silly because userspace coredumps are not generated from kernel
threads.  Not stopping for a coredump in exit(2) and exit_group(2) and
related userspace exits that call do_exit or do_group_exit directly is
the current behavior of the code as the PF_SIGNALED test in
coredump_task_exit attests.

Since only tasks that pass through get_signal and set PF_SIGNALED can
join coredumps move stopping for coredumps into get_signal, where the
PF_SIGNALED test is unnecessary.  This avoids even the potential of
stopping for coredumps in the silly or dangerous places.

This can be seen to be safe by examining the few places that call do_exit:

- get_signal calling do_group_exit
  Called by get_signal to terminate the userspace process.  As stopping
  for the coredump happens now happens in get_signal the code will
  continue to participate in the coredump.

- exit_group(2) calling do_group_exit

  If a thread calls exit_group(2) while another thread in the same process
  is performing a coredump there is a race.  The thread that wins the
  race will take the lock and set SIGNAL_GROUP_EXIT.  If it is the
  thread that called do_group_exit then zap_threads will return -EAGAIN
  and no coredump will be generated.  If it is the thread that is
  coredumping that wins the race, the task that called do_group_exit
  will exit gracefully with an error code before the coredump begins.

  Having a single thread exit just before the coredump starts is not
  ideal as the semantics make no sense. (Did the group exit happen
  before the coredump or did the coredump happen before the group
  exit?).

  Eventually I intend for group exits to flow through get_signal and
  this silliness will no longer be possible.  Until then the current
  behavior when this race occurs is maintained.

- io_uring
  Called after get_signal returns to terminate the I/O worker thread
  (essentially a userspace thread that only runs kernel code) so that
  additional cleanup code can be run before do_exit.  As get_signal is
  called the prior to do_exit code will continue to participate in the
  coredump.

- make_task_dead
  Called on an unhandled kernel or hardware failure.  As the failure
  is unhandled any extra work has the potential to make the failure worse
  so being part of a coredump is not appropriate.

- kthread_exit
  Called to terminate a kernel thread as such coredumps do not exist.

- call_usermodehelper_exec_async
  Called to terminate a kernel thread if kerenel_execve fails, as it is a
  kernel thread coredumps do not exist.

- reboot, seeccomp
  For these calls of do_exit() they are semantically direct calls of
  exit(2) today.  As do_exit() does not synchronize with siglock there
  is no logical race between a coredump killing the thread and these
  threads exiting.  These threads logically exit before the coredump
  happens.  This is also the current behavior so there is nothing to
  be concerned about with respect to userspsace semantics or
  regresssions.

Moving the coredump stop for userspace threads that did not dequeue
the coredumping signal from from do_exit into get_signal in general is
safe, because the coredump in the single threaded case completely
happens in get_signal.  The code movement ensures that a
multi-threaded coredump will not have any issues because the
additional threads stop after some amount of cleanup has been done.

The coredump code is robust to all kinds of userspace changes
happening in parallel as multiple processes can share a mm.  This
makes the it safe to perform the coredump before the io_uring cleanup
happens as io_uring can't do anything another process sharing the mm
would not be doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
---
 fs/coredump.c            | 25 ++++++++++++++++++++++++-
 include/linux/coredump.h |  2 ++
 kernel/exit.c            | 29 +++++------------------------
 kernel/signal.c          |  5 +++++
 mm/oom_kill.c            |  2 +-
 5 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index b836948c9543..67dda77c500f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -389,6 +389,29 @@ static int zap_threads(struct task_struct *tsk,
 	return nr;
 }
 
+void coredump_join(struct core_state *core_state)
+{
+	/* Stop and join the in-progress coredump */
+	struct core_thread self;
+
+	self.task = current;
+	self.next = xchg(&core_state->dumper.next, &self);
+	/*
+	 * Implies mb(), the result of xchg() must be visible
+	 * to core_state->dumper.
+	 */
+	if (atomic_dec_and_test(&core_state->nr_threads))
+		complete(&core_state->startup);
+
+	for (;;) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!self.task) /* see coredump_finish() */
+			break;
+		freezable_schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+}
+
 static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
@@ -436,7 +459,7 @@ static void coredump_finish(bool core_dumped)
 		next = curr->next;
 		task = curr->task;
 		/*
-		 * see coredump_task_exit(), curr->task must not see
+		 * see coredump_join(), curr->task must not see
 		 * ->task == NULL before we read ->next.
 		 */
 		smp_mb();
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 08a1d3e7e46d..815d6099b757 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -40,8 +40,10 @@ extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
 extern int dump_align(struct coredump_params *cprm, int align);
 int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		    unsigned long len);
+extern void coredump_join(struct core_state *core_state);
 extern void do_coredump(const kernel_siginfo_t *siginfo);
 #else
+extern inline void coredump_join(struct core_state *core_state) {}
 static inline void do_coredump(const kernel_siginfo_t *siginfo) {}
 #endif
 
diff --git a/kernel/exit.c b/kernel/exit.c
index d8ecbaa514f7..2218ca02ac71 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -352,35 +352,16 @@ static void coredump_task_exit(struct task_struct *tsk)
 	 * We must hold siglock around checking core_state
 	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
 	 * will increment ->nr_threads for each thread in the
-	 * group without PF_POSTCOREDUMP set.
+	 * group without PF_POSTCOREDUMP set.  Decrement ->nr_threads
+	 * and possibly complete core_state->startup to politely skip
+	 * participating in any pending coredumps.
 	 */
 	spin_lock_irq(&tsk->sighand->siglock);
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
+	if (core_state && atomic_dec_and_test(&core_state->nr_threads))
+		complete(&core_state->startup);
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (core_state) {
-		struct core_thread self;
-
-		self.task = current;
-		if (self.task->flags & PF_SIGNALED)
-			self.next = xchg(&core_state->dumper.next, &self);
-		else
-			self.task = NULL;
-		/*
-		 * Implies mb(), the result of xchg() must be visible
-		 * to core_state->dumper.
-		 */
-		if (atomic_dec_and_test(&core_state->nr_threads))
-			complete(&core_state->startup);
-
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!self.task) /* see coredump_finish() */
-				break;
-			freezable_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
-	}
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/kernel/signal.c b/kernel/signal.c
index 8a0f114d00e0..8595c935027e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2687,6 +2687,7 @@ bool get_signal(struct ksignal *ksig)
 	}
 
 	for (;;) {
+		struct core_state *core_state;
 		struct k_sigaction *ka;
 		enum pid_type type;
 
@@ -2820,6 +2821,7 @@ bool get_signal(struct ksignal *ksig)
 		}
 
 	fatal:
+		core_state = signal->core_state;
 		spin_unlock_irq(&sighand->siglock);
 		if (unlikely(cgroup_task_frozen(current)))
 			cgroup_leave_frozen(true);
@@ -2842,6 +2844,9 @@ bool get_signal(struct ksignal *ksig)
 			 * that value and ignore the one we pass it.
 			 */
 			do_coredump(&ksig->info);
+		} else if (core_state) {
+			/* Wait for the coredump to happen */
+			coredump_join(core_state);
 		}
 
 		/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c6cf9e3cd66..1bb689fd9f81 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -845,7 +845,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 
 	/*
 	 * A coredumping process may sleep for an extended period in
-	 * coredump_task_exit(), so the oom killer cannot assume that
+	 * get_signal(), so the oom killer cannot assume that
 	 * the process will promptly exit and release memory.
 	 */
 	if (sig->core_state)
-- 
2.35.3