Hi, Dave: Thanks for your comments. You have indeed missed some of the prior reviews and discussions. But that is OK. Please see my replies inline. On 11/28/18 7:19 AM, Dave Martin wrote: > On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote: >> [Repost as a series, as suggested by Andrew Morton] >> >> For simplicity and consistency, this patch provides an implementation >> for signal-based fault notification prior to the coredump of a child >> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can >> be used by an application to express its interest and to specify the >> signal for such a notification. >> >> Changes to prctl(2): >> >> PR_SET_PREDUMP_SIG (since Linux 4.20.x) >> Set the child pre-coredump signal of the calling process to >> arg2 (either a signal value in the range 1..maxsig, or 0 to >> clear). This is the signal that the calling process will get >> prior to the coredump of a child process. This value is >> cleared across execve(2), or for the child of a fork(2). >> >> PR_GET_PREDUMP_SIG (since Linux 4.20.x) >> Return the current value of the child pre-coredump signal, >> in the location pointed to by (int *) arg2. >> >> Background: >> >> As the coredump of a process may take time, in certain time-sensitive >> applications it is necessary for a parent process (e.g., a process >> manager) to be notified of a child's imminent death before the coredump >> so that the parent process can act sooner, such as re-spawning an >> application process, or initiating a control-plane fail-over. >> >> One application is BFD. The early fault notification is a critical >> component for maintaining BFD sessions (with a timeout value of >> 50 msec or 100 msec) across a control-plane failure. >> >> Currently there are two ways for a parent process to be notified of a >> child process's state change. One is to use the POSIX signal, and >> another is to use the kernel connector module. The specific events and >> actions are summarized as follows: >> >> Process Event POSIX Signal Connector-based >> ---------------------------------------------------------------------- >> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector() >> SIGCHLD / CLD_STOPPED >> >> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector() >> SIGCHLD / CLD_CONTINUED >> >> pre_coredump/ N/A proc_coredump_connector() >> get_signal() >> >> post_coredump/ do_notify_parent() proc_exit_connector() >> do_exit() SIGCHLD / exit_signal >> ---------------------------------------------------------------------- >> >> As shown in the table, the signal-based pre-coredump notification is not >> currently available. In some cases using a connector-based notification >> can be quite complicated (e.g., when a process manager is written in shell >> scripts and thus is subject to certain inherent limitations), and a >> signal-based notification would be simpler and better suited. > > Since this is a notification of a change of process status, would it be > more natural to send it through SIGCHLD? > > As with other supplementary child status events, a flag could be added > for wait and sigaction.sa_flags to indicate whether the parent wants > this event to be reported or not. > > Then a suitable CLD_XXX could be defined for this, and we could > piggyback on PR_{SET,GET}_PDEATHSIG rather than having to have something > new. > > (I hadn't been watching this thread closely, so apologies if this has > been discussed already.) Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it was removed after discussion: v3 --> v4: Addressed review comments from Oleg Nesterov, and Eric W. Biederman, including: o remove the definition CLD_PREDUMP. --- You can look at the discussions in the email thread, in particular several issues pointed out by Eric Biederman, and my reply to Eric. There are two models 1:1 (one process manager with one child process), and 1:N (one process manager with many child processes). A legacy signal like SIGCHLD would not work in the 1:N model due to compression/loss of legacy signals. One need to use a RT signal in that case. One more point in my reply: When an application chooses a signal for pre-coredump notification, it is much simpler and robust for the signal to be dedicated for that purpose (in the parent) and not be mixed with other semantics. The "signo + pid" should be sufficient for the parent process in both 1:1 and 1:N models. > >> >> Signed-off-by: Enke Chen <enkechen@xxxxxxxxx> >> Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx> >> --- >> v4 -> v5: >> Addressed review comments from Oleg Nesterov: >> o use rcu_read_lock instead. >> o revert back to notify the real_parent. >> >> fs/coredump.c | 23 +++++++++++++++++++++++ >> fs/exec.c | 3 +++ >> include/linux/sched/signal.h | 3 +++ >> include/uapi/linux/prctl.h | 4 ++++ >> kernel/sys.c | 13 +++++++++++++ >> 5 files changed, 46 insertions(+) >> >> diff --git a/fs/coredump.c b/fs/coredump.c >> index e42e17e..740b1bb 100644 >> --- a/fs/coredump.c >> +++ b/fs/coredump.c >> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new) >> return err; >> } >> >> +/* >> + * While do_notify_parent() notifies the parent of a child's death post >> + * its coredump, this function lets the parent (if so desired) know about >> + * the imminent death of a child just prior to its coredump. >> + */ >> +static void do_notify_parent_predump(void) >> +{ >> + struct task_struct *parent; >> + int sig; >> + >> + rcu_read_lock(); >> + parent = rcu_dereference(current->real_parent); >> + sig = parent->signal->predump_signal; >> + if (sig != 0) >> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID); > > Doesn't this send si_code == SI_USER. That seems wrong: the receiving > process wouldn't not be able to distinguish a real pre-coredump > notification from a bogus one sent by kill(2) et The receiving process (i.e., the process manager) knows the PID of all its child processes. Thus it can easily tell whether the signal is from a child or not. > > SEND_SIG_PRIV also looks wrong, because it assumes that the sender is > "the kernel" so there is no si_pid. That is why SEND_SIG_NOINFO is chosen here for carrying the "pid". What matters to the parent process is the "signo + pid" for identifying the child process for the pre-coredump notification. > > This may be another reason for building on top of SIGCHLD which already > has the required (but weird) "si_pid == affected process" semantics, > rather than si_pid indicating the sender. > >> + rcu_read_unlock(); >> +} >> + >> void do_coredump(const kernel_siginfo_t *siginfo) >> { >> struct core_state core_state; >> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo) >> if (retval < 0) >> goto fail_creds; >> >> + /* >> + * Send the pre-coredump signal to the parent if requested. >> + */ >> + do_notify_parent_predump(); >> + >> old_cred = override_creds(cred); >> >> ispipe = format_corename(&cn, &cprm); >> diff --git a/fs/exec.c b/fs/exec.c >> index fc281b7..7714da7 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk) >> /* we have changed execution domain */ >> tsk->exit_signal = SIGCHLD; >> >> + /* Clear the pre-coredump signal before loading a new binary */ >> + sig->predump_signal = 0; >> + >> #ifdef CONFIG_POSIX_TIMERS >> exit_itimers(sig); >> flush_itimer_signals(); >> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h >> index 13789d1..728ef68 100644 >> --- a/include/linux/sched/signal.h >> +++ b/include/linux/sched/signal.h >> @@ -112,6 +112,9 @@ struct signal_struct { >> int group_stop_count; >> unsigned int flags; /* see SIGNAL_* flags below */ >> >> + /* The signal sent prior to a child's coredump */ >> + int predump_signal; >> + >> /* >> * PR_SET_CHILD_SUBREAPER marks a process, like a service >> * manager, to re-parent orphan (double-forking) child processes >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index c0d7ea0..79f0a8a 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -219,4 +219,8 @@ struct prctl_mm_map { >> # define PR_SPEC_DISABLE (1UL << 2) >> # define PR_SPEC_FORCE_DISABLE (1UL << 3) >> >> +/* Whether to receive signal prior to child's coredump */ >> +#define PR_SET_PREDUMP_SIG 54 >> +#define PR_GET_PREDUMP_SIG 55 >> + >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/kernel/sys.c b/kernel/sys.c >> index 123bd73..39aa3b8 100644 >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, >> return -EINVAL; >> error = arch_prctl_spec_ctrl_set(me, arg2, arg3); >> break; >> + case PR_SET_PREDUMP_SIG: >> + if (arg3 || arg4 || arg5) > > glibc has > > int prctl(int option, ...); > > Some prctls() police extra arguments for zeros, but this means that > the userspace caller also has to supply pointless 0 arguments. > > It's debatable which is the preferred approach. Did you have any > particular rationale for your choice here? > The initial version did not check the values of these unused arguments. But Jann Horn pointed out the new convention is to enforce the 0 values so I followed ... Thanks. -- Enke