On Wed, Mar 19, 2025 at 03:00:52PM +0100, Oleg Nesterov wrote: > On 03/18, Christian Brauner wrote: > > > > @@ -746,8 +751,23 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > > * sub-thread or delay_group_leader(), wake up the > > * PIDFD_THREAD waiters. > > */ > > - if (!thread_group_empty(tsk)) > > - do_notify_pidfd(tsk); > > + if (!thread_group_empty(tsk)) { > > + if (delay_group_leader(tsk)) { > > + struct pid *pid; > > + > > + /* > > + * This is a thread-group leader exiting before > > + * all of its subthreads have exited allow pidfd > > + * polling to detect this case and delay exit > > + * notification until the last thread has > > + * exited. > > + */ > > + pid = task_pid(tsk); > > + WRITE_ONCE(pid->delayed_leader, 1); > > This is racy, tsk->exit_state is already set so pidfd_poll() can see > task->exit_state && !pid->delayed_leader. You're right. I had not considered that. > But this is minor. I can't understand all these complications, > probably because I barely slept tonight ;) I will re-read this patch > again tomorrow, but could you explain why we can't simply use the > trivial patch below? Sure, if that works I'm more than happy if we run with this. > > Oleg. > --- > > diff --git a/fs/pidfs.c b/fs/pidfs.c > index d980f779c213..8a95920aed98 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -210,7 +210,6 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) > static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > { > struct pid *pid = pidfd_pid(file); > - bool thread = file->f_flags & PIDFD_THREAD; > struct task_struct *task; > __poll_t poll_flags = 0; > > @@ -223,7 +222,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > task = pid_task(pid, PIDTYPE_PID); > if (!task) > poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP; > - else if (task->exit_state && (thread || thread_group_empty(task))) > + else if (task->exit_state && !delay_group_leader(task)) > poll_flags = EPOLLIN | EPOLLRDNORM; > > return poll_flags; > diff --git a/kernel/exit.c b/kernel/exit.c > index 9916305e34d3..356ca41d313b 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -746,7 +746,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > * sub-thread or delay_group_leader(), wake up the > * PIDFD_THREAD waiters. > */ > - if (!thread_group_empty(tsk)) > + if (!delay_group_leader(tsk)) > do_notify_pidfd(tsk); Two cases we need to handle: (1) thread-group leader exits prematurely and none of the subthreads ever exec. Once the last thread exits it'll notify the thread-specific and non-thread specific thread-group leader pidfd pollers from release_task(). (2) thread-group leader exits prematurely but one of the subthreads later execs. In this case we don't want any exit notification to be generated for thread-specific thread-group leaders. I was concerned that handling (2) would be more complex but it passes all the new tests so I won't complain about less code needed. ;) Do you want me to just dump your draft and slap a Co-Developed-by on it? Another idea I had that I would welcome your thoughts on: When a task execs we could indicate this by generating a POLLPRI event on the pidfd. If we wanted to be fine-grained we could generate POLLPRI | POLLRDUP if a subthread execs. The latter would give userspace a reliable way to detect this case and figure out that tasks changed TIDs.