Re: [PATCH RFC v2 1/3] pidfs: improve multi-threaded exec and premature thread-group leader exit polling

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 19 Mar 2025 16:39:50 +0100

On Wed, Mar 19, 2025 at 03:00:52PM +0100, Oleg Nesterov wrote:
> On 03/18, Christian Brauner wrote:
> >
> > @@ -746,8 +751,23 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >  	 * sub-thread or delay_group_leader(), wake up the
> >  	 * PIDFD_THREAD waiters.
> >  	 */
> > -	if (!thread_group_empty(tsk))
> > -		do_notify_pidfd(tsk);
> > +	if (!thread_group_empty(tsk)) {
> > +		if (delay_group_leader(tsk)) {
> > +			struct pid *pid;
> > +
> > +			/*
> > +			 * This is a thread-group leader exiting before
> > +			 * all of its subthreads have exited allow pidfd
> > +			 * polling to detect this case and delay exit
> > +			 * notification until the last thread has
> > +			 * exited.
> > +			 */
> > +			pid = task_pid(tsk);
> > +			WRITE_ONCE(pid->delayed_leader, 1);
> 
> This is racy, tsk->exit_state is already set so pidfd_poll() can see
> task->exit_state && !pid->delayed_leader.

You're right. I had not considered that.

> But this is minor. I can't understand all these complications,
> probably because I barely slept tonight ;) I will re-read this patch
> again tomorrow, but could you explain why we can't simply use the
> trivial patch below?

Sure, if that works I'm more than happy if we run with this.

> 
> Oleg.
> ---
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index d980f779c213..8a95920aed98 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -210,7 +210,6 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
>  static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  {
>  	struct pid *pid = pidfd_pid(file);
> -	bool thread = file->f_flags & PIDFD_THREAD;
>  	struct task_struct *task;
>  	__poll_t poll_flags = 0;
>  
> @@ -223,7 +222,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
>  	task = pid_task(pid, PIDTYPE_PID);
>  	if (!task)
>  		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
> -	else if (task->exit_state && (thread || thread_group_empty(task)))
> +	else if (task->exit_state && !delay_group_leader(task))
>  		poll_flags = EPOLLIN | EPOLLRDNORM;
>  
>  	return poll_flags;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9916305e34d3..356ca41d313b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -746,7 +746,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>  	 * sub-thread or delay_group_leader(), wake up the
>  	 * PIDFD_THREAD waiters.
>  	 */
> -	if (!thread_group_empty(tsk))
> +	if (!delay_group_leader(tsk))
>  		do_notify_pidfd(tsk);

Two cases we need to handle:

(1) thread-group leader exits prematurely and none of the subthreads
    ever exec. Once the last thread exits it'll notify the
    thread-specific and non-thread specific thread-group leader pidfd
    pollers from release_task().

(2) thread-group leader exits prematurely but one of the subthreads
    later execs. In this case we don't want any exit notification to
    be generated for thread-specific thread-group leaders.

I was concerned that handling (2) would be more complex but it passes
all the new tests so I won't complain about less code needed. ;)

Do you want me to just dump your draft and slap a Co-Developed-by on it?

Another idea I had that I would welcome your thoughts on:

When a task execs we could indicate this by generating a POLLPRI event
on the pidfd. If we wanted to be fine-grained we could generate
POLLPRI | POLLRDUP if a subthread execs. The latter would give userspace
a reliable way to detect this case and figure out that tasks changed
TIDs.