Re: [PATCH v3 1/4] pidfs: improve multi-threaded exec and premature thread-group leader exit polling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 20, 2025 at 11:57:02AM +0100, Oleg Nesterov wrote:
> Christian,

Oleg!

> 
> All the comments look misleading (and overcomplicated) to me.
> 
> See below, but first lets recall the commit 64bef697d33b75fc06c5789
> ("pidfd: implement PIDFD_THREAD flag for pidfd_open()") which says
> 
>     pidfd: implement PIDFD_THREAD flag for pidfd_open()
> 
>     With this flag:
> 
>             ....
> 
>             - pidfd_poll() succeeds when the task exits and becomes a
>               zombie (iow, passes exit_notify()), even if it is a leader
>               and thread-group is not empty.
> 
> This patch simply reverts this behaviour, the exiting leader will not
> report the exit if it has sub-threads (alive or not). And afaics your
> V1 tried to do the same. And this eliminates the
> 
>               This means that the behaviour of pidfd_poll(PIDFD_THREAD,
>               pid-of-group-leader) is not well defined if it races with
>               exec() from its sub-thread; ...
> 
> problem mentioned in the changelog. That is all.
> 
> IOW, with this change PIDFD_THREAD has no effect.

But that's what I'm trying to say: This patch aligns the behavior of
thread-specific and non-thread-specific thread-group leader pidfds. IOW,
the behavior of:

pidfd_open(<thread-group-leader-pid>, 0)
pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

is the same wrt to polling after this patch. That's also what the
selftests are designed to test and show.

> 
> Except the pid_has_task() checks in sys_pidfd_open() paths, without
> PIDFD_THREAD the target task must be a group leader.
> 
> On 03/20, Christian Brauner wrote:
> >
> > @@ -218,12 +218,32 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> 
> Your forgot to remove the no longer used
> 
> 	bool thread = file->f_flags & PIDFD_THREAD;
> 
> above ;)

Thanks, fixed.

> 
> >  	/*
> >  	 * Depending on PIDFD_THREAD, inform pollers when the thread
> >  	 * or the whole thread-group exits.
> 
> See above (and below), this no longer depends on PIDFD_THREAD.
> 
> > +	else if (task->exit_state && !delay_group_leader(task))
> >  		poll_flags = EPOLLIN | EPOLLRDNORM;
> 
> So with this change:
> 
> If the exiting task is a sub-thread, report EPOLLIN as before.
> delay_group_leader() can't be true. In this case PIDFD_THREAD
> must be set.
> 
> If the exiting task is a leader, we do not care about PIDFD_THREAD.
> We report EPOLLIN only if it is the last/only thread.
> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 9916305e34d3..ce5cdad5ba9c 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -271,6 +271,9 @@ void release_task(struct task_struct *p)
> >  		 * If we were the last child thread and the leader has
> >  		 * exited already, and the leader's parent ignores SIGCHLD,
> >  		 * then we are the one who should release the leader.
> > +		 *
> > +		 * This will also wake PIDFD_THREAD pidfds for the
> > +		 * thread-group leader that already exited.
> >  		 */
> >  		zap_leader = do_notify_parent(leader, leader->exit_signal);
> 
> Again, this doesn't depend on PIDFD_THREAD.

The comment is literally just saying that we delayed notification of
PIDFD_THREAD pidfds for a thread-group leader until now. After all its
subthreads are released instead of when the thread-group leader did
actually exit earlier.

> 
> > @@ -743,10 +746,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >
> >  	tsk->exit_state = EXIT_ZOMBIE;
> >  	/*
> > -	 * sub-thread or delay_group_leader(), wake up the
> > -	 * PIDFD_THREAD waiters.
> > +	 * Wake up PIDFD_THREAD waiters if this is a proper subthread
> > +	 * exit. If this is a premature thread-group leader exit delay
> > +	 * the notification until the last subthread exits. If a
> > +	 * subthread should exec before then no notification will be
> > +	 * generated.
> >  	 */
> > -	if (!thread_group_empty(tsk))
> > +	if (!delay_group_leader(tsk))
> >  		do_notify_pidfd(tsk);
> 
> The same...

What you seem to be saying is that you want all references to
PIDFD_THREAD to be dropped in the comments because the behavior is now
identical. But what I would like to have is comments in the code that
illustrate where and how we guarantee this behavioral equivalency.

Because where the notifications happen does differ.

The delayed thread-group leader stuff is literally only apparent to
anyone who has stared and lived with these horrible behavioral warts of
early thread-group leader exit and subthread exec for a really long
time. For anyone else this isn't self-explanatory at all and each time
one has to go look at it it requires jumping around all the locations
where and how exit notifications are generated and piece together the
whole picture. It is laughably complex to follow.

So I'm wiping the comments but I very much disagree that they are
misleading/useless.

> 
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2180,8 +2180,10 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> >  	WARN_ON_ONCE(!tsk->ptrace &&
> >  	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> >  	/*
> > -	 * tsk is a group leader and has no threads, wake up the
> > -	 * non-PIDFD_THREAD waiters.
> > +	 * This is a thread-group leader without subthreads so wake up
> > +	 * the non-PIDFD_THREAD waiters. This also wakes the
> > +	 * PIDFD_THREAD waiters for the thread-group leader in case it
> > +	 * exited prematurely from release_task().
> >  	 */
> 
> This too.

This one I agree is misplaced. It would be sufficient to have the
comment in release_task().

Christian




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux