Re: [PATCH RFC 06/10] pidfs: allow to retrieve exit information

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 3 Mar 2025 12:32:24 +0100

On Mon, Mar 03, 2025 at 10:06:31AM +0100, Lennart Poettering wrote:
> On Sun, 02.03.25 21:24, Oleg Nesterov (oleg@xxxxxxxxxx) wrote:
> 
> > This will fix the problem with mt-exec, but this won't help to discriminate
> > the leader-exit and the-whole-group-exit cases...
> >
> > With this (or something like this) change pidfd_info() can only report
> > the exit code of the already reaped thread/process, leader or not.

Yes, that's fine. I don't think we need to report exit status
information right after the task has exited. It's fine to only provide
it once it has been reaped and it makes things simpler afaict.

Pidfd polling allows waiting on either task exit or for a task to have
been reaped. So the contract for PIDFD_INFO_EXIT is simply that EPOLLHUP
must be observed before exit information can be retrieved.

This aligns with wait() as well, where reaping of a thread-group leader
that exited before the thread-group was empty is delayed until the
thread-group is empty.

I think that with PIDFD_INFO_EXIT autoreaping might actually become
usable because it means a parent can ignore SIGCHLD or set SA_NOCLDWAIT
and simply use pidfd polling and PIDFD_INFO_EXIT to get status
information from its children. But the kernel will autocleanup right
away instead of delaying. If it's a subreaper there's probably some
wrinkle with grand-children that get reparented to it? But for the
non-subreaper case it should be very useful.
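
A rough userspace sketch of the autoreap case described above, under some
assumptions: the parent ignores SIGCHLD, opens a pidfd for the child and
polls it. The helper name poll_autoreaped_child() and the pipe used as a
start gate are made up for illustration. The PIDFD_INFO_EXIT query itself is
left out on purpose because its interface is what this series is still
discussing. POLLIN is reported once the child has exited; whether POLLHUP is
also set once it has been reaped depends on the kernel version.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child while SIGCHLD is ignored (so the
 * kernel autoreaps it), poll a pidfd for it, and return the poll()
 * revents, or -1 on any failure.
 */
int poll_autoreaped_child(void)
{
	struct sigaction ign = { .sa_handler = SIG_IGN }, old;
	struct pollfd pfd;
	int gate[2], pidfd, ret = -1;
	pid_t pid;

	if (pipe(gate) < 0)
		return -1;
	if (sigaction(SIGCHLD, &ign, &old) < 0) {
		close(gate[0]);
		close(gate[1]);
		return -1;
	}

	pid = fork();
	if (pid == 0) {
		char c;

		/* Child: wait until the parent holds a pidfd, then exit. */
		close(gate[1]);
		if (read(gate[0], &c, 1) < 0)
			_exit(1);
		_exit(0);
	}
	close(gate[0]);

	if (pid > 0) {
		/* Open the pidfd while the child is still guaranteed alive. */
		pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
		close(gate[1]);		/* EOF on the pipe releases the child */
		if (pidfd >= 0) {
			pfd.fd = pidfd;
			pfd.events = POLLIN;
			pfd.revents = 0;
			/* POLLHUP needs no request; it is reported if set. */
			if (poll(&pfd, 1, 5000) == 1)
				ret = pfd.revents;
			close(pidfd);
		}
	} else {
		close(gate[1]);
	}

	/* Restore the caller's SIGCHLD disposition. */
	sigaction(SIGCHLD, &old, NULL);
	return ret;
}
```

A caller would check the returned mask for POLLIN (task exited) and, on
kernels that report it, POLLHUP (task reaped) before asking for exit
information.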

> > I mean... If the leader L exits using sys_exit() and it has the live sub-
> > threads, release_task(L) / __unhash_process(L) will be only called when
> > the last sub-thread exits and it (or debugger) does "goto repeat;" in
> > release_task() to finally reap the leader.
> >
> > IOW. If someone does sys_pidfd_create(group-leader-pid, PIDFD_THREAD),
> > pidfd_info() won't report PIDFD_INFO_EXIT if the leader has exited using
> > sys_exit() before other threads.
> >
> > But perhaps this is fine?
> 
> I think this is fine, but I'd really like a way how userspace can
> determine this state reliably. i.e. a zombie state where the exit
> status is not available yet is a bit strange by classic UNIX
> standards on some level, no?
> 
> But I guess that might not be a pidfd specific issue. i.e. I figure
> classic waitid() with WNOHANG failing on a zombie process that is set
> up like that is a bit weird too, no? Or how does that work there?
> (pretty sure some userspace might not be expecting that...)

Yes, as I read the code, WNOHANG exhibits the same behavior (so does WNOWAIT):

        if (exit_state == EXIT_ZOMBIE) {
                /* we don't reap group leaders with subthreads */
                if (!delay_group_leader(p)) {
                        /*
                         * A zombie ptracee is only visible to its ptracer.
                         * Notification and reaping will be cascaded to the
                         * real parent when the ptracer detaches.
                         */
                        if (unlikely(ptrace) || likely(!p->ptrace))
                                return wait_task_zombie(wo, p);
                }