On Sun, Mar 02, 2025 at 04:53:46PM +0100, Oleg Nesterov wrote: > On 02/28, Christian Brauner wrote: > > > > Some tools like systemd's jounral need to retrieve the exit and cgroup > > information after a process has already been reaped. > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > But unless I am totally confused do_exit() calls pidfd_exit() even > before exit_notify(), the exiting task is not even zombie yet. It > will reaped only when it passes exit_notify() and its parent does > wait(). The overall goal is that it's possible to retrieve exit status and cgroupid even if the task has already been reaped. It's intentionally placed before exit_notify(), i.e., before the task is a zombie because exit_notify() wakes pidfd-pollers. Ideally, pidfd pollers would be woken and then could use the PIDFD_GET_INFO ioctl to retrieve the exit status. It would however be fine to place it into exit_notify() if it's a better fit there. If you have a preference let me know. I don't see a reason why seeing the exit status before that would be an issue. The only downside would be that some other task that just keeps ioctl()ing in a loop would possible see the exit status before the parent does. But I didn't think this would be a big issue. > And what about the multi-threaded case? Suppose the main thread > does sys_exit(0) and it has alive sub-threads. > > In this case pidfd_info() will report kinfo.exit_code = 0. > And this is probably fine if (file->f_flags & PIDFD_THREAD) != 0. Yes. > But what if this file was created without PIDFD_THREAD? If another > thread does exit_group(1) after that, the process's exit code is > 1 << 8, but it can't be retrieved. Yes, I had raised that in an off-list discussion about this as well and was unsure what the cleanest way of dealing with this would be. My initial approach had been to not just have: struct pidfs_exit_info { __u64 cgroupid; __s32 exit_code; }; but to have: struct pidfs_exit_info { __u64 cgroupid; __s32 exit_code; __u64 tg_cgroupid; __s32 tg_exit_code; }; so that it would be possible to retrieve either depending on the type of pidfd. Is that feasible? > Finally, sys_execve(). Suppose we have a main thread L and a > sub-thread T. > > T execs and kill the leader L. L exits and populates > pidfs_i(inode)->exit_info. > > T calls exchange_tids() in de_thread() and becomes the new leader > with the same (old) pid. > > Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL. Yes, de_thread() is a good point. That nasty wrinkly I had really ignored^wforgotten. We should not report an exit status in this case. I think that's what you're agreeing with as well? > Or I am totally confused? No, you're right! What's the best way of handling the de_thread() case? Would moving this into exit_notify() be enough where we also handle PIDFD_THREAD/~PIDFD_THREAD waking? Thanks for the review! > > > > > + exit_info = READ_ONCE(pidfs_i(inode)->exit_info); > > + if (exit_info) { > > + /* > > + * TODO: Oleg, I didn't see a reason for putting > > + * retrieval of the exit status of a task behind some > > + * form of permission check. > > Neither me. > > Oleg. >