On Sun, Mar 02, 2025 at 06:21:49PM +0100, Oleg Nesterov wrote:
> On 03/02, Christian Brauner wrote:
> >
> > On Sun, Mar 02, 2025 at 04:53:46PM +0100, Oleg Nesterov wrote:
> > > On 02/28, Christian Brauner wrote:
> > > >
> > > > Some tools like systemd's journal need to retrieve the exit and cgroup
> > > > information after a process has already been reaped.
> > >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > But unless I am totally confused do_exit() calls pidfd_exit() even
> > > before exit_notify(), the exiting task is not even zombie yet. It
> > > will be reaped only when it passes exit_notify() and its parent does
> > > wait().
> >
> > The overall goal is that it's possible to retrieve exit status and
> > cgroupid even if the task has already been reaped.
>
> OK, please see below...
>
> > It's intentionally placed before exit_notify(), i.e., before the task is
> > a zombie because exit_notify() wakes pidfd-pollers. Ideally, pidfd
> > pollers would be woken and then could use the PIDFD_GET_INFO ioctl to
> > retrieve the exit status.
>
> This was more or less clear to me. But this doesn't match the "the task has
> already been reaped" goal above...
>
> > It would however be fine to place it into exit_notify() if it's a better
> > fit there. If you have a preference let me know.
> >
> > I don't see a reason why seeing the exit status before that would be an
> > issue.
>
> The problem is that it is not clear how can we do this correctly.
> Especially considering the problem with exec...
>
> > > But what if this file was created without PIDFD_THREAD? If another
> > > thread does exit_group(1) after that, the process's exit code is
> > > 1 << 8, but it can't be retrieved.
> >
> > Yes, I had raised that in an off-list discussion about this as well and
> > was unsure what the cleanest way of dealing with this would be.
>
> I am not sure too, but again, please see below.
>
> > > Now, T is very much alive, but pidfs_i(inode)->exit_info != NULL.
> > ...
>
> > What's the best way of handling the de_thread() case? Would moving this
> > into exit_notify() be enough where we also handle
> > PIDFD_THREAD/~PIDFD_THREAD waking?
>
> I don't think that moving pidfd_exit() into exit_notify() can solve any
> problem.
>
> But what if we move pidfd_exit() into release_task() paths? Called when
> the task is reaped by the parent/debugger, or if a sub-thread auto-reaps.
>
> Can the users of pidfd_info(PIDFD_INFO_EXIT) rely on POLLHUP from
> release_task() -> detach_pid() -> __change_pid(new => NULL) ?

Ok, so:

release_task()
-> __exit_signal()
   -> detach_pid()
      -> __change_pid()

That sounds good. So could we do something like:

diff --git a/kernel/exit.c b/kernel/exit.c
index cae475e7858c..66bb5c53454f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,8 +127,10 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
+	pidfs_exit(p); // record exit information for individual thread
 	if (group_dead) {
 		detach_pid(p, PIDTYPE_TGID);
+		pidfs_exit(p); // record exit information for thread-group leader
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);

I know, as written this won't work but I'm just trying to get the idea
across of recording exit information for both the individual thread and
the thread-group leader in __unhash_process().

That should tackle both problems, i.e., recording exit information for
both thread and thread-group leader as well as exec?
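
As an aside, for anyone following along: the "1 << 8" quoted above is the
raw wait status a parent sees for an exit code of 1. A minimal userspace
sketch of that encoding is below. It is not kernel code and not part of
any patch in this thread; the helper name raw_wait_status() is made up
for illustration, and it only shows what the parent gets when it reaps
the child, which is the point where release_task() would run.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: fork a child that exits
 * with 'code' and return the raw wait status the parent sees, or -1 on
 * error.
 */
int raw_wait_status(int code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0)
		_exit(code);	/* glibc's _exit() ends up in exit_group(code) */

	/* Reap the child; only now does the parent learn the exit status. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return status;
}
```

On Linux, raw_wait_status(1) should return 1 << 8, and WEXITSTATUS() on
that value gives back 1. This is the value that a pidfd opened without
PIDFD_THREAD would need to be able to report.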