On Tue, Dec 06, 2022 at 05:49:28PM +0100, Oleg Nesterov wrote:
> On 11/30, Eric W. Biederman wrote:
> >
> > 2) I keep thinking zap_pid_ns_processes() should be changed so that
> >    after it sends SIGKILL to all of the relevant processes to not wait,
>
> At least I think it should not wait for the tasks injected into this ns.
>
> Because this looks like a kernel bug even if we forget about this deadlock.
>
> Say we create a task P using clone(CLONE_NEWPID), then inject a task T into
> P's pid-namespace via setns/fork. This make the process P "unkillable", it
> will hang in zap_pid_ns_processes() "forever" until T->parent reaps a zombie
> task T killed by P.

I think this was done that way on purpose; see the comment in
zap_pid_ns_processes():

    /*
     * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
     * process whose parents processes are outside of the pid
     * namespace.  Such processes are created with setns()+fork().
     *
     * If those EXIT_ZOMBIE processes are not reaped by their
     * parents before their parents exit, they will be reparented
     * to pid_ns->child_reaper.  Thus pidns->child_reaper needs to
     * stay valid until they all go away.
     *
     * The code relies on the pid_ns->child_reaper ignoring
     * SIGCHILD to cause those EXIT_ZOMBIE processes to be
     * autoreaped if reparented.
     *
     * Semantically it is also desirable to wait for EXIT_ZOMBIE
     * processes before allowing the child_reaper to be reaped, as
     * that gives the invariant that when the init process of a
     * pid namespace is reaped all of the processes in the pid
     * namespace are gone.

I can't say I like the fact that a parent that does not belong to a new
namespace can create more than one child within that namespace, but anyway
this all looks like ABI that can't be reverted now.

Thanks.