On Wed, Dec 07, 2022 at 09:39:00PM +0100, Oleg Nesterov wrote:
> On 12/07, Frederic Weisbecker wrote:
> >
> > On Tue, Dec 06, 2022 at 05:49:28PM +0100, Oleg Nesterov wrote:
> > >
> > > At least I think it should not wait for the tasks injected into this ns.
> > >
> > > Because this looks like a kernel bug even if we forget about this deadlock.
> >
> > I think this was made that way on purpose,
>
> Well maybe. But to me we have this behaviour only because we (me at least)
> do not know how to avoid the "hang" in this case.
>
> > see the comment in zap_pid_ns_processes():
>
> Heh ;) I wrote this comment in a53b83154914 ("exit: pidns: fix/update the
> comments in zap_pid_ns_processes()") exactly because I didn't like this
> behaviour, but I thought it must be documented.

Bah! I should have guessed ;-)

> > I can't say I like the fact that a parent not belonging to a new namespace
> > can create more than one child within that namespace
>
> not sure I understand but this looks fine and useful to me,

I mean if only one task could be injected within a new namespace, we could
be sure that all subsequent tasks belonging to that namespace would be
descendants of that first task (the same way that every task in the default
namespace is a descendant of the real init_task) and thus we wouldn't be
bothered with such deadlocks.

But I guess namespaces aren't designed to work like that. I don't know much
about them so what I'm saying is very likely irrelevant.

> > but anyway this all look like an ABI that can't be reverted now.
>
> perhaps... But you know, I wrote my previous email because 2 weeks ago
> I had to investigate a bug report which blamed the kernel, while the
> problem (unkillable process sleeping in zap_pid_ns_processes) was caused
> by the dangling zombie injected into that process's namespace. And I am
> still trying to convince the customer they need to fix userspace.

Heh :-/

I wish we could fix this but I have no idea how.
I guess the child_reaper of an ns could avoid waiting for the rest of the ns
and designate its parent as the new child reaper. Or we could arrange for
all tasks in the ns to autoreap if they ever fall back to be reaped by their
ns->child_reaper and that child_reaper is dead. But that would look like ABI
breakages...

Thanks.