Daniel Lezcano <daniel.lezcano@xxxxxxx> writes:

> Eric W. Biederman wrote:

> If the normal rules of parentage apply, that means pid 0 has to wait for its
> child.
> If we are in the scenario of pid 0, its child pid 1234 and we kill the pid 1 of
> the pid namespace, I suppose pid 1234 will be killed too.
> The pid 0 will stay in the pid namespace and will be able to fork again a new
> pid 1.
>
> I think Serge already reported that...
>
> That sounds good :)

I expect zap_pid_ns_processes should also arrange so we cannot allocate
any more processes.  We certainly need to do something explicit or
pid 1 won't be allocated.

It might make sense to resurrect a pid namespace after its death but it
is definitely weird.

>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>> don't think anything I am doing fundamentally undermines it.  The use
>> case of doing things in fork is that there is automatic inheritance of
>> everything.  All of the namespaces and all of the control groups, and
>> possibly also the parent process.

> And also the rootfs for executing the command inside the container
> (e.g. shutdown), the uid/gid (if there is a user namespace), the mount points,
> ...
> But I suppose we can do the same with setns for all the namespaces and chrooting
> within the container rootfs.
>
> What I see is a problem with the tty. For example, we cloneat the init process
> of the container which is usually /sbin/init but this one has its tty mapped to
> /dev/console, so the output of the exec'ed command will go to the console.

My original thinking was that the fd's would come from the caller of
sys_cloneat....

>> Overall it sounds like the semantics I have proposed with
>> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
>> extra fork is a bit surprising but it certainly does not
>> look like a show stopper for implementing a pid namespace join.
>>
> I agree, it's some kind of "ghost" process.
> IMO, with a bit of userspace code it would be possible to enter or exec a
> command inside a container with nsfd, setns.
>
> +1 to test your patchset Eric :)

I will see about reposting sometime soon.

> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names should be
> clearer, no ?

Not bad suggestions.

I am going to explore a bit more.  Given that nsfd is using the same
permission checks as a proc file, I think I can just make it a proc
file.  Something like "/proc/<pid>/ns/net".  With a little luck that
won't suck too badly.

> Jumping back, one question about the nsfd and the poll for waiting the end of
> the namespace.
> If we have an opened file descriptor on a specific namespace, we grab a
> reference on this one, so the namespace won't be destroyed until we close the fd
> which is used to poll the end of the namespace, no ? Did I miss something ?

Not really.  The assumption was that there would be a very similar file
descriptor that we could use with poll.

Eric
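
To make the "/proc/<pid>/ns/net" idea concrete, here is a rough userspace sketch of joining another process's network namespace through such a proc file.  It assumes the interface ends up as a setns(fd, nstype) call on an fd opened from /proc/<pid>/ns/net, which is the shape later Linux kernels and glibc expose; nothing in this thread fixes that API.  The helper names ns_path() and join_netns() are made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "/proc/<pid>/ns/<name>" into buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
static int ns_path(pid_t pid, const char *name, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: move the calling thread into the network
 * namespace of `pid`.  Assumes setns(2) as found in later kernels;
 * the caller needs sufficient privilege (CAP_SYS_ADMIN there).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int join_netns(pid_t pid)
{
	char path[64];
	int fd, ret, saved;

	if (ns_path(pid, "net", path, sizeof(path)) < 0) {
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Holding this fd open pins the namespace, as discussed above. */
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	ret = setns(fd, CLONE_NEWNET);

	/* Preserve the errno from setns() across close(). */
	saved = errno;
	close(fd);
	errno = saved;

	return ret;
}
```

A caller would do something like join_netns(container_init_pid) and then exec the command it wants to run inside the container, where container_init_pid is whatever pid the caller already knows for the container's init.  Joining a pid namespace would still need the extra fork mentioned above.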