Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:
>
>>> 2 parallel enters? I meant you have pid 0 in the entered pid namespace.
>>> You have pid 0 because your pid simply does not map.
>>>
>> Oh, I see.
>>
>>> There is nothing that makes two parallel enters impossible in that.
>>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>>> which is pid 0.
>>>
>> How about the forked processes then? Who will be their parent?
>>
>
> The normal rules of parentage apply. So the child will simply
> see its parent as ppid == 0. If that child daemonizes it will become
> a child of the pid namespace's init.
>
> This is a lot like something that gets started from call_usermodehelper. Its
> parent process is not a descendant of init either.
>
> The implementation of the join is to simply change current->nsproxy->pid_ns.
> Then to use it you simply fork to get a child in the target pid namespace.
>

If the normal rules of parentage apply, that means pid 0 has to wait for
its child. In the scenario where we have pid 0 and its child pid 1234, and
we kill pid 1 of the pid namespace, I suppose pid 1234 will be killed too.
Pid 0 will stay in the pid namespace and will be able to fork a new pid 1
again. I think Serge already reported that...

That sounds good :)

>>> For the case of unshare where we are designed to be used with PAM I don't
>>> think my proposed semantics work. For a join, needing an extra fork before
>>> you are really in the pid namespace should be minor.
>>>
>> Hm... One more proposal - can we adopt the planned new fork_with_pids system
>> call to fork the process right into a new pid namespace?
>>
>
> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
> don't think anything I am doing fundamentally undermines it. The use
> case of doing things in fork is that there is automatic inheritance of
> everything. All of the namespaces and all of the control groups, and
> possibly also the parent process.

And also the rootfs for executing the command inside the container (e.g.
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces, plus a
chroot into the container rootfs.

What I do see is a problem with the tty. For example, we cloneat the init
process of the container, which is usually /sbin/init, but this one has its
tty mapped to /dev/console, so the output of the exec'ed command will go to
the console.

> It does have the high cost that the
> process we are copying from must be stopped because there are no locks
> that let us take everything. I haven't looked at the recent proposals
> to see if anyone has solved that problem cleanly.
>

Right.

> If we can do a sys_hijack/sys_cloneat style of join, that means we can
> afford a fork. At which point my proposed pid namespace semantics
> should be fine.
>
> aka:
> setns(NSTYPE_PID);
> pid = fork();
> if (pid == 0) {
>         getpid() == 2; /* Or whatever the first free pid is in the joined pid namespace */
>         getppid() == 0;
> } else {
>         pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
>         waitpid(pid);
> }
>
>>> That doesn't handle the case of cached struct pids. A good example is
>>> waitpid, where it waits for a specific struct pid. Which means that
>>> allocating a new struct pid and changing task->pid will cause
>>> waitpid(pid) to wait forever...
>>>
>> OK. Good example. Thanks.
>>
>>> To change struct pid would require the refcount on struct pid to show
>>> no references from anywhere except the task_struct.
>>>
>> I think it is OK to return -EBUSY for this. And fix the waitpid
>> accordingly so it does not block in this common case. All the others I
>> think can stay as they are.
>>
>
> That would probably work. setsid() and setpgrp() have similar sorts
> of restrictions. That is both more challenging and more limiting than
> the semantics that come out of my unshare(CLONE_NEWPID) patch.
> So I would prefer to keep this sort of thing as a last resort.
>
>>> At the cost of a little memory we can solve that problem for unshare
>>> if we have an extra upid in struct pid, how we verify there is space
>>> in struct pid I'm not certain.
>>>
>>> I do think that at least until someone calls exec the namespace pids that
>>> are reported to the process itself should not change. That is kill and
>>>
>> Wait a second - in that case the wait will be blocked too! No?
>>
>
> If all we do is populate an unused struct upid in struct pid there
> isn't a chance of a problem.
>
>>> waitpid etc. Which suggests an implementation the opposite of what
>>> I proposed. With ns_of_pid(task_pid(current)) being used as the
>>> pid namespace of children, and current->nsproxy->pid_ns not changing
>>> in the case of unshare.
>>>
>>> Shrug.
>>>
>>> Or perhaps this is a case where we can implement join with
>>> an extra process but we can't implement unshare, because the effect
>>> cannot be immediate.
>>>
>> Well, I'm talking only about the join now.
>>
>
> Overall it sounds like the semantics I have proposed with
> unshare(CLONE_NEWPID) are workable, and simple to implement. The
> extra fork is a bit surprising but it certainly does not
> look like a show stopper for implementing a pid namespace join.
>

I agree, it's some kind of "ghost" process. IMO, with a bit of userspace
code it would be possible to enter a container, or exec a command inside
it, with nsfd and setns.

+1 to test your patchset Eric :)

Just a mindless suggestion: wouldn't "nsopen" / "nsattach" be clearer
syscall names?

Jumping back, one question about the nsfd and the poll used to wait for the
end of the namespace. If we have an open file descriptor on a specific
namespace, we hold a reference on that namespace, so it won't be destroyed
until we close the very fd we are using to poll for its end, no? Did I miss
something?
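
To make the "bit of userspace code" above more concrete, here is a minimal
sketch of entering a pid namespace and exec'ing a command in it. It is only
an illustration under stated assumptions: it uses the setns(fd, nstype) call
and the /proc/<pid>/ns/pid files as found in current Linux kernels, not the
setns(NSTYPE_PID) / nsfd interface exactly as proposed in this thread, and
pid_ns_path() and run_in_pid_ns() are hypothetical helper names. It needs
CAP_SYS_ADMIN, and it does not join the mount namespace, chroot into the
container rootfs, or address the tty problem mentioned earlier.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: build the procfs path of the pid namespace that
 * process `target` lives in. Returns 0 on success, -1 if buf is too small. */
int pid_ns_path(pid_t target, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/pid", (long)target);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: join the pid namespace behind ns_path, fork, and
 * exec argv in the child. Returns the child's exit status, or -1 on error.
 * The caller keeps its own pid; only its children land in the target
 * namespace, which is the "extra fork" discussed in the thread. */
int run_in_pid_ns(const char *ns_path, char *const argv[])
{
    assert(ns_path != NULL && argv != NULL && argv[0] != NULL);

    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Affects the namespace of future children, not of the caller. */
    if (setns(fd, CLONE_NEWPID) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: now inside the target pid namespace; its parent is
         * outside, so getppid() is expected to return 0 here. */
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    /* Parent: `pid` is the child's pid as seen from the original namespace. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would fill a buffer with pid_ns_path() for the container's init
pid and pass it to run_in_pid_ns() together with the command to run. Note
that a successful setns() here is sticky: every later fork from the same
caller also lands in the target pid namespace.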

Thanks
  -- Daniel