Quoting Mike Waychison (mikew@xxxxxxxxxx):
> Serge E. Hallyn wrote:
>> Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx):
>>> Ok. I see what you are trying to accomplish with this and honestly I
>>> think it is silly.
>>>
>>> We should start the threads we need in the kernel, and if we need to
>>> run clone_pid fine. I am not comfortable exporting clone_with_pid to
>>> user space.
>>
>> Even if we create the task tree in userspace, I don't see why we
>> can't have the parent of each nested pid_ns pass CLONE_NEWPID to
>> clone_with_pid() instead of doing clone first and then unsharing
>> the pidns?
>>
>> As for clone_with_pid(), I don't particularly like the semantics,
>> but as was discussed over IRC, we could have clone_with_pid()
>> return -EINVAL unless it is called while it is called from a task
>> inside a restarting container. (and -EPERM if setting a pid in
>> a pid_ns which was not created as part of the container) Eric
>> do you dislike that any less?
>
> Wouldn't this mean the kernel would have to track which namespaces are
> part of a restart and which aren't? Seems a little kludgy to me.

Well, it could do that, which would be trivial since it knows the
hierarchy and knows which ns the init process of the whole restarted
container is in.

Or we can just, as I suggested before, tag the pid_ns with the uid of
the task which created it. Then a restart can theoretically specify a
pid_t in a clone_with_pid() for a pid_ns that existed before the
restart, but it can still only do so for pid namespaces which it "owns."

Whatever we do, we just need to make sure that an unprivileged task
can't manipulate a checkpoint image so as to pick an id in the
init_pid_ns, since, as Linus mentioned, that could exacerbate
vulnerabilities in poorly written userspace programs.

-serge
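
For illustration only, the checks discussed above might be modeled in
user space roughly as follows. This is a sketch, not kernel code: the
struct names, field names, and may_set_pid() are hypothetical, and
refusing the initial pid namespace outright is a conservative
assumption of the sketch rather than something settled in this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical user-space model; none of these names are kernel API. */
struct model_pid_ns {
    struct model_pid_ns *parent; /* NULL stands in for init_pid_ns */
    uid_t creator_uid;           /* uid of the task that created the ns */
};

struct model_task {
    uid_t uid;
    bool in_restart;             /* task is inside a restarting container */
};

/*
 * Returns 0 if the task may request a specific pid in ns,
 * otherwise a negative errno value.
 */
int may_set_pid(const struct model_task *t, const struct model_pid_ns *ns)
{
    if (!t->in_restart)
        return -EINVAL;          /* only meaningful during a restart */
    if (ns->parent == NULL)
        return -EPERM;           /* assumption: never pick ids in init_pid_ns */
    if (ns->creator_uid != t->uid)
        return -EPERM;           /* caller does not "own" this pid_ns */
    return 0;
}
```

With the uid tag, no separate list of namespaces belonging to a restart
is needed; the ownership comparison is the whole permission test.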