Re: [External] Re: [PATCH] pid_ns: support pidns switching between sibling

yunhui cui <cuiyunhui@xxxxxxxxxxxxx> · Fri, 13 Oct 2023 10:44:45 +0800

Hi Eric,

On Thu, Oct 12, 2023 at 11:31 AM Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
>
> Yunhui Cui <cuiyunhui@xxxxxxxxxxxxx> writes:
>
> > In the scenario of container acceleration,
>
> What is container acceleration?
>
> Are you perhaps performing what is essentially checkpoint/restart
> from one set of processes to a new set of processes so you can
> get a container starting faster?
Yes, you are right.

>
> > when a target pstree is cloned from a temp pstree, we hope that the
> > cloned process is inherently in the target's pid namespace.
>
> I am having a hard time figuring out what you are saying here.

I think I need to describe our requirements and the problem we face in
more detail.
What we need to do is fork a container into a new container, which means
that every process in the original container is forked, and each forked
child is then added to the namespaces and cgroup of the new container.

Here we are only talking about the pid namespace.

For example:
Assume that there are three processes A, B, and C in the original container.
What we need is for A to fork A_new, B to fork B_new, and C to fork C_new.

However, in the existing pidns implementation, the parent process first
joins the target pidns with setns(), and only the child it forks
afterwards is created in that pidns (the pid of the child process is
what we expect). The parent process's own pidns does not actually
change (at least its existing pid remains).
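
To make the point above concrete, here is a minimal sketch (not part of
the patch) of how a process could check that setns() changed only the
namespace for its future children. It assumes a kernel that exposes
/proc/self/ns/pid_for_children (Linux 4.12 or later); same_ns() is a
hypothetical helper name.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if both paths refer to the same object
 * (same device and inode number), 0 if they differ, -1 if stat() fails.
 * Two namespace files refer to the same namespace exactly when their
 * device and inode numbers match.
 */
int same_ns(const char *a, const char *b)
{
        struct stat sa, sb;

        if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
                return -1;
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After a successful setns(fd, CLONE_NEWPID) into a different pidns,
same_ns("/proc/self/ns/pid", "/proc/self/ns/pid_for_children") should
return 0, while getpid() in the caller still returns the same value as
before.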

To make A_new, B_new, and C_new start out in the pidns of the new container,
A, B, and C must first switch to the pidns of the new container, right?
From my understanding, there is no better way to implement this.

But the existing implementation (the part changed by this patch)
is blocking our progress.

>
> > Examples of what we expected:
> >
> > /* switch to target ns first. */
> > setns(target_ns, CLONE_NEWPID);
>   ^-------- Is this the line that fails for you?
>
> > if(!fork()) {
> > /* Child */
> > ...
> > }
> > /* switch back */
> > setns(temp_ns, CLONE_NEWPID);
>
> Assuming that the "switch back" means returning to your
> task_active_pid_ns that should always work.

In the scenario I described, "switch back" would certainly work.

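/* target_pid is a placeholder for a process in the new container. */
snprintf(path, sizeof(path), "/proc/%d/ns/pid", target_pid);
dst_pidns = open(path, O_RDONLY);
src_pidns = open("/proc/self/ns/pid", O_RDONLY);

setns(dst_pidns, CLONE_NEWPID);
if (!fork()) {
        /* Child */
        /* The child process is born in the pidns of the new container. */
        ...
}
/* switch back */
setns(src_pidns, CLONE_NEWPID);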
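
For reference, the same sequence with error handling could look roughly
like the sketch below. This is only an illustration under assumptions:
pidns_path(), fork_into_pidns(), target_pid and child_fn are hypothetical
names, and the caller is assumed to have CAP_SYS_ADMIN.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for setns() and CLONE_NEWPID */
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "/proc/<pid>/ns/pid" into buf.
 * Returns the path length, or -1 if buf is too small.
 */
int pidns_path(pid_t pid, char *buf, size_t len)
{
        int n = snprintf(buf, len, "/proc/%d/ns/pid", (int)pid);

        return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: fork one child into the pidns of target_pid, then
 * point the caller's pid_ns_for_children back at its own pidns.
 * Returns the value of fork() in the parent, or -1 on failure.
 */
pid_t fork_into_pidns(pid_t target_pid, void (*child_fn)(void))
{
        char path[64];
        int dst, src;
        pid_t pid = -1;

        if (pidns_path(target_pid, path, sizeof(path)) < 0)
                return -1;
        dst = open(path, O_RDONLY | O_CLOEXEC);
        if (dst < 0)
                return -1;
        src = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
        if (src < 0) {
                close(dst);
                return -1;
        }

        /* switch to the target pidns first */
        if (setns(dst, CLONE_NEWPID) == 0) {
                pid = fork();
                if (pid == 0) {
                        /* Child: born in the target pidns. */
                        child_fn();
                        _exit(0);
                }
                /* switch back */
                if (setns(src, CLONE_NEWPID) != 0)
                        perror("setns(src)");
        } else {
                perror("setns(dst)");
        }

        close(src);
        close(dst);
        return pid;
}
```

With the current pidns_install(), the first setns() in this sketch
succeeds only when the target is the caller's active pidns or a
descendant of it. For a sibling pidns it fails with EINVAL, which is the
case this patch is about.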

>
> If I had to guess I think what you are missing is that entire pid
> namespaces can be inside other pid namespaces.
>
> So there is no reason to believe that any random pid namespace
> that happens to pass the CAP_SYS_ADMIN permission check is also in
> your processes task_active_pid_ns.
>
>
> > However, it is limited by the existing implementation, CAP_SYS_ADMIN
> > has been checked in pidns_install(), so remove the limitation that only
> > by traversing parent can switch pidns.
>
> The check you are deleting is what verifies the pid namespaces you are
> attempting to change pid_ns_for_children to is a member of the tasks
> current pid namespace (aka task_active_pid_ns).
>
>
> There is a perfectly good comment describing why what you are attempting
> to do is unsupportable.
>
>         /*
>          * Only allow entering the current active pid namespace
>          * or a child of the current active pid namespace.
>          *
>          * This is required for fork to return a usable pid value and
>          * this maintains the property that processes and their
>          * children can not escape their current pid namespace.
>          */
>
>
> If you pick a pid namespace that does not meet the restrictions you are
> removing the pid of the new child can not be mapped into the pid
> namespace of the parent that called setns.
>
> AKA the following code can not work.
>
> pid = fork();
> if (!pid) {
>         /* child */
>         do_something();
>         _exit(0);
> }
> waitpid(pid);

Sorry, I don't understand what you mean here.

>
>
> So no.  The suggested change to pidns_install makes no sense at all.
>
> The whole not being able to escape your current pid namespace is
> also an important invariant when reasoning about pid namespaces.
>
> It would have to be a strong well thought out case for me to agree
> it makes sense to abandon the invariant that a process can not escape
> it's pid namespace.

I think we should first reach a good understanding of the problem we face,
and then think of a more comprehensive way to solve it.
Although the change in this patch is not perfect, is there a better way?

Thanks,
Yunhui