Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.

Oren Laadan <orenl@xxxxxxxxxxxxxxx> · Wed, 03 Mar 2010 15:59:05 -0500

Daniel Lezcano wrote:
Eric W. Biederman wrote:
Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:

Eric W. Biederman wrote:
Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:

Thanks. What's the problem with setns?
Joining a preexisting namespace is roughly the same problem as
unsharing a namespace.  We simply haven't figured out how to do it
safely for the pid and the uid namespaces.
The pid may change after this for sure. What problems do you know
about it? What if we try to allocate the same PID in a new space
or return -EBUSY? This will be a good starting point. If we manage
to fix it later this will not break the API at all.
Parentage.  The pid is the identity of a process and all kinds of things
make assumptions in all kinds of strange places.  I don't see how
waitpid can work if you change the pid.
Agreed. But what if we enter a pid space that is a subnamespace of the current
one? In that case the parent will still see the task by its old pid. We can restrict
the first version of entering with this rule as well, and this restriction will not
block us in the typical use case (I mean entering a container from a host).
The last time I was thinking about pid namespaces and unshare, the idea I came
to was that unshare of the pid namespace should only affect which pid namespace
your children are in.

I remember that to do that there were a few cases where you would have to access
task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
simple.
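
A minimal sketch of that idea as a user-space toy model, not kernel code.
The struct and function names below (struct model_task, model_unshare_pidns,
model_fork) are hypothetical; the two fields only mirror the distinction
drawn above between the namespace a task's own pid lives in and the
namespace recorded in its nsproxy.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pid namespace. */
struct model_pid_ns {
    int level;                      /* nesting depth, 0 for the initial ns */
};

/* Hypothetical, simplified stand-in for a task. */
struct model_task {
    /* Namespace the task's own pid lives in
     * (the role played by task->pid->pid_ns in the mail above). */
    struct model_pid_ns *active_ns;
    /* Namespace that newly forked children will be placed in
     * (the role played by task->nsproxy->pid_ns in the mail above). */
    struct model_pid_ns *ns_for_children;
};

/* Model of unshare: only the namespace used for future children changes.
 * The caller's own pid, and the namespace it lives in, stay untouched. */
static void model_unshare_pidns(struct model_task *t, struct model_pid_ns *fresh)
{
    fresh->level = t->ns_for_children->level + 1;
    t->ns_for_children = fresh;
}

/* Model of fork: the child lives in the parent's children-namespace. */
static struct model_task model_fork(const struct model_task *parent)
{
    struct model_task child = {
        parent->ns_for_children,    /* active_ns */
        parent->ns_for_children     /* ns_for_children */
    };
    return child;
}
```

In this model a task that calls model_unshare_pidns() keeps its active_ns,
so nothing that identifies it by pid changes; only tasks returned by a
later model_fork() land in the fresh namespace.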

glibc doesn't cope if you change someone's pid.
OK, but what if we try to allocate the same pid returning -EBUSY on failure?

My aim is to provide even a restricted enter. For most of the cases this
should work and make our lives easier. So two restrictions currently:
a) enter a sub namespace
b) allocate the same pid as we have now

Hm? :)
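
The two restrictions above can be sketched as a user-space toy model, not
kernel code. All names (struct model_pid_ns, model_enter, MODEL_MAX_PID)
are hypothetical. Returning -EBUSY for a taken pid follows the proposal
above; returning -EPERM for a target that is not a sub namespace, and
-EINVAL for an out-of-range pid, are assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PID 64            /* arbitrary small limit for the toy model */

/* Hypothetical, simplified stand-in for a pid namespace. */
struct model_pid_ns {
    struct model_pid_ns *parent;    /* NULL for the initial namespace */
    bool used[MODEL_MAX_PID];       /* pid numbers already taken here */
};

/* True if target is a strict descendant of cur. */
static bool model_is_descendant(const struct model_pid_ns *cur,
                                const struct model_pid_ns *target)
{
    const struct model_pid_ns *ns;

    for (ns = target->parent; ns != NULL; ns = ns->parent)
        if (ns == cur)
            return true;
    return false;
}

/* Try to enter target from cur while keeping the same pid number. */
static int model_enter(const struct model_pid_ns *cur,
                       struct model_pid_ns *target, int pid)
{
    if (pid <= 0 || pid >= MODEL_MAX_PID)
        return -EINVAL;             /* assumption: reject out-of-range pids */
    if (!model_is_descendant(cur, target))
        return -EPERM;              /* restriction (a); error code is an assumption */
    if (target->used[pid])
        return -EBUSY;              /* restriction (b): same pid is not free */
    target->used[pid] = true;       /* reserve the same number in the target */
    return 0;
}
```

With a host namespace and a container namespace whose parent is the host,
entering with a free pid succeeds, entering with a pid the container
already uses gives -EBUSY, and trying to enter the host from the container
is refused.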
Replacing struct pid is guaranteed to do all kinds of nasty things with
signal handling and the like; de_thread is nasty enough and you are talking
about something worse.  So if we can change pid namespaces without changing
the pid I am for it.

I agree with all the points you and Pavel talked about, but I don't
feel comfortable having the current process switch the pid namespace
because of the process tree hierarchy (what will be the parent of the
process when you enter the pid namespace, for example). What is the
difference from the sys_bindns or the sys_hijack proposed a couple of
years ago?

I made a suggestion some weeks ago about a new syscall 'cloneat' where
the child process becomes the child of the targeted process specified in
the syscall. Maybe it would be interesting to replace 'setns' with, or
add, a 'cloneat' syscall with the file descriptor passed as a parameter.
The copy_process function would not use the nsproxy of the caller but
the one provided in the fd argument.

The newly created process becomes the child of the process from which we
retrieve the namespace with nsfd, and that process has to 'waitpid' it (the
caller of 'cloneat' cannot wait on it). It is a bit similar to the
CLONE_PARENT flag, except the creation order is inverted (the father
creates for the child).
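
To illustrate why the caller could not wait on such a process, here is a
small sketch using only existing calls (fork, waitpid, kill, pause). It
does not implement 'cloneat', which is only a proposal in this thread; it
just shows that waitpid() on a process that is not the caller's own child
fails with ECHILD. The function name sibling_wait_errno is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a "sleeper" child, then fork a sibling that tries to waitpid()
 * on the sleeper. The sleeper is a child of the original process, not
 * of the sibling. Returns the errno the sibling observed, 0 if the
 * sibling's waitpid() did not fail, or -1 on setup failure.
 */
static int sibling_wait_errno(void)
{
    pid_t sleeper, sibling;
    int status = 0;
    int seen = -1;

    sleeper = fork();
    if (sleeper < 0)
        return -1;
    if (sleeper == 0) {
        pause();                    /* sleep until killed by the parent */
        _exit(0);
    }

    sibling = fork();
    if (sibling < 0) {
        kill(sleeper, SIGKILL);
        waitpid(sleeper, NULL, 0);
        return -1;
    }
    if (sibling == 0) {
        /* The sleeper is our sibling, not our child. */
        pid_t r = waitpid(sleeper, NULL, WNOHANG);
        _exit(r == -1 ? errno : 0); /* report errno via the exit status */
    }

    /* Only the real parent can reap both processes. */
    if (waitpid(sibling, &status, 0) == sibling && WIFEXITED(status))
        seen = WEXITSTATUS(status);
    kill(sleeper, SIGKILL);
    waitpid(sleeper, NULL, 0);
    return seen;
}
```

Calling sibling_wait_errno() is expected to return ECHILD, which is the
same position the caller of 'cloneat' would be in with respect to the
process it created on behalf of the target.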

So when entering the container, we specify the pid 1 of the container 
which is usually a child reaper.

Does it make sense ?

For what it's worth, I think that this suggestion (cloneat) is so far
the cleanest way to allow a process to enter an existing namespace.

Oren.
